In machine learning, success is measured not by how well you fit the training data, but by how well your model performs on unseen data. This distinction—seemingly obvious—leads to one of the most profound insights in statistical learning theory: a model that fits training data too perfectly will often generalize poorly.
This paradox, that more can be worse, underlies the entire theory of regularization. In this page, we explore why this happens, how regularization addresses it, and why regularization is not merely a practical trick but a fundamental necessity derived from the mathematics of learning.
By the end of this page, you will understand: (1) why overfitting occurs from a statistical perspective, (2) how hypothesis class complexity relates to generalization, (3) the mathematical tension between approximation and estimation error, (4) why regularization is a principled response to this tension, and (5) the multiple perspectives from which regularization can be understood.
The theoretical foundations of regularization trace back to Tikhonov's work on ill-posed problems in the 1960s, Vapnik and Chervonenkis's theory of generalization, and the bias-variance tradeoff formalized by Geman et al. (1992). Modern regularization theory synthesizes these ideas into a unified framework essential for understanding all modern machine learning.
We've encountered overfitting before, but now we approach it with the full machinery of statistical learning theory. Let us precisely define what overfitting means and why it occurs.
Recall the central objects of learning theory:
The data distribution $\mathcal{D}$: an unknown distribution over input-output pairs $(x, y)$.
The training set $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$: $n$ samples drawn i.i.d. from $\mathcal{D}$.
The true risk $R(h) = \mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(h(x), y)]$: the expected loss of hypothesis $h$ on the distribution.
The empirical risk $\hat{R}_S(h) = \frac{1}{n}\sum_{i=1}^n \ell(h(x_i), y_i)$: the average loss of $h$ on the training set.
Definition (Overfitting): A model $h$ trained on $S$ is said to overfit if: $$\hat{R}_S(h) \ll R(h)$$
That is, the training error is much lower than the true error. The model has learned patterns in $S$ that do not generalize to the broader distribution $\mathcal{D}$.
Overfitting (low training error, high test error) occurs when models are too complex. Underfitting (high training error, high test error) occurs when models are too simple. The goal of regularization is to navigate between these extremes, finding the optimal complexity level.
The generalization gap is the difference between true and empirical risk:
$$\text{Generalization Gap} = R(h_S) - \hat{R}_S(h_S)$$
where $h_S$ is the hypothesis selected by the learning algorithm on training set $S$.
A key insight from learning theory: the generalization gap depends on the complexity of the hypothesis class, not just the single hypothesis selected.
Even if $h_S$ appears simple, the fact that it was selected from a complex class $\mathcal{H}$ means the generalization gap can be large.
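As a concrete illustration, the gap can be estimated empirically by comparing training error with error on a large held-out sample. The following is a minimal NumPy sketch; the sinusoidal ground truth, noise level, sample sizes, and polynomial degree are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, noise=0.3):
    """Sample (x, y) pairs from a hypothetical sinusoidal ground truth plus noise."""
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(3 * x) + noise * rng.normal(size=n)
    return x, y

def mse(coeffs, x, y):
    """Mean squared error of a polynomial model given by its coefficients."""
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

x_train, y_train = make_data(15)        # small training set S
x_test, y_test = make_data(10_000)      # large held-out sample approximating D

# A high-degree polynomial fit drives training error far below the true error.
h_S = np.polyfit(x_train, y_train, deg=12)
train_err = mse(h_S, x_train, y_train)  # empirical risk  R_hat_S(h_S)
test_err = mse(h_S, x_test, y_test)     # estimate of the true risk  R(h_S)

print(f"train MSE = {train_err:.4f}, test MSE = {test_err:.4f}")
print(f"estimated generalization gap = {test_err - train_err:.4f}")
```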
Overfitting arises from a fundamental asymmetry:
Training data is finite: We observe only $n$ samples from an infinite distribution.
Hypothesis classes can be vast: A complex $\mathcal{H}$ may contain hypotheses that fit $S$ perfectly by "memorizing" the specific samples.
Empirical risk minimization (ERM) has no foresight: ERM finds $\hat{h} = \arg\min_{h \in \mathcal{H}} \hat{R}_S(h)$, treating all zero-training-error hypotheses as equally good.
Statistical fluctuations in $S$: The training set contains "noise" — random variations that don't reflect the true pattern.
A sufficiently expressive hypothesis class can fit this noise perfectly, mistaking random fluctuations for genuine patterns.
| Indicator | Description | Example |
|---|---|---|
| Training ≪ Test Error | Large gap between training and validation performance | 99% train accuracy, 70% test accuracy |
| High model complexity | Model has many more parameters than data points | Fitting 10 data points with degree-100 polynomial |
| Sensitivity to training data | Small changes in training set cause large changes in model | Adding one point drastically changes predictions |
| Extreme parameter values | Learned parameters are very large in magnitude | Polynomial coefficients in thousands |
| Memorization patterns | Model stores training examples rather than learning patterns | Near-perfect recall of training examples |
The bias-variance decomposition provides a precise mathematical characterization of why overfitting occurs and how regularization helps. This decomposition applies to regression with squared loss, but the intuitions extend broadly.
Consider regression where the true relationship is: $$y = f(x) + \varepsilon, \quad \mathbb{E}[\varepsilon] = 0, \quad \text{Var}(\varepsilon) = \sigma^2$$
where $f(x) = \mathbb{E}[Y|X=x]$ is the Bayes optimal predictor.
Let $\hat{f}_S(x)$ denote the model learned from training set $S$. Since $S$ is random, $\hat{f}_S$ is a random variable over the space of functions.
For squared loss, the expected prediction error at a point $x$ can be decomposed:
$$\mathbb{E}_S\left[(y - \hat{f}_S(x))^2\right] = \underbrace{\sigma^2}_{\text{Irreducible Error}} + \underbrace{\left(f(x) - \mathbb{E}_S[\hat{f}_S(x)]\right)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}_S\left[\left(\hat{f}_S(x) - \mathbb{E}_S[\hat{f}_S(x)]\right)^2\right]}_{\text{Variance}}$$
This can be written more concisely as:
$$\text{Expected Error} = \text{Noise} + \text{Bias}^2 + \text{Variance}$$
Noise ($\sigma^2$): Irreducible error from inherent randomness in the data. No algorithm can eliminate this.
Bias²: Error from the difference between the average prediction and the true function. High bias means the model is systematically wrong.
Variance: Error from sensitivity to the particular training set. High variance means the model changes dramatically with different samples.
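The three terms can be estimated numerically by Monte Carlo: repeatedly draw a fresh training set, refit the model, and record its prediction at a fixed query point. The sketch below does this for polynomial regression; the true function, noise level, sample size, and degrees are illustrative assumptions, not part of any particular dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.3                          # assumed noise standard deviation
x0 = 0.5                             # fixed query point

def f(x):
    """Assumed true regression function f(x) (illustrative)."""
    return np.sin(3 * x)

def fit_once(degree, n=20):
    """Draw a fresh training set S and return the fitted prediction at x0."""
    x = rng.uniform(-1, 1, size=n)
    y = f(x) + sigma * rng.normal(size=n)
    coeffs = np.polyfit(x, y, deg=degree)
    return np.polyval(coeffs, x0)

for degree in (1, 3, 12):
    preds = np.array([fit_once(degree) for _ in range(2000)])
    bias_sq = (preds.mean() - f(x0)) ** 2         # (E_S[f_hat_S(x0)] - f(x0))^2
    variance = preds.var()                        # E_S[(f_hat_S(x0) - E_S[f_hat_S(x0)])^2]
    expected_err = sigma**2 + bias_sq + variance  # noise + bias^2 + variance
    print(f"degree {degree:2d}: bias^2 = {bias_sq:.4f}, "
          f"variance = {variance:.4f}, expected error = {expected_err:.4f}")
```

Simple models (degree 1) show large bias and small variance; very flexible models (degree 12) show the reverse, matching the table that follows.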
The profound insight is that bias and variance are typically in tension:
| Model Complexity | Bias | Variance | Typical Behavior |
|---|---|---|---|
| Very simple (e.g., constant) | High | Low | Underfits: same prediction regardless of $x$ |
| Moderate | Medium | Medium | Sweet spot: captures pattern, resists noise |
| Very complex (e.g., high-degree polynomial) | Low | High | Overfits: captures noise as if it were pattern |
Why the tradeoff exists:
Simple models can't represent complex true functions (high bias), but their simplicity means different training sets yield similar models (low variance).
Complex models can represent almost any function (low bias), but they're sensitive to noise in each training set (high variance).
The bias-variance decomposition explains the characteristic "U-shaped" curve of test error as a function of model complexity:
Low complexity: High bias dominates → high test error (underfitting)
Optimal complexity: Bias and variance balanced → minimum test error
High complexity: High variance dominates → high test error (overfitting)
Note that training error decreases monotonically with complexity (more complex models fit training data better), while test error exhibits the U-shape.
The gap between training and test error grows with complexity — this is exactly the generalization gap, and regularization controls it.
The exact bias-variance decomposition holds for squared loss. For other losses (0-1 loss, cross-entropy), the decomposition takes different forms, but the fundamental insight remains: model complexity creates a tradeoff between underfitting (too simple) and overfitting (too complex).
Another lens on the same phenomenon decomposes the excess risk into approximation error and estimation error. This perspective is particularly useful in statistical learning theory.
Let $h^* \in \mathcal{H}$ be the best hypothesis in our class: $$h^* = \arg\min_{h \in \mathcal{H}} R(h)$$
Let $h^{\text{Bayes}}$ be the Bayes optimal predictor (best among all measurable functions).
Let $\hat{h}_S$ be the hypothesis learned from training data $S$.
The excess risk of $\hat{h}_S$ over Bayes optimal decomposes as:
$$R(\hat{h}_S) - R(h^{\text{Bayes}}) = \underbrace{\left(R(h^*) - R(h^{\text{Bayes}})\right)}_{\text{Approximation Error}} + \underbrace{\left(R(\hat{h}_S) - R(h^*)\right)}_{\text{Estimation Error}}$$
This decomposition reveals the fundamental tension in learning:
Larger $\mathcal{H}$: lower approximation error (the class is more likely to contain a hypothesis close to Bayes optimal), but higher estimation error (with finite data it is harder to identify a near-best hypothesis within the class).
Smaller $\mathcal{H}$: higher approximation error (even the best hypothesis in the class may be far from Bayes optimal), but lower estimation error (finite data suffices to find a near-best hypothesis).
This is the bias-complexity tradeoff restated: rich hypothesis classes have low approximation error but high estimation error; restricted classes have high approximation error but low estimation error.
Regularization addresses this by effectively reducing the complexity of $\mathcal{H}$ — trading increased approximation error for reduced estimation error.
Both the bias-variance and approximation-estimation decompositions describe the same phenomenon from different angles. Bias ≈ approximation error (model class too restricted), Variance ≈ estimation error (insufficient data for the complexity). Regularization controls the variance/estimation component by constraining the effective hypothesis class.
Empirical Risk Minimization (ERM) is the natural approach: find the hypothesis that minimizes training error:
$$\hat{h}_{\text{ERM}} = \arg\min_{h \in \mathcal{H}} \hat{R}_S(h)$$
ERM is intuitive and computationally natural, but ERM alone can fail catastrophically when $\mathcal{H}$ is sufficiently expressive.
Consider what happens when $\mathcal{H}$ is very rich (e.g., all measurable functions, or neural networks with unlimited capacity):
Zero training error is always achievable: There exist functions that exactly interpolate any training set.
Many such functions exist: Infinitely many hypotheses achieve zero training error.
ERM provides no guidance among them: All zero-training-error hypotheses are equally preferred by ERM.
Most such hypotheses generalize poorly: Random interpolations of noise have high true risk.
Consider fitting a polynomial to $n$ data points:
Data: $(x_1, y_1), \ldots, (x_n, y_n)$ with $n=10$ points
Hypothesis class: Polynomials of degree $d$
| Degree $d$ | Parameters | Training Error | Typical Test Error |
|---|---|---|---|
| 1 (linear) | 2 | High | Moderate |
| 3 (cubic) | 4 | Moderate | Low |
| 9 | 10 | Zero | Very high |
| 50 | 51 | Zero | Extremely high |
With degree $\geq n-1$, we can exactly interpolate $n$ points. But these interpolating polynomials oscillate wildly between data points, producing absurd predictions.
When a model can perfectly fit the training data, ERM loses its discriminative power. All interpolating solutions have equal empirical risk (zero), so ERM cannot distinguish between genuinely good hypotheses and those that merely memorize noise. This is why additional structure—regularization—is essential.
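A small simulation makes this concrete. The sketch below fits polynomials of increasing degree to $n = 10$ synthetic points; the degree-9 fit interpolates the training data exactly yet performs far worse off-sample. The data-generating function and noise level are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10
x_train = np.linspace(-1, 1, n)
y_train = np.sin(3 * x_train) + 0.2 * rng.normal(size=n)

x_test = np.linspace(-1, 1, 500)
y_test = np.sin(3 * x_test)          # noiseless ground truth used for evaluation

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # Degree 9 interpolates all 10 points (training error is numerically zero)
    # yet oscillates between them, so its test error is typically far larger.
    print(f"degree {degree}: train MSE = {train_mse:.2e}, test MSE = {test_mse:.2e}")
```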
Recall that the VC dimension $d_{VC}(\mathcal{H})$ measures the complexity of a hypothesis class. Generalization bounds tell us:
$$R(h) \leq \hat{R}_S(h) + O\left(\sqrt{\frac{d_{VC}(\mathcal{H})}{n}}\right)$$
For ERM with large $d_{VC}$, the complexity term dominates and the bound becomes loose or even vacuous: achieving zero training error then provides no guarantee that the true risk is small.
Regularization effectively reduces $d_{VC}$, tightening the generalization bound and ensuring low empirical risk actually implies low true risk.
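To get a feel for how the capacity term scales, the following sketch evaluates the $\sqrt{d_{VC}/n}$ factor for a few hypothetical capacity and sample-size combinations, ignoring the constants and logarithmic factors that appear in the full bound.

```python
import math

def complexity_term(d_vc, n):
    """Order of magnitude of the capacity term sqrt(d_VC / n), constants omitted."""
    return math.sqrt(d_vc / n)

for d_vc in (10, 100, 10_000):
    for n in (1_000, 100_000):
        print(f"d_VC = {d_vc:6d}, n = {n:7d}: bound slack ~ {complexity_term(d_vc, n):.3f}")
```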
Since ERM alone is insufficient, we need additional mechanisms to control generalization:
Option 1: Restrict $\mathcal{H}$ explicitly, for example by allowing only low-degree polynomials or small networks. This controls complexity but is rigid: the right restriction is hard to choose in advance and may exclude good hypotheses.
Option 2: Regularize the optimization, keeping a rich $\mathcal{H}$ but penalizing complex hypotheses so that the learner prefers simpler ones.
Option 3: Use more data, which shrinks the complexity term in the generalization bound; in practice, additional data is often expensive or unavailable.
Regularization (Option 2) is the practical, principled solution. It allows using expressive hypothesis classes while controlling overfitting through mathematical constraints.
Having established why we need additional structure beyond ERM, let us formally define regularization and its key properties.
Regularization is any modification to the learning algorithm that is intended to reduce generalization error but not training error.
More specifically, regularization replaces pure ERM: $$\hat{h} = \arg\min_{h \in \mathcal{H}} \hat{R}_S(h)$$
with regularized ERM: $$\hat{h} = \arg\min_{h \in \mathcal{H}} \left[ \hat{R}_S(h) + \lambda \Omega(h) \right]$$
where $\Omega: \mathcal{H} \to \mathbb{R}_{\geq 0}$ is the regularizer, a function that assigns larger values to more "complex" hypotheses, and $\lambda \geq 0$ is a hyperparameter controlling the strength of the penalty.
Regularization introduces a preference among hypotheses beyond their training error. A hypothesis that fits training data slightly worse but is "simpler" (lower Ω(h)) may be preferred over one that fits perfectly but is complex. This preference toward simplicity embodies Occam's Razor mathematically.
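For linear hypotheses with squared loss and an L2 regularizer, regularized ERM has a closed-form solution, which makes the effect of the penalty easy to inspect. The sketch below uses the sum-of-squares convention $\|Xw - y\|^2 + \lambda\|w\|^2$; the data-generating setup (more parameters than samples) is invented for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||Xw - y||^2 + lam * ||w||^2 (regularized ERM with an L2 penalty)."""
    d = X.shape[1]
    # Closed-form minimizer: w = (X^T X + lam * I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(3)
n, d = 30, 60                                 # more parameters than samples
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = (2.0, -1.0, 0.5)                 # only a few truly relevant features
y = X @ w_true + 0.1 * rng.normal(size=n)

for lam in (1e-6, 1.0, 100.0):
    w = ridge_fit(X, y, lam)
    print(f"lambda = {lam:8.2g}: ||w|| = {np.linalg.norm(w):8.3f}, "
          f"train MSE = {np.mean((X @ w - y) ** 2):.4f}")
```

With a tiny $\lambda$ the fit drives training error to nearly zero using very large weights; increasing $\lambda$ trades a little training error for much smaller, better-behaved weights.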
An effective regularizer $\Omega(h)$ should:
Penalize complexity: Higher values for hypotheses likely to overfit
Be computationally tractable: Add minimal overhead to optimization
Be meaningful for the problem: Capture relevant notions of simplicity
Have tunable strength: $\lambda$ controls the bias-variance tradeoff
Preserve key solutions: Still allow good approximations when $\lambda$ is small
| Regularizer | Formula | Effect | When to Use |
|---|---|---|---|
| L2 (Ridge) | $\Omega(w) = \Vert w\Vert_2^2$ | Shrinks all weights toward zero | Default choice, prevents large weights |
| L1 (Lasso) | $\Omega(w) = \Vert w\Vert_1$ | Drives some weights to exactly zero | Feature selection, sparse models |
| Elastic Net | $\alpha\Vert w\Vert_1 + (1-\alpha)\Vert w\Vert_2^2$ | Combines L1 and L2 effects | Correlated features, need sparsity + stability |
| Dropout | Random neuron masking | Ensemble averaging effect | Deep neural networks |
| Early Stopping | Stop before convergence | Limits effective model complexity | When validation error increases |
| Data Augmentation | Expand training set | Reduces variance, adds invariances | When domain knowledge suggests augmentations |
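The contrast between L1 and L2 is clearest in one dimension: the L2 penalty shrinks every weight by a constant factor, while the L1 penalty subtracts a constant amount and clips at zero (soft-thresholding), which is why it produces exact zeros. A minimal sketch of the two proximal updates, with an arbitrary weight vector and penalty strength chosen for illustration:

```python
import numpy as np

def prox_l2(w, lam):
    """Proximal operator of (lam/2) * w^2: uniform shrinkage toward zero, never exactly zero."""
    return w / (1.0 + lam)

def prox_l1(w, lam):
    """Proximal operator of lam * |w|: soft-thresholding; weights smaller than lam become exactly zero."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w = np.array([3.0, 0.8, 0.3, -0.05, -2.0])
lam = 0.5
print("L2 shrinkage:      ", prox_l2(w, lam))
print("L1 soft-threshold: ", prox_l1(w, lam))
```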
The hyperparameter $\lambda$ controls the tradeoff:
$\lambda = 0$: Pure ERM, no regularization
$\lambda \to \infty$: Regularization dominates
Optimal $\lambda$: Balances fit and complexity
The existence of an optimal intermediate $\lambda$ reflects the bias-variance tradeoff: too little regularization → high variance; too much → high bias.
Regularization can be understood from multiple complementary perspectives. Each provides different insights and suggests different algorithms.
Regularization as a constrained optimization problem:
$$\min_{h \in \mathcal{H}} \hat{R}_S(h) \quad \text{subject to} \quad \Omega(h) \leq C$$
This is equivalent (for appropriate $C$ and $\lambda$) to the penalized form:
$$\min_{h \in \mathcal{H}} \left[ \hat{R}_S(h) + \lambda \Omega(h) \right]$$
Interpretation: We seek the best fit among hypotheses with bounded complexity. The constraint $\Omega(h) \leq C$ defines a "capacity budget."
Practical implication: This view is useful for understanding geometric properties (constraint set shapes) and for algorithms like projected gradient descent.
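As a sketch of how the constraint view translates into an algorithm, the following implements projected gradient descent for least-squares loss under an L2-ball constraint $\|w\|_2 \leq C$: take a gradient step, then project back onto the ball. The data, step size, and values of $C$ are illustrative choices.

```python
import numpy as np

def project_l2_ball(w, C):
    """Project w onto the constraint set {w : ||w||_2 <= C}."""
    norm = np.linalg.norm(w)
    return w if norm <= C else w * (C / norm)

def projected_gradient_descent(X, y, C, lr=0.01, steps=500):
    """Minimize the mean squared error subject to ||w||_2 <= C."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2.0 / len(y) * X.T @ (X @ w - y)   # gradient of the mean squared error
        w = project_l2_ball(w - lr * grad, C)     # gradient step, then projection
    return w

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=50)

for C in (0.5, 2.0, 10.0):
    w = projected_gradient_descent(X, y, C)
    print(f"C = {C:5.1f}: ||w|| = {np.linalg.norm(w):.3f}")
```

Smaller capacity budgets $C$ yield smaller-norm solutions, mirroring the effect of a larger $\lambda$ in the penalized form.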
Regularization as maximum a posteriori (MAP) estimation with a prior:
$$\hat{h}_{\text{MAP}} = \arg\max_{h} \left[ P(S \mid h) \cdot P(h) \right]$$
Taking negative log and assuming Gaussian likelihood:
$$\hat{h}_{\text{MAP}} = \arg\min_{h} \left[ -\log P(S \mid h) - \log P(h) \right]$$
With appropriate priors, familiar regularizers emerge: a zero-mean Gaussian prior on the weights yields an L2 penalty, and a Laplace prior yields an L1 penalty.
Interpretation: Regularization encodes prior beliefs about which hypotheses are more plausible. Simpler hypotheses have higher prior probability.
Practical implication: This view connects to Bayesian inference, uncertainty quantification, and hierarchical models.
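To make the correspondence explicit for a linear model with weights $w$, assume a Gaussian likelihood with noise variance $\sigma^2$ and an independent zero-mean Gaussian prior on the weights with variance $\tau^2$. Dropping constants independent of $w$ and rescaling by $2\sigma^2$ (which does not change the minimizer) gives:

$$\begin{aligned}
\hat{w}_{\text{MAP}}
&= \arg\min_{w}\ \Big[ -\log P(S \mid w) - \log P(w) \Big] \\
&= \arg\min_{w}\ \Big[ \tfrac{1}{2\sigma^2} \textstyle\sum_{i=1}^{n} \big(y_i - w^\top x_i\big)^2 + \tfrac{1}{2\tau^2} \Vert w \Vert_2^2 \Big] \\
&= \arg\min_{w}\ \Big[ \textstyle\sum_{i=1}^{n} \big(y_i - w^\top x_i\big)^2 + \lambda \Vert w \Vert_2^2 \Big],
\qquad \lambda = \frac{\sigma^2}{\tau^2}
\end{aligned}$$

Replacing the Gaussian prior with a Laplace prior and repeating the same steps yields the L1 penalty.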
Regularization as controlling effective model capacity:
Even if $\mathcal{H}$ is infinite, regularization restricts the search to an effective hypothesis class $\mathcal{H}_{\text{eff}}(\lambda)$:
$$\mathcal{H}_{\text{eff}}(\lambda) = \{h \in \mathcal{H} : \Omega(h) \leq C(\lambda)\}$$
As $\lambda$ increases, $\mathcal{H}_{\text{eff}}$ shrinks, reducing effective VC dimension and improving generalization bounds.
Interpretation: We use a rich hypothesis class but explore only a "small" (low-complexity) part of it.
Practical implication: This justifies using neural networks with millions of parameters — regularization ensures we don't actually use all that capacity.
These three views are mathematically equivalent under appropriate conditions, but they suggest different extensions: the constraint view inspires optimization algorithms, the Bayesian view suggests uncertainty quantification, and the capacity view connects to generalization theory. A complete understanding leverages all three perspectives.
Structural Risk Minimization (SRM), introduced by Vapnik, provides a principled framework for regularization based directly on generalization bounds.
SRM is based on the bound: $$R(h) \leq \hat{R}_S(h) + \text{Complexity}(\mathcal{H}, n, \delta)$$
where the complexity term depends on the VC dimension or other complexity measures.
SRM approach: define a nested hierarchy of hypothesis classes $\mathcal{H}_1 \subset \mathcal{H}_2 \subset \cdots$ of increasing complexity, run ERM within each class to obtain $\hat{h}_k$, and select the index $k$ that minimizes the bound (empirical risk plus complexity term) rather than the empirical risk alone.
SRM and regularization are deeply connected: regularization implements SRM implicitly, with the penalty playing the role of the complexity term. Compare the two objectives:
The SRM objective: $$\min_k \left[\hat{R}_S(\hat{h}_k) + \text{Complexity}(k)\right]$$
The regularization objective: $$\min_h \left[\hat{R}_S(h) + \lambda \Omega(h)\right]$$
With appropriate correspondence between $\lambda$ and the hierarchy index $k$, these are equivalent approaches to the same goal: balancing data fit with model complexity.
While the theoretical SRM framework uses generalization bounds to select complexity, in practice we typically use cross-validation to select regularization strength. Cross-validation empirically estimates the bound without computing it analytically, and often performs better because actual generalization bounds can be loose.
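A minimal sketch of this practice: select $\lambda$ for ridge regression by K-fold cross-validation, choosing the value with the lowest average validation error. The grid of $\lambda$ values, the fold count, and the synthetic data are arbitrary illustrative choices.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form minimizer of ||Xw - y||^2 + lam * ||w||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_error(X, y, lam, k=5):
    """Average validation MSE of ridge regression over k folds."""
    n = len(y)
    folds = np.array_split(np.random.default_rng(0).permutation(n), k)
    errors = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), val_idx)
        w = ridge_fit(X[train_idx], y[train_idx], lam)
        errors.append(np.mean((X[val_idx] @ w - y[val_idx]) ** 2))
    return np.mean(errors)

rng = np.random.default_rng(5)
n, d = 80, 40
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d) * (rng.random(d) < 0.2)   # mostly-zero ground truth
y = X @ w_true + 0.5 * rng.normal(size=n)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = {lam: cv_error(X, y, lam) for lam in lambdas}
for lam in lambdas:
    print(f"lambda = {lam:7.2f}: CV MSE = {scores[lam]:.4f}")
print(f"selected lambda = {min(scores, key=scores.get)}")
```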
Polynomial regression: restrict the degree, or keep a high degree but penalize the magnitude of the coefficients (ridge or lasso on the polynomial weights).
Neural networks: weight decay (an L2 penalty on the weights), dropout, early stopping, and data augmentation.
Decision trees: limit the maximum depth, require a minimum number of samples per leaf, or prune the tree after growing it.
We have established the fundamental motivation for regularization from multiple theoretical perspectives. Let us consolidate the key insights:
Overfitting arises because finite training data cannot distinguish genuine patterns from noise when the hypothesis class is rich; the generalization gap depends on the complexity of the class, not just the selected hypothesis.
The bias-variance and approximation-estimation decompositions describe the same tradeoff from different angles: model complexity must be matched to the available data.
ERM alone fails in the interpolation regime, where many hypotheses achieve zero training error and ERM cannot prefer among them.
Regularization restores this preference by adding a complexity penalty (equivalently, a constraint, a prior, or a reduction in effective capacity), trading a small increase in bias for a large reduction in variance.
With the motivation established, the following pages examine regularization from specific perspectives in depth.
Each perspective deepens our understanding and suggests different practical approaches.
You now understand the fundamental motivation for regularization: the bias-variance and approximation-estimation tradeoffs, why ERM alone fails, and how regularization provides the additional structure necessary for generalization. This foundation supports the detailed analysis of specific regularization approaches in subsequent pages.