In machine learning, success is measured not by how well you fit the training data, but by how well your model performs on unseen data. This distinction—seemingly obvious—leads to one of the most profound insights in statistical learning theory: a model that fits training data too perfectly will often generalize poorly.
This paradox, that more can be worse, underlies the entire theory of regularization. In this page, we explore why this happens, how regularization addresses it, and why regularization is not merely a practical trick but a fundamental necessity derived from the mathematics of learning.
By the end of this page, you will understand: (1) why overfitting occurs from a statistical perspective, (2) how hypothesis class complexity relates to generalization, (3) the mathematical tension between approximation and estimation error, (4) why regularization is a principled response to this tension, and (5) the multiple perspectives from which regularization can be understood.
The theoretical foundations of regularization trace back to Tikhonov's work on ill-posed problems in the 1960s, Vapnik and Chervonenkis's theory of generalization, and the bias-variance tradeoff formalized by Geman et al. (1992). Modern regularization theory synthesizes these ideas into a unified framework essential for understanding all modern machine learning.
We've encountered overfitting before, but now we approach it with the full machinery of statistical learning theory. Let us precisely define what overfitting means and why it occurs.
Recall the central objects of learning theory:
The data distribution $\mathcal{D}$: an unknown distribution over input-output pairs $(x, y)$.
The training set $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$: $n$ samples drawn i.i.d. from $\mathcal{D}$.
The true risk $R(h) = \mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(h(x), y)]$: the expected loss of hypothesis $h$ on the distribution.
The empirical risk $\hat{R}_S(h) = \frac{1}{n}\sum_{i=1}^n \ell(h(x_i), y_i)$: the average loss of $h$ on the training set.
Definition (Overfitting): A model $h$ trained on $S$ is said to overfit if: $$\hat{R}_S(h) \ll R(h)$$
That is, the training error is much lower than the true error. The model has learned patterns in $S$ that do not generalize to the broader distribution $\mathcal{D}$.
Overfitting (low training error, high test error) occurs when models are too complex. Underfitting (high training error, high test error) occurs when models are too simple. The goal of regularization is to navigate between these extremes, finding the optimal complexity level.
The generalization gap is the difference between true and empirical risk:
$$\text{Generalization Gap} = R(h_S) - \hat{R}_S(h_S)$$
where $h_S$ is the hypothesis selected by the learning algorithm on training set $S$.
A key insight from learning theory: the generalization gap depends on the complexity of the hypothesis class, not just the single hypothesis selected.
Even if $h_S$ appears simple, the fact that it was selected from a complex class $\mathcal{H}$ means the generalization gap can be large.
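As a concrete illustration, the gap can be estimated empirically by comparing training error with error on a large held-out sample. The following is a minimal NumPy sketch; the sinusoidal ground truth, noise level, sample sizes, and polynomial degree are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, noise=0.3):
    """Sample (x, y) pairs from a hypothetical sinusoidal ground truth plus noise."""
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(3 * x) + noise * rng.normal(size=n)
    return x, y

def mse(coeffs, x, y):
    """Mean squared error of a polynomial model given by its coefficients."""
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

x_train, y_train = make_data(15)        # small training set S
x_test, y_test = make_data(10_000)      # large held-out sample approximating D

# A high-degree polynomial fit drives training error far below the true error.
h_S = np.polyfit(x_train, y_train, deg=12)
train_err = mse(h_S, x_train, y_train)  # empirical risk  R_hat_S(h_S)
test_err = mse(h_S, x_test, y_test)     # estimate of the true risk  R(h_S)

print(f"train MSE = {train_err:.4f}, test MSE = {test_err:.4f}")
print(f"estimated generalization gap = {test_err - train_err:.4f}")
```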
Overfitting arises from a fundamental asymmetry:
Training data is finite: We observe only $n$ samples from an infinite distribution.
Hypothesis classes can be vast: A complex $\mathcal{H}$ may contain hypotheses that fit $S$ perfectly by "memorizing" the specific samples.
Empirical risk minimization (ERM) has no foresight: ERM finds $\hat{h} = \arg\min_{h \in \mathcal{H}} \hat{R}_S(h)$, treating all zero-training-error hypotheses as equally good.
Statistical fluctuations in $S$: The training set contains "noise" — random variations that don't reflect the true pattern.
A sufficiently expressive hypothesis class can fit this noise perfectly, mistaking random fluctuations for genuine patterns.
| Indicator | Description | Example |
|---|---|---|
| Training ≪ Test Error | Large gap between training and validation performance | 99% train accuracy, 70% test accuracy |
| High model complexity | Model has many more parameters than data points | Fitting 10 data points with degree-100 polynomial |
| Sensitivity to training data | Small changes in training set cause large changes in model | Adding one point drastically changes predictions |
| Extreme parameter values | Learned parameters are very large in magnitude | Polynomial coefficients in thousands |
| Memorization patterns | Model stores training examples rather than learning patterns | Near-perfect recall of training examples |
The bias-variance decomposition provides a precise mathematical characterization of why overfitting occurs and how regularization helps. This decomposition applies to regression with squared loss, but the intuitions extend broadly.
Consider regression where the true relationship is: $$y = f(x) + \varepsilon, \quad \mathbb{E}[\varepsilon] = 0, \quad \text{Var}(\varepsilon) = \sigma^2$$
where $f(x) = \mathbb{E}[Y|X=x]$ is the Bayes optimal predictor.
Let $\hat{f}_S(x)$ denote the model learned from training set $S$. Since $S$ is random, $\hat{f}_S$ is a random variable over the space of functions.
For squared loss, the expected prediction error at a point $x$ can be decomposed:
$$\mathbb{E}_S\left[(y - \hat{f}_S(x))^2\right] = \underbrace{\sigma^2}_{\text{Irreducible Error}} + \underbrace{\left(f(x) - \mathbb{E}_S[\hat{f}_S(x)]\right)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}_S\left[\left(\hat{f}_S(x) - \mathbb{E}_S[\hat{f}_S(x)]\right)^2\right]}_{\text{Variance}}$$
This can be written more concisely as:
$$\text{Expected Error} = \text{Noise} + \text{Bias}^2 + \text{Variance}$$
Noise ($\sigma^2$): Irreducible error from inherent randomness in the data. No algorithm can eliminate this.
Bias²: Error from the difference between the average prediction and the true function. High bias means the model is systematically wrong.
Variance: Error from sensitivity to the particular training set. High variance means the model changes dramatically with different samples.
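The three terms can be estimated numerically by Monte Carlo: repeatedly draw a fresh training set, refit the model, and record its prediction at a fixed query point. The sketch below does this for polynomial regression; the true function, noise level, sample size, and degrees are illustrative assumptions, not part of any particular dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.3                          # assumed noise standard deviation
x0 = 0.5                             # fixed query point

def f(x):
    """Assumed true regression function f(x) (illustrative)."""
    return np.sin(3 * x)

def fit_once(degree, n=20):
    """Draw a fresh training set S and return the fitted prediction at x0."""
    x = rng.uniform(-1, 1, size=n)
    y = f(x) + sigma * rng.normal(size=n)
    coeffs = np.polyfit(x, y, deg=degree)
    return np.polyval(coeffs, x0)

for degree in (1, 3, 12):
    preds = np.array([fit_once(degree) for _ in range(2000)])
    bias_sq = (preds.mean() - f(x0)) ** 2         # (E_S[f_hat_S(x0)] - f(x0))^2
    variance = preds.var()                        # E_S[(f_hat_S(x0) - E_S[f_hat_S(x0)])^2]
    expected_err = sigma**2 + bias_sq + variance  # noise + bias^2 + variance
    print(f"degree {degree:2d}: bias^2 = {bias_sq:.4f}, "
          f"variance = {variance:.4f}, expected error = {expected_err:.4f}")
```

Simple models (degree 1) show large bias and small variance; very flexible models (degree 12) show the reverse, matching the table that follows.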
The profound insight is that bias and variance are typically in tension:
| Model Complexity | Bias | Variance | Typical Behavior |
|---|---|---|---|
| Very simple (e.g., constant) | High | Low | Underfits: same prediction regardless of $x$ |
| Moderate | Medium | Medium | Sweet spot: captures pattern, resists noise |
| Very complex (e.g., high-degree polynomial) | Low | High | Overfits: captures noise as if it were pattern |
Why the tradeoff exists:
Simple models can't represent complex true functions (high bias), but their simplicity means different training sets yield similar models (low variance).
Complex models can represent almost any function (low bias), but they're sensitive to noise in each training set (high variance).
The bias-variance decomposition explains the characteristic "U-shaped" curve of test error as a function of model complexity:
Low complexity: High bias dominates → high test error (underfitting)
Optimal complexity: Bias and variance balanced → minimum test error
High complexity: High variance dominates → high test error (overfitting)
Note that training error decreases monotonically with complexity (more complex models fit training data better), while test error exhibits the U-shape.
The gap between training and test error grows with complexity — this is exactly the generalization gap, and regularization controls it.
The exact bias-variance decomposition holds for squared loss. For other losses (0-1 loss, cross-entropy), the decomposition takes different forms, but the fundamental insight remains: model complexity creates a tradeoff between underfitting (too simple) and overfitting (too complex).
Another lens on the same phenomenon decomposes the excess risk into approximation error and estimation error. This perspective is particularly useful in statistical learning theory.
Let $h^* \in \mathcal{H}$ be the best hypothesis in our class: $$h^* = \arg\min_{h \in \mathcal{H}} R(h)$$
Let $h^{\text{Bayes}}$ be the Bayes optimal predictor (best among all measurable functions).
Let $\hat{h}_S$ be the hypothesis learned from training data $S$.
The excess risk of $\hat{h}_S$ over Bayes optimal decomposes as:
$$R(\hat{h}_S) - R(h^{\text{Bayes}}) = \underbrace{\left(R(h^*) - R(h^{\text{Bayes}})\right)}_{\text{Approximation Error}} + \underbrace{\left(R(\hat{h}_S) - R(h^*)\right)}_{\text{Estimation Error}}$$
This decomposition reveals the fundamental tension in learning:
Larger $\mathcal{H}$: lower approximation error (the class is more likely to contain a hypothesis close to Bayes optimal), but higher estimation error (with finite data it is harder to identify a near-best hypothesis within the class).
Smaller $\mathcal{H}$: higher approximation error (even the best hypothesis in the class may be far from Bayes optimal), but lower estimation error (finite data suffices to find a near-best hypothesis).
This is the bias-complexity tradeoff restated: rich hypothesis classes have low approximation error but high estimation error; restricted classes have high approximation error but low estimation error.
Regularization addresses this by effectively reducing the complexity of $\mathcal{H}$ — trading increased approximation error for reduced estimation error.
Both the bias-variance and approximation-estimation decompositions describe the same phenomenon from different angles. Bias ≈ approximation error (model class too restricted), Variance ≈ estimation error (insufficient data for the complexity). Regularization controls the variance/estimation component by constraining the effective hypothesis class.
Empirical Risk Minimization (ERM) is the natural approach: find the hypothesis that minimizes training error:
$$\hat{h}_{\text{ERM}} = \arg\min_{h \in \mathcal{H}} \hat{R}_S(h)$$
ERM is intuitive and computationally natural, but ERM alone can fail catastrophically when $\mathcal{H}$ is sufficiently expressive.
Consider what happens when $\mathcal{H}$ is very rich (e.g., all measurable functions, or neural networks with unlimited capacity):
Zero training error is always achievable: There exist functions that exactly interpolate any training set.
Many such functions exist: Infinitely many hypotheses achieve zero training error.
ERM provides no guidance among them: All zero-training-error hypotheses are equally preferred by ERM.
Most such hypotheses generalize poorly: Random interpolations of noise have high true risk.
Consider fitting a polynomial to $n$ data points:
Data: $(x_1, y_1), \ldots, (x_n, y_n)$ with $n=10$ points
Hypothesis class: Polynomials of degree $d$
| Degree $d$ | Parameters | Training Error | Typical Test Error |
|---|---|---|---|
| 1 (linear) | 2 | High | Moderate |
| 3 (cubic) | 4 | Moderate | Low |
| 9 | 10 | Zero | Very high |
| 50 | 51 | Zero | Extremely high |
With degree $\geq n-1$, we can exactly interpolate $n$ points. But these interpolating polynomials oscillate wildly between data points, producing absurd predictions.
When a model can perfectly fit the training data, ERM loses its discriminative power. All interpolating solutions have equal empirical risk (zero), so ERM cannot distinguish between genuinely good hypotheses and those that merely memorize noise. This is why additional structure—regularization—is essential.
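A small simulation makes this concrete. The sketch below fits polynomials of increasing degree to $n = 10$ synthetic points; the degree-9 fit interpolates the training data exactly yet performs far worse off-sample. The data-generating function and noise level are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10
x_train = np.linspace(-1, 1, n)
y_train = np.sin(3 * x_train) + 0.2 * rng.normal(size=n)

x_test = np.linspace(-1, 1, 500)
y_test = np.sin(3 * x_test)          # noiseless ground truth used for evaluation

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # Degree 9 interpolates all 10 points (training error is numerically zero)
    # yet oscillates between them, so its test error is typically far larger.
    print(f"degree {degree}: train MSE = {train_mse:.2e}, test MSE = {test_mse:.2e}")
```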
Recall that the VC dimension $d_{VC}(\mathcal{H})$ measures the complexity of a hypothesis class. Generalization bounds tell us:
$$R(h) \leq \hat{R}_S(h) + O\left(\sqrt{\frac{d_{VC}(\mathcal{H})}{n}}\right)$$
For ERM with large $d_{VC}$, the complexity term dominates and the bound becomes loose or even vacuous: achieving zero training error then provides no guarantee that the true risk is small.
Regularization effectively reduces $d_{VC}$, tightening the generalization bound and ensuring low empirical risk actually implies low true risk.
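To get a feel for how the capacity term scales, the following sketch evaluates the $\sqrt{d_{VC}/n}$ factor for a few hypothetical capacity and sample-size combinations, ignoring the constants and logarithmic factors that appear in the full bound.

```python
import math

def complexity_term(d_vc, n):
    """Order of magnitude of the capacity term sqrt(d_VC / n), constants omitted."""
    return math.sqrt(d_vc / n)

for d_vc in (10, 100, 10_000):
    for n in (1_000, 100_000):
        print(f"d_VC = {d_vc:6d}, n = {n:7d}: bound slack ~ {complexity_term(d_vc, n):.3f}")
```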
Since ERM alone is insufficient, we need additional mechanisms to control generalization:
Option 1: Restrict $\mathcal{H}$ explicitly, for example by allowing only low-degree polynomials or small networks. This controls complexity but is rigid: the right restriction is hard to choose in advance and may exclude good hypotheses.
Option 2: Regularize the optimization, keeping a rich $\mathcal{H}$ but penalizing complex hypotheses so that the learner prefers simpler ones.
Option 3: Use more data, which shrinks the complexity term in the generalization bound; in practice, additional data is often expensive or unavailable.
Regularization (Option 2) is the practical, principled solution. It allows using expressive hypothesis classes while controlling overfitting through mathematical constraints.
Having established why we need additional structure beyond ERM, let us formally define regularization and its key properties.
Regularization is any modification to the learning algorithm that is intended to reduce generalization error but not training error.
More specifically, regularization replaces pure ERM: $$\hat{h} = \arg\min_{h \in \mathcal{H}} \hat{R}_S(h)$$
with regularized ERM: $$\hat{h} = \arg\min_{h \in \mathcal{H}} \left[ \hat{R}_S(h) + \lambda \Omega(h) \right]$$
where $\Omega: \mathcal{H} \to \mathbb{R}_{\geq 0}$ is the regularizer, a function that assigns larger values to more "complex" hypotheses, and $\lambda \geq 0$ is a hyperparameter controlling the strength of the penalty.
Regularization introduces a preference among hypotheses beyond their training error. A hypothesis that fits training data slightly worse but is "simpler" (lower Ω(h)) may be preferred over one that fits perfectly but is complex. This preference toward simplicity embodies Occam's Razor mathematically.
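For linear hypotheses with squared loss and an L2 regularizer, regularized ERM has a closed-form solution, which makes the effect of the penalty easy to inspect. The sketch below uses the sum-of-squares convention $\|Xw - y\|^2 + \lambda\|w\|^2$; the data-generating setup (more parameters than samples) is invented for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||Xw - y||^2 + lam * ||w||^2 (regularized ERM with an L2 penalty)."""
    d = X.shape[1]
    # Closed-form minimizer: w = (X^T X + lam * I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(3)
n, d = 30, 60                                 # more parameters than samples
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = (2.0, -1.0, 0.5)                 # only a few truly relevant features
y = X @ w_true + 0.1 * rng.normal(size=n)

for lam in (1e-6, 1.0, 100.0):
    w = ridge_fit(X, y, lam)
    print(f"lambda = {lam:8.2g}: ||w|| = {np.linalg.norm(w):8.3f}, "
          f"train MSE = {np.mean((X @ w - y) ** 2):.4f}")
```

With a tiny $\lambda$ the fit drives training error to nearly zero using very large weights; increasing $\lambda$ trades a little training error for much smaller, better-behaved weights.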
An effective regularizer $\Omega(h)$ should:
Penalize complexity: Higher values for hypotheses likely to overfit
Be computationally tractable: Add minimal overhead to optimization
Be meaningful for the problem: Capture relevant notions of simplicity
Have tunable strength: $\lambda$ controls the bias-variance tradeoff
Preserve key solutions: Still allow good approximations when $\lambda$ is small
| Regularizer | Formula | Effect | When to Use |
|---|---|---|---|
| L2 (Ridge) | $\Omega(w) = \Vert w\Vert_2^2$ | Shrinks all weights toward zero | Default choice, prevents large weights |
| L1 (Lasso) | $\Omega(w) = \Vert w\Vert_1$ | Drives some weights to exactly zero | Feature selection, sparse models |
| Elastic Net | $\alpha\Vert w\Vert_1 + (1-\alpha)\Vert w\Vert_2^2$ | Combines L1 and L2 effects | Correlated features, need sparsity + stability |
| Dropout | Random neuron masking | Ensemble averaging effect | Deep neural networks |
| Early Stopping | Stop before convergence | Limits effective model complexity | When validation error increases |
| Data Augmentation | Expand training set | Reduces variance, adds invariances | When domain knowledge suggests augmentations |
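The contrast between L1 and L2 is clearest in one dimension: the L2 penalty shrinks every weight by a constant factor, while the L1 penalty subtracts a constant amount and clips at zero (soft-thresholding), which is why it produces exact zeros. A minimal sketch of the two proximal updates, with an arbitrary weight vector and penalty strength chosen for illustration:

```python
import numpy as np

def prox_l2(w, lam):
    """Proximal operator of (lam/2) * w^2: uniform shrinkage toward zero, never exactly zero."""
    return w / (1.0 + lam)

def prox_l1(w, lam):
    """Proximal operator of lam * |w|: soft-thresholding; weights smaller than lam become exactly zero."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w = np.array([3.0, 0.8, 0.3, -0.05, -2.0])
lam = 0.5
print("L2 shrinkage:      ", prox_l2(w, lam))
print("L1 soft-threshold: ", prox_l1(w, lam))
```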
The hyperparameter $\lambda$ controls the tradeoff:
$\lambda = 0$: Pure ERM, no regularization
$\lambda \to \infty$: Regularization dominates
Optimal $\lambda$: Balances fit and complexity
The existence of an optimal intermediate $\lambda$ reflects the bias-variance tradeoff: too little regularization → high variance; too much → high bias.
Regularization can be understood from multiple complementary perspectives. Each provides different insights and suggests different algorithms.
Regularization as a constrained optimization problem:
$$\min_{h \in \mathcal{H}} \hat{R}_S(h) \quad \text{subject to} \quad \Omega(h) \leq C$$
This is equivalent (for appropriate $C$ and $\lambda$) to the penalized form:
$$\min_{h \in \mathcal{H}} \left[ \hat{R}_S(h) + \lambda \Omega(h) \right]$$
Interpretation: We seek the best fit among hypotheses with bounded complexity. The constraint $\Omega(h) \leq C$ defines a "capacity budget."
Practical implication: This view is useful for understanding geometric properties (constraint set shapes) and for algorithms like projected gradient descent.
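As a sketch of how the constraint view translates into an algorithm, the following implements projected gradient descent for least-squares loss under an L2-ball constraint $\|w\|_2 \leq C$: take a gradient step, then project back onto the ball. The data, step size, and values of $C$ are illustrative choices.

```python
import numpy as np

def project_l2_ball(w, C):
    """Project w onto the constraint set {w : ||w||_2 <= C}."""
    norm = np.linalg.norm(w)
    return w if norm <= C else w * (C / norm)

def projected_gradient_descent(X, y, C, lr=0.01, steps=500):
    """Minimize the mean squared error subject to ||w||_2 <= C."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2.0 / len(y) * X.T @ (X @ w - y)   # gradient of the mean squared error
        w = project_l2_ball(w - lr * grad, C)     # gradient step, then projection
    return w

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=50)

for C in (0.5, 2.0, 10.0):
    w = projected_gradient_descent(X, y, C)
    print(f"C = {C:5.1f}: ||w|| = {np.linalg.norm(w):.3f}")
```

Smaller capacity budgets $C$ yield smaller-norm solutions, mirroring the effect of a larger $\lambda$ in the penalized form.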
Regularization as maximum a posteriori (MAP) estimation with a prior:
$$\hat{h}_{\text{MAP}} = \arg\max_{h} \left[ P(S \mid h) \cdot P(h) \right]$$
Taking negative log and assuming Gaussian likelihood:
$$\hat{h}_{\text{MAP}} = \arg\min_{h} \left[ -\log P(S \mid h) - \log P(h) \right]$$
With appropriate priors, familiar regularizers emerge: a zero-mean Gaussian prior on the weights yields an L2 penalty, and a Laplace prior yields an L1 penalty.
Interpretation: Regularization encodes prior beliefs about which hypotheses are more plausible. Simpler hypotheses have higher prior probability.
Practical implication: This view connects to Bayesian inference, uncertainty quantification, and hierarchical models.
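To make the correspondence explicit for a linear model with weights $w$, assume a Gaussian likelihood with noise variance $\sigma^2$ and an independent zero-mean Gaussian prior on the weights with variance $\tau^2$. Dropping constants independent of $w$ and rescaling by $2\sigma^2$ (which does not change the minimizer) gives:

$$\begin{aligned}
\hat{w}_{\text{MAP}}
&= \arg\min_{w}\ \Big[ -\log P(S \mid w) - \log P(w) \Big] \\
&= \arg\min_{w}\ \Big[ \tfrac{1}{2\sigma^2} \textstyle\sum_{i=1}^{n} \big(y_i - w^\top x_i\big)^2 + \tfrac{1}{2\tau^2} \Vert w \Vert_2^2 \Big] \\
&= \arg\min_{w}\ \Big[ \textstyle\sum_{i=1}^{n} \big(y_i - w^\top x_i\big)^2 + \lambda \Vert w \Vert_2^2 \Big],
\qquad \lambda = \frac{\sigma^2}{\tau^2}
\end{aligned}$$

Replacing the Gaussian prior with a Laplace prior and repeating the same steps yields the L1 penalty.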
Regularization as controlling effective model capacity:
Even if $\mathcal{H}$ is infinite, regularization restricts the search to an effective hypothesis class $\mathcal{H}_{\text{eff}}(\lambda)$:
$$\mathcal{H}_{\text{eff}}(\lambda) = \{h \in \mathcal{H} : \Omega(h) \leq C(\lambda)\}$$
As $\lambda$ increases, $\mathcal{H}_{\text{eff}}$ shrinks, reducing effective VC dimension and improving generalization bounds.
Interpretation: We use a rich hypothesis class but explore only a "small" (low-complexity) part of it.
Practical implication: This justifies using neural networks with millions of parameters — regularization ensures we don't actually use all that capacity.
These three views are mathematically equivalent under appropriate conditions, but they suggest different extensions: the constraint view inspires optimization algorithms, the Bayesian view suggests uncertainty quantification, and the capacity view connects to generalization theory. A complete understanding leverages all three perspectives.
Structural Risk Minimization (SRM), introduced by Vapnik, provides a principled framework for regularization based directly on generalization bounds.
SRM is based on the bound: $$R(h) \leq \hat{R}_S(h) + \text{Complexity}(\mathcal{H}, n, \delta)$$
where the complexity term depends on the VC dimension or other complexity measures.
SRM approach: define a nested hierarchy of hypothesis classes $\mathcal{H}_1 \subset \mathcal{H}_2 \subset \cdots$ of increasing complexity, run ERM within each class to obtain $\hat{h}_k$, and select the index $k$ that minimizes the bound (empirical risk plus complexity term) rather than the empirical risk alone.
SRM and regularization are deeply connected: regularization implements SRM implicitly, with the penalty playing the role of the complexity term. Compare the two objectives:
The SRM objective: $$\min_k \left[\hat{R}_S(\hat{h}_k) + \text{Complexity}(k)\right]$$
The regularization objective: $$\min_h \left[\hat{R}_S(h) + \lambda \Omega(h)\right]$$
With appropriate correspondence between $\lambda$ and the hierarchy index $k$, these are equivalent approaches to the same goal: balancing data fit with model complexity.
While the theoretical SRM framework uses generalization bounds to select complexity, in practice we typically use cross-validation to select regularization strength. Cross-validation empirically estimates the bound without computing it analytically, and often performs better because actual generalization bounds can be loose.
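A minimal sketch of this practice: select $\lambda$ for ridge regression by K-fold cross-validation, choosing the value with the lowest average validation error. The grid of $\lambda$ values, the fold count, and the synthetic data are arbitrary illustrative choices.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form minimizer of ||Xw - y||^2 + lam * ||w||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_error(X, y, lam, k=5):
    """Average validation MSE of ridge regression over k folds."""
    n = len(y)
    folds = np.array_split(np.random.default_rng(0).permutation(n), k)
    errors = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), val_idx)
        w = ridge_fit(X[train_idx], y[train_idx], lam)
        errors.append(np.mean((X[val_idx] @ w - y[val_idx]) ** 2))
    return np.mean(errors)

rng = np.random.default_rng(5)
n, d = 80, 40
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d) * (rng.random(d) < 0.2)   # mostly-zero ground truth
y = X @ w_true + 0.5 * rng.normal(size=n)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = {lam: cv_error(X, y, lam) for lam in lambdas}
for lam in lambdas:
    print(f"lambda = {lam:7.2f}: CV MSE = {scores[lam]:.4f}")
print(f"selected lambda = {min(scores, key=scores.get)}")
```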
Polynomial regression: restrict the degree, or keep a high degree but penalize the magnitude of the coefficients (ridge or lasso on the polynomial weights).
Neural networks: weight decay (an L2 penalty on the weights), dropout, early stopping, and data augmentation.
Decision trees: limit the maximum depth, require a minimum number of samples per leaf, or prune the tree after growing it.
We have established the fundamental motivation for regularization from multiple theoretical perspectives. Let us consolidate the key insights:
Overfitting arises because finite training data cannot distinguish genuine patterns from noise when the hypothesis class is rich; the generalization gap depends on the complexity of the class, not just the selected hypothesis.
The bias-variance and approximation-estimation decompositions describe the same tradeoff from different angles: model complexity must be matched to the available data.
ERM alone fails in the interpolation regime, where many hypotheses achieve zero training error and ERM cannot prefer among them.
Regularization restores this preference by adding a complexity penalty (equivalently, a constraint, a prior, or a reduction in effective capacity), trading a small increase in bias for a large reduction in variance.
With the motivation established, the following pages examine regularization from specific perspectives in depth.
Each perspective deepens our understanding and suggests different practical approaches.
You now understand the fundamental motivation for regularization: the bias-variance and approximation-estimation tradeoffs, why ERM alone fails, and how regularization provides the additional structure necessary for generalization. This foundation supports the detailed analysis of specific regularization approaches in subsequent pages.