We have seen that regularization controls model complexity, acting as a constraint on parameters or encoding prior beliefs about simplicity. But how does this provably improve generalization? How do we go from intuitive notions of "simpler models generalize better" to rigorous mathematical guarantees?
This page connects regularization to the core of statistical learning theory—generalization bounds. We will see how regularization reduces effective capacity, tightens the generalization gap, and provides sample complexity improvements that enable learning with limited data.
By the end of this page, you will understand: (1) how regularization reduces effective model complexity, (2) the relationship between regularization and generalization bounds, (3) stability-based arguments for generalization, (4) how regularization affects sample complexity, (5) the role of margin theory in classification, and (6) practical implications for model selection.
The hypothesis class $\mathcal{H}$ defines the set of all possible models. Regularization doesn't change $\mathcal{H}$, but it restricts the effective hypothesis class — the models that can actually be reached through optimization.
With regularization strength $\lambda$, define the effective hypothesis class:
$$\mathcal{H}_{\text{eff}}(\lambda) = \{h_w : \Omega(w) \leq C(\lambda)\}$$
where $C(\lambda)$ is the constraint budget corresponding to regularization strength $\lambda$.
Key insight: As $\lambda$ increases, the budget $C(\lambda)$ shrinks, so the effective hypothesis class $\mathcal{H}_{\text{eff}}(\lambda)$ contracts and the effective capacity available to the learner drops.
Several measures quantify hypothesis class complexity:
1. VC Dimension: The maximum number of points that can be shattered $$d_{\text{VC}}(\mathcal{H}_{\text{eff}}) \leq d_{\text{VC}}(\mathcal{H})$$
2. Rademacher Complexity: Expected correlation with random noise $$\mathfrak{R}_n(\mathcal{H}_{\text{eff}}) \leq \mathfrak{R}_n(\mathcal{H})$$
3. Covering Numbers: The number of balls of radius $\epsilon$ needed to cover the class $$\mathcal{N}(\epsilon, \mathcal{H}_{\text{eff}}) \leq \mathcal{N}(\epsilon, \mathcal{H})$$
Regularization reduces all these complexity measures, tightening generalization bounds.
| Measure | Without Regularization | With Regularization |
|---|---|---|
| VC Dimension | May be infinite | Bounded by constraint |
| Rademacher | $\mathfrak{R}_n(\mathcal{H})$ | $\mathfrak{R}_n(\mathcal{H}_{\text{eff}}) \lesssim \sqrt{C/n}$ |
| Covering Number | Grows with $\mathcal{H}$ | Controlled by $\Omega(w) \leq C$ |
| Fat-Shattering | Depends on margin | Explicit margin control |
In linear regression with L2 regularization, the effective degrees of freedom provides a precise measure: $\text{df}(\lambda) = \sum_i \frac{d_i^2}{d_i^2 + \lambda}$ where $d_i$ are singular values of $X$. As λ → 0, df → p (full model). As λ → ∞, df → 0 (null model). This interpolates between extreme complexities.
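To make this interpolation concrete, here is a minimal numerical sketch (the random design matrix and sample sizes are illustrative assumptions, not taken from the text) that computes $\text{df}(\lambda)$ from the singular values of $X$:

```python
# Minimal sketch: effective degrees of freedom of Ridge regression,
# df(lambda) = sum_i d_i^2 / (d_i^2 + lambda), from the singular values of X.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))          # assumed: n = 100 samples, p = 10 features
d = np.linalg.svd(X, compute_uv=False)  # singular values d_1, ..., d_p

def effective_df(lam, singular_values):
    """Effective degrees of freedom for Ridge with strength lam."""
    d2 = singular_values ** 2
    return np.sum(d2 / (d2 + lam))

for lam in [0.0, 1.0, 10.0, 100.0, 1e6]:
    print(f"lambda = {lam:>9g}  ->  df = {effective_df(lam, d):.3f}")
# df -> p as lambda -> 0, and df -> 0 as lambda -> infinity.
```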
Generalization bounds relate empirical risk to true risk, and regularization provably tightens these bounds.
For a hypothesis class $\mathcal{H}$ with finite VC dimension $d$, with probability at least $1 - \delta$:
$$R(h) \leq \hat{R}_S(h) + O\left(\sqrt{\frac{d \log(n/d) + \log(1/\delta)}{n}}\right)$$
The gap between true and empirical risk — the generalization gap — scales as $O(\sqrt{d/n})$.
With regularization, we work with $\mathcal{H}_{\text{eff}}$ instead of $\mathcal{H}$:
$$R(h) \leq \hat{R}_S(h) + O\left(\sqrt{\frac{d_{\text{eff}} \log(n/d_{\text{eff}}) + \log(1/\delta)}{n}}\right)$$
where $d_{\text{eff}} \leq d$ is the effective complexity.
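As a rough illustration (the numbers below are invented for the example and constants hidden in the $O(\cdot)$ are ignored), the complexity term of the bound can be evaluated directly to see how shrinking the effective dimension tightens it:

```python
# Minimal sketch: evaluate sqrt((d log(n/d) + log(1/delta)) / n) for the full
# class (large d) and for regularized effective classes (smaller d_eff).
import numpy as np

def vc_complexity_term(d, n, delta=0.05):
    return np.sqrt((d * np.log(n / d) + np.log(1 / delta)) / n)

n = 1000
for d in [500, 100, 20]:   # assumed: d = 500 as the full class, smaller values as d_eff
    print(f"d = {d:>3}  ->  complexity term ~ {vc_complexity_term(d, n):.3f}")
# Smaller effective dimension gives a tighter bound at the same sample size.
```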
The tradeoff: stronger regularization shrinks $d_{\text{eff}}$ and hence the complexity term, but an overly restricted class may no longer contain a good fit, raising the empirical risk $\hat{R}_S(h)$.
Optimal regularization minimizes the bound, not just $\hat{R}_S(h)$.
For bounded hypothesis classes, Rademacher complexity provides tight bounds.
Definition: The empirical Rademacher complexity is: $$\hat{\mathfrak{R}}_S(\mathcal{H}) = \mathbb{E}_\sigma\left[\sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \sigma_i h(x_i)\right]$$
where $\sigma_i \in \{-1, +1\}$ are Rademacher random variables.
Generalization bound: $$R(h) \leq \hat{R}_S(h) + 2\mathfrak{R}_n(\mathcal{H}) + \sqrt{\frac{\log(1/\delta)}{2n}}$$
With L2 regularization (bounded $|w|_2 \leq B$): $$\mathfrak{R}_n(\mathcal{H}_B) \leq \frac{B R_{\max}}{\sqrt{n}}$$
where $R_{\max}$ bounds $|x|_2$. The bound improves with smaller $B$ (stronger regularization).
The key theoretical insight: regularization directly controls the Rademacher complexity of the hypothesis class. For linear classifiers with constraint $|w|_2 \leq B$, we have $\mathfrak{R}_n \leq O(B/\sqrt{n})$. Smaller B (stronger regularization) → smaller Rademacher complexity → tighter generalization bound.
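This claim can be checked numerically. The sketch below (synthetic Gaussian data, an assumption for illustration) Monte Carlo estimates the empirical Rademacher complexity of the L2-bounded linear class, using the closed form $\sup_{|w|_2 \leq B} \frac{1}{n}\sum_i \sigma_i w^\top x_i = \frac{B}{n}\left|\sum_i \sigma_i x_i\right|_2$, and compares it to the $B R_{\max}/\sqrt{n}$ bound:

```python
# Minimal sketch: Monte Carlo estimate of the empirical Rademacher complexity
# of the class {x -> <w, x> : ||w||_2 <= B}.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 50                            # assumed sample size and dimension
X = rng.normal(size=(n, d))
R_max = np.linalg.norm(X, axis=1).max()   # bound on ||x_i||_2

def rademacher_estimate(X, B, n_trials=2000):
    n = X.shape[0]
    vals = []
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=n)        # Rademacher variables
        vals.append(B / n * np.linalg.norm(sigma @ X)) # closed-form supremum
    return np.mean(vals)

for B in [0.1, 1.0, 10.0]:
    est = rademacher_estimate(X, B)
    bound = B * R_max / np.sqrt(n)
    print(f"B = {B:>4}: estimate = {est:.4f}, bound B*R_max/sqrt(n) = {bound:.4f}")
# Shrinking B shrinks the complexity estimate proportionally, as the bound predicts.
```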
An alternative route to understanding generalization is through algorithmic stability: algorithms that don't change much when training data changes tend to generalize well.
Definition: An algorithm $\mathcal{A}$ is $\beta$-uniformly stable if for all training sets $S$ differing in one point:
$$\sup_{z} |\ell(\mathcal{A}(S), z) - \ell(\mathcal{A}(S'), z)| \leq \beta$$
where $S$ and $S'$ differ in exactly one training example.
Intuition: Replacing one training point changes predictions by at most $\beta$.
Generalization theorem: If $\mathcal{A}$ is $\beta$-stable, then: $$|R(\mathcal{A}(S)) - \hat{R}_S(\mathcal{A}(S))| \leq O(\beta)$$
Stable algorithms have small generalization gaps.
Theorem: For strongly convex loss with $\lambda$-regularization, the ERM solution satisfies: $$\beta = O\left(\frac{1}{\lambda n}\right)$$
Implication: Stronger regularization (larger $\lambda$) → smaller $\beta$ → more stable → better generalization.
For Ridge regression: $$|\hat{w}_S - \hat{w}_{S'}|_2 \leq \frac{C}{\lambda n}$$
The regularized solution changes by at most $O(1/(\lambda n))$ when one point is replaced.
Key insight: Regularization makes the algorithm insensitive to individual training points. This is exactly what prevents overfitting — the model can't "memorize" specific points if removing them doesn't change predictions much.
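A small experiment makes the stability claim tangible. The sketch below (synthetic data and a particular replacement point, both assumptions for illustration) refits Ridge after swapping one training example and measures how far the solution moves for several values of $\lambda$:

```python
# Minimal sketch: how much does the Ridge solution move when one training
# point is replaced?
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 20
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

def ridge(X, y, lam):
    """Closed-form Ridge estimator (X^T X + lam I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Replace the first training point with a fresh one.
X2, y2 = X.copy(), y.copy()
X2[0] = rng.normal(size=p)
y2[0] = X2[0] @ rng.normal(size=p)

for lam in [0.1, 1.0, 10.0, 100.0]:
    shift = np.linalg.norm(ridge(X, y, lam) - ridge(X2, y2, lam))
    print(f"lambda = {lam:>6}:  ||w_S - w_S'||_2 = {shift:.5f}")
# The shift shrinks as lambda grows: stronger regularization means greater stability.
```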
Stability and capacity control are two different lenses on the same phenomenon. Capacity control says regularization restricts the hypothesis class. Stability says regularization makes the algorithm less sensitive to data. Both lead to the same conclusion: better generalization. The stability view is particularly powerful for analyzing algorithms beyond ERM.
For classification, regularization has a particularly elegant interpretation through margin theory: it encourages large-margin solutions that provably generalize better.
For binary classification with linear classifier $h_w(x) = \text{sign}(w^\top x)$, the margin of a point $(x_i, y_i)$ is:
$$\gamma_i = y_i \cdot (w^\top x_i)$$
Positive margin means correct classification; larger margin means more confident.
The geometric margin normalizes by $|w|_2$: $$\tilde{\gamma}_i = \frac{y_i \cdot (w^\top x_i)}{|w|_2}$$
This is the Euclidean distance from $x_i$ to the decision boundary.
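As a quick worked example (the weight vector and points are arbitrary toy values), both margin definitions can be computed directly:

```python
# Minimal sketch: functional margin y_i * <w, x_i> and geometric margin
# y_i * <w, x_i> / ||w||_2 for a fixed linear classifier.
import numpy as np

w = np.array([2.0, -1.0])                              # assumed classifier weights
X = np.array([[1.0, 0.5], [-0.5, 1.0], [0.2, -0.3]])   # toy points
y = np.array([1, -1, 1])                               # toy labels

functional_margin = y * (X @ w)
geometric_margin = functional_margin / np.linalg.norm(w)

print("functional margins:", functional_margin)   # positive -> correctly classified
print("geometric margins: ", geometric_margin)    # distance to the decision boundary
```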
Theorem (Margin Bound): For linear classifiers with margin $\gamma$ on all training points:
$$R(h) \leq \hat{R}_S(h) + O\left(\sqrt{\frac{R_X^2}{\gamma^2 n}}\right)$$
where $R_X = \max_i |x_i|_2$.
Key insight: Larger margin $\gamma$ → tighter bound → better generalization.
The bound doesn't depend on the number of features $d$ — it depends on the margin! This is why SVMs can work in very high or even infinite dimensions.
Consider the hard-margin SVM problem: $$\max_w \gamma \quad \text{s.t.} \quad y_i(w^\top x_i) \geq \gamma, \quad |w|_2 = 1$$
This is equivalent to: $$\min_w |w|_2^2 \quad \text{s.t.} \quad y_i(w^\top x_i) \geq 1$$
This is L2 regularization! Minimizing $|w|_2^2$ subject to correct classification is the same as maximizing margin.
For soft-margin SVM: $$\min_w \frac{1}{2}|w|_2^2 + C\sum_i \xi_i$$
The regularization term $|w|_2^2$ encourages large margins; the slack variables $\xi_i$ allow margin violations for noisy data. $C$ controls the tradeoff.
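For a concrete sketch, scikit-learn's `SVC` with a linear kernel solves this soft-margin objective (the blob dataset and the specific $C$ values are illustrative assumptions); sweeping $C$ shows how stronger regularization keeps $|w|_2$ small and therefore the margin, roughly $1/|w|_2$, large:

```python
# Minimal sketch: soft-margin linear SVM for several values of C.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Assumed toy dataset: two overlapping Gaussian blobs.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w_norm = np.linalg.norm(clf.coef_)
    print(f"C = {C:>6}: ||w||_2 = {w_norm:.3f}, margin ~ {1.0 / w_norm:.3f}")
# Small C (strong regularization) keeps ||w||_2 small, i.e. the margin large;
# large C prioritizes fitting the slack terms and shrinks the margin.
```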
Support Vector Machines crystallize the regularization-margin connection. The SVM objective is precisely L2-regularized hinge loss. The regularization term encourages large margins, which are provably associated with good generalization. This theoretical foundation made SVMs the dominant classifier for years before deep learning.
Sample complexity is the number of training examples needed to achieve a desired accuracy. Regularization directly improves sample complexity.
For hypothesis class $\mathcal{H}$ with VC dimension $d$, to achieve error $\epsilon$ with probability $1 - \delta$:
$$n = O\left(\frac{d \log(1/\epsilon) + \log(1/\delta)}{\epsilon^2}\right)$$
Sample complexity scales with VC dimension. For high-dimensional models ($d$ large), this can be prohibitive.
With constraint $|w|_2 \leq B$, the effective sample complexity becomes:
$$n = O\left(\frac{B^2 R_X^2 + \log(1/\delta)}{\epsilon^2}\right)$$
Key difference: Depends on $B^2$ (weight bound), not $d$ (dimension)!
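A back-of-the-envelope comparison (illustrative constants only, log factors dropped) makes the difference stark:

```python
# Minimal sketch: dimension-dependent vs. norm-dependent sample complexity.
eps, d = 0.05, 10_000          # assumed target error and ambient dimension
B, R_X = 2.0, 1.0              # assumed weight-norm bound and data radius

n_unregularized = d / eps**2                # scales with the dimension d
n_regularized   = (B**2 * R_X**2) / eps**2  # scales with the weight constraint

print(f"unregularized: n ~ {n_unregularized:,.0f}")
print(f"regularized:   n ~ {n_regularized:,.0f}")
# With a modest norm bound, the required sample size no longer grows with d.
```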
Implications:
| Setting | Unregularized | Regularized ($|w|_2 \leq B$) |
|---|---|---|
| Dependence on $d$ | $O(d/\epsilon^2)$ | $O(1)$ (no $d$ dependence!) |
| High-dimensional ($d > n$) | Impossible | Possible with small $B$ |
| Practical regime | Need $n \gg d$ | Need $n \gtrsim B^2 R_X^2 / \epsilon^2$ |
| Controlling factor | VC dimension | Weight constraint |
This is why regularized linear models work even when the number of features exceeds the number of samples (p > n setting). The effective sample complexity depends on the regularization strength, not the dimension. Strong regularization compensates for limited data by restricting the hypothesis class.
Sample complexity depends on regularization through a tradeoff:
Strong regularization (small $B$): the generalization gap is controlled with fewer samples, but the restricted class may not contain a good predictor, so approximation error (bias) can be high.
Weak regularization (large $B$): the class is rich enough to fit the target well, but controlling the generalization gap requires more samples.
Optimal regularization: Choose $B$ (or $\lambda$) to minimize total error given available sample size.
The bias-variance decomposition provides direct insight into how regularization affects generalization.
For squared loss, the expected prediction error decomposes as:
$$\mathbb{E}[(Y - \hat{f}(X))^2] = \sigma^2 + \text{Bias}^2 + \text{Variance}$$
For regularized estimators:
Bias (increases with $\lambda$): $$\text{Bias} = f(x) - \mathbb{E}[\hat{f}_\lambda(x)]$$
Variance (decreases with $\lambda$): $$\text{Variance} = \mathbb{E}[(\hat{f}_\lambda(x) - \mathbb{E}[\hat{f}_\lambda(x)])^2]$$
For Ridge regression with design matrix $X$:
$$\hat{w}_\lambda = (X^\top X + \lambda I)^{-1} X^\top y$$
Let $X = U D V^\top$ be the SVD with singular values $d_1, \ldots, d_p$.
Bias term: $$|\mathbb{E}[\hat{w}_\lambda] - w^*|^2 = \sum_j \left(\frac{\lambda}{d_j^2 + \lambda}\right)^2 (v_j^\top w^*)^2$$ where $w^*$ denotes the true weight vector.
Variance term: $$\text{Var}(\hat{w}_\lambda) = \sigma^2 \sum_j \frac{d_j^2}{(d_j^2 + \lambda)^2}$$
| $\lambda$ | Bias | Variance | Total Error |
|---|---|---|---|
| 0 (OLS) | 0 | High (esp. if $d_j$ small) | High variance |
| Small | Small | Reduced | Improved |
| Optimal | Medium | Medium | Minimized |
| Large | High | Low | High bias |
| ∞ | Maximum | 0 | Only bias |
Key insight: There exists an optimal $\lambda^*$ that minimizes total error. This is the oracle regularization; choosing it is the goal of cross-validation and other model selection procedures.
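The bias and variance expressions above can be evaluated directly. The sketch below (synthetic design matrix, true weights, and noise level are assumptions for illustration) sweeps $\lambda$ over a grid and reports the minimizer of $\text{Bias}^2 + \text{Variance}$:

```python
# Minimal sketch: evaluate the SVD bias and variance expressions for Ridge
# over a grid of lambda values and locate the minimizer of their sum.
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma2 = 100, 20, 1.0            # assumed problem size and noise variance
X = rng.normal(size=(n, p))
w_true = rng.normal(size=p)            # assumed true weight vector w*

U, D, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(D) V^T
proj = Vt @ w_true                                 # components v_j^T w*

def bias2(lam):
    return np.sum((lam / (D**2 + lam))**2 * proj**2)

def variance(lam):
    return sigma2 * np.sum(D**2 / (D**2 + lam)**2)

lams = np.logspace(-3, 3, 200)
total = np.array([bias2(l) + variance(l) for l in lams])
best = lams[np.argmin(total)]
print(f"lambda* ~ {best:.3f}  (bias^2 = {bias2(best):.3f}, variance = {variance(best):.3f})")
```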
Regularization is particularly important when the design matrix is ill-conditioned (some singular values near zero). Without regularization, the variance term explodes because $1/d_j^2$ can be huge. Regularization adds $\lambda$ to the denominator, stabilizing the solution. This is why Ridge regression is also called "regularized least squares."
Recent discoveries have challenged the classical bias-variance picture, revealing richer phenomena in overparameterized models.
Classical learning theory predicts a U-shaped test-error curve: as model capacity grows, test error first falls as bias decreases, then rises as variance dominates, so the best model lies below the point where it can fit the training data exactly.
Recent observations show a double descent curve: test error follows the classical U-shape, peaks near the interpolation threshold where training error reaches zero, and then decreases again as the model becomes even more overparameterized.
At the interpolation threshold, the model exactly fits training data but with extremely large weights. Beyond it, there are many interpolating solutions, and the algorithm (implicit regularization) selects good ones.
Even without explicit regularization, some algorithms have implicit regularization effects:
Gradient descent from small initialization: for linear models, plain gradient descent started near zero converges to a minimum-norm interpolating solution (see the sketch after this list).
Early stopping: halting training after a limited number of iterations restricts how far the weights can move from initialization, acting much like an L2 penalty.
Stochastic gradient descent: the noise in minibatch updates tends to bias the solution toward flatter minima, which are often associated with better generalization.
Neural networks: architectural choices such as convolution and weight sharing, together with the training dynamics, impose structure beyond any explicit penalty.
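The first of these effects is easy to demonstrate. The sketch below (an underdetermined least-squares problem with assumed dimensions and step size) runs plain gradient descent from zero and checks that it lands on the minimum-L2-norm interpolating solution:

```python
# Minimal sketch: implicit regularization of gradient descent on an
# underdetermined least-squares problem (p > n).
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                      # assumed: more parameters than samples
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

w = np.zeros(p)                     # small (here: zero) initialization
lr = 1e-3                           # assumed step size, small enough to converge
for _ in range(50_000):             # plain gradient descent on squared loss
    w -= lr * X.T @ (X @ w - y)

w_min_norm = X.T @ np.linalg.solve(X @ X.T, y)   # minimum-norm interpolator
print("training residual:         ", np.linalg.norm(X @ w - y))
print("distance to min-norm soln: ", np.linalg.norm(w - w_min_norm))
# Both numbers are near zero: GD interpolates and picks the minimum-norm solution.
```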
Classical learning theory isn't wrong — it's incomplete. The classical bounds apply to worst-case or uniform analysis. Modern deep learning operates in regimes where the specific algorithm, initialization, and structure induce additional regularization beyond what explicit penalties provide. Understanding both classical bounds and modern phenomena is essential.
Even in overparameterized settings, explicit regularization helps:
Benefits: explicit penalties such as weight decay and dropout reduce variance, improve robustness to label noise, and typically improve test accuracy even for models that can interpolate the training data.
Practical guidance: do not rely on implicit regularization alone; tune the explicit regularization strength on a validation set and treat it as one more hyperparameter that interacts with architecture and optimizer choices.
We have established rigorous connections between regularization and generalization. The key insights: regularization restricts the effective hypothesis class and shrinks its complexity measures; smaller complexity tightens VC and Rademacher generalization bounds; regularized algorithms are more stable, which independently implies a small generalization gap; in classification, L2 regularization amounts to margin maximization; and sample complexity can depend on the weight constraint rather than the dimension, which is what makes learning possible when $p > n$.
With the theoretical foundations established, the final page provides a comprehensive tour of Common Regularizers — the specific penalty functions used in practice. We'll analyze L1 (Lasso), L2 (Ridge), Elastic Net, and modern techniques like dropout and batch normalization, understanding their properties and when to use each.
You now understand how regularization provably improves generalization through multiple theoretical lenses: capacity control, algorithmic stability, margin theory, and sample complexity. These insights connect the intuitive "simpler models generalize better" to rigorous mathematical guarantees.