We have established that machine learning seeks to minimize true risk $R(h)$—the expected loss on new data—but can only directly minimize empirical risk $\hat{R}_S(h)$—the average loss on training data.
This raises the central question of statistical learning:
When does low training error imply low test error?
Or more precisely: What is the relationship between $R(h)$ and $\hat{R}_S(h)$, and how can we bound the difference?
This question is not merely theoretical. Every time you train a model and evaluate on a test set, you are witnessing the empirical manifestation of this relationship. Understanding it deeply enables you to predict when models will generalize, diagnose failures, and design better learning systems.
By the end of this page, you will understand concentration inequalities that bound the deviation between true and empirical risk, the decomposition of generalization error into approximation and estimation components, the sources of error in learning and how each can be addressed, and why simply looking at training error is never sufficient for assessing model quality.
For any fixed hypothesis $h$, the empirical risk is a sample mean:
$$\hat{R}_S(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i)$$
Each term $\ell(h(x_i), y_i)$ is an i.i.d. random variable (since $(x_i, y_i) \sim \mathcal{D}$ independently). The empirical risk is therefore a Monte Carlo estimate of the expectation:
$$\mathbb{E}[\hat{R}_S(h)] = \mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(h(x), y)] = R(h)$$
The empirical risk is an unbiased estimator of the true risk.
Being unbiased doesn't mean being accurate. The variance of $\hat{R}_S(h)$ is:
$$\text{Var}(\hat{R}_S(h)) = \frac{1}{n} \text{Var}(\ell(h(x), y)) = \frac{\sigma_h^2}{n}$$
where $\sigma_h^2 = \text{Var}_{(x,y) \sim \mathcal{D}}(\ell(h(x), y))$ depends on the hypothesis and distribution.
For bounded loss $\ell \in [0, 1]$, we have $\sigma_h^2 \leq 1/4$ (maximum variance for a bounded random variable).
The key observation: variance decreases as $1/n$. More data means a more precise estimate of the true risk.
For the standard deviation: $$\text{SD}(\hat{R}_S(h)) = \frac{\sigma_h}{\sqrt{n}} = O\left(\frac{1}{\sqrt{n}}\right)$$
The $1/\sqrt{n}$ convergence rate is a fundamental fact of statistics. Halving the estimation error requires quadrupling the sample size. This slow rate explains why big data is valuable: going from 10K to 1M samples reduces error by a factor of 10 (since $\sqrt{100} = 10$).
By the Law of Large Numbers, as $n \to \infty$: $$\hat{R}_S(h) \xrightarrow{a.s.} R(h)$$
The empirical risk converges almost surely to the true risk. Combined with the Central Limit Theorem, we get:
$$\sqrt{n}(\hat{R}_S(h) - R(h)) \xrightarrow{d} \mathcal{N}(0, \sigma_h^2)$$
For large $n$, the deviation between empirical and true risk is approximately Gaussian with standard deviation $\sigma_h / \sqrt{n}$.
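To make this concrete, here is a minimal simulation (assuming, purely for illustration, a fixed hypothesis whose 0-1 loss has true risk $R(h) = 0.3$) comparing the observed spread of $\hat{R}_S(h)$ across many independent training samples with the predicted $\sigma_h/\sqrt{n}$:

```python
import numpy as np

rng = np.random.default_rng(0)
true_risk = 0.3        # hypothetical true risk R(h) of a fixed hypothesis h
num_trials = 5000      # number of independent training samples S drawn per n

for n in [100, 1_000, 10_000, 100_000]:
    # With 0-1 loss, the number of errors on S is Binomial(n, R(h)),
    # so the empirical risk is that count divided by n.
    empirical_risks = rng.binomial(n, true_risk, size=num_trials) / n
    observed_sd = empirical_risks.std()
    predicted_sd = np.sqrt(true_risk * (1 - true_risk) / n)  # sigma_h / sqrt(n)
    print(f"n={n:>7}: observed SD {observed_sd:.5f}, predicted sigma_h/sqrt(n) {predicted_sd:.5f}")
```

Each tenfold increase in $n$ shrinks the spread by roughly $\sqrt{10} \approx 3.2$, exactly the $1/\sqrt{n}$ rate described above.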
These classical results are comforting but not sufficient for learning theory. They describe the behavior of a fixed hypothesis, but learning algorithms select hypotheses based on data. We need bounds that account for this selection.
Asymptotic results tell us what happens as $n \to \infty$, but we always have finite data. Concentration inequalities provide finite-sample bounds on how far $\hat{R}_S(h)$ can deviate from $R(h)$.
For bounded i.i.d. random variables $Z_1, \ldots, Z_n \in [a, b]$ with mean $\mu$:
$$\mathbb{P}\left[\left|\frac{1}{n}\sum_{i=1}^n Z_i - \mu\right| > \epsilon\right] \leq 2\exp\left(-\frac{2n\epsilon^2}{(b-a)^2}\right)$$
Application to Learning: If $\ell \in [0, 1]$, then for any fixed hypothesis $h$:
$$\mathbb{P}[|\hat{R}_S(h) - R(h)| > \epsilon] \leq 2e^{-2n\epsilon^2}$$
This is exponentially decreasing in both $n$ and $\epsilon^2$.
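As a sanity check, the sketch below (same hypothetical Bernoulli-loss setup with $R(h) = 0.3$) compares the actual frequency of deviations larger than $\epsilon$ with the Hoeffding bound:

```python
import numpy as np

rng = np.random.default_rng(1)
n, eps, true_risk, trials = 1_000, 0.03, 0.3, 50_000

# Empirical risks of a fixed hypothesis with true risk 0.3 under 0-1 loss.
empirical_risks = rng.binomial(n, true_risk, size=trials) / n
observed = np.mean(np.abs(empirical_risks - true_risk) > eps)
hoeffding_bound = 2 * np.exp(-2 * n * eps**2)

print(f"observed deviation frequency: {observed:.4f}")       # roughly 0.04 here
print(f"Hoeffding bound:              {hoeffding_bound:.4f}") # about 0.33
```

The bound is valid but loose in this case because Hoeffding ignores the variance of the loss; Bernstein-type bounds (see the table below) tighten it when the variance is small.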
| Inequality | Conditions | Bound on $\mathbb{P}[|\bar{Z} - \mu| > \epsilon]$ | Key Feature |
|---|---|---|---|
| Markov | $Z \geq 0$ | $\mu/(\mu + \epsilon)$ (one-sided) | Weakest, most general |
| Chebyshev | Finite variance | $\sigma^2 / (n\epsilon^2)$ | Uses variance info |
| Hoeffding | Bounded $[a,b]$ | $2e^{-2n\epsilon^2/(b-a)^2}$ | Exponential decay; ignores variance |
| Bernstein | Bounded + variance | $2e^{-n\epsilon^2/(2\sigma^2 + 2(b-a)\epsilon/3)}$ | Adapts to variance |
| McDiarmid | Bounded differences | $2e^{-2\epsilon^2/\sum_i c_i^2}$ | Functions of independent RVs |
The concentration inequality can be inverted to get a confidence interval. Setting $\delta = 2e^{-2n\epsilon^2}$ and solving for $\epsilon$:
$$\epsilon = \sqrt{\frac{\ln(2/\delta)}{2n}}$$
Result: With probability at least $1 - \delta$: $$|\hat{R}_S(h) - R(h)| \leq \sqrt{\frac{\ln(2/\delta)}{2n}}$$
For example, with $n = 10000$ samples and $\delta = 0.05$: $$|\hat{R}_S(h) - R(h)| \leq \sqrt{\frac{\ln(40)}{20000}} \approx 0.014$$
With 95% confidence, the empirical risk is within 1.4% of the true risk for any fixed hypothesis.
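The inversion is easy to compute directly; a minimal helper (the function name here is just an assumption for illustration) reproduces the 1.4% figure:

```python
import numpy as np

def hoeffding_radius(n: int, delta: float) -> float:
    """Confidence radius sqrt(ln(2/delta) / (2n)) for one fixed hypothesis, loss in [0, 1]."""
    return float(np.sqrt(np.log(2 / delta) / (2 * n)))

print(hoeffding_radius(10_000, 0.05))               # ~0.0136, the "within 1.4%" figure above
for n in [1_000, 10_000, 100_000]:
    print(n, round(hoeffding_radius(n, 0.05), 4))    # 0.0429, 0.0136, 0.0043
```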
This bound applies to a pre-specified hypothesis $h$, chosen before seeing the data. The hypothesis selected by ERM is data-dependent—it's the one that happened to minimize training error among all candidates. Applying the fixed-hypothesis bound to $h_{\text{ERM}}$ is invalid and leads to overconfident estimates of generalization.
To analyze ERM, we need bounds that hold uniformly over all hypotheses in $\mathcal{H}$. Specifically, we want to bound:
$$\sup_{h \in \mathcal{H}} |\hat{R}_S(h) - R(h)|$$
This supremum measures the worst-case deviation between empirical and true risk across the entire hypothesis class.
For finite $\mathcal{H}$ with $|\mathcal{H}| = M$, we apply the union bound:
$$\mathbb{P}\left[\sup_{h \in \mathcal{H}} |\hat{R}_S(h) - R(h)| > \epsilon\right] \leq \sum_{h \in \mathcal{H}} \mathbb{P}[|\hat{R}_S(h) - R(h)| > \epsilon] \leq M \cdot 2e^{-2n\epsilon^2}$$
Setting $\delta = 2M e^{-2n\epsilon^2}$ and inverting:
For finite hypothesis class $\mathcal{H}$ with bounded loss $\ell \in [0,1]$, with probability at least $1 - \delta$:
$$\sup_{h \in \mathcal{H}} |\hat{R}_S(h) - R(h)| \leq \sqrt{\frac{\ln(2|\mathcal{H}|/\delta)}{2n}}$$
Equivalently: for all $h \in \mathcal{H}$ simultaneously, the empirical risk is within $O(\sqrt{\log|\mathcal{H}|/n})$ of the true risk.
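The only change from the fixed-hypothesis radius is the extra $\ln|\mathcal{H}|$ inside the square root. A quick sketch (again using an illustrative helper) shows how mildly the radius grows with class size at $n = 10{,}000$ and $\delta = 0.05$:

```python
import numpy as np

def uniform_radius(n: int, num_hypotheses: int, delta: float) -> float:
    """Uniform-convergence radius sqrt(ln(2M/delta) / (2n)) for a finite class of size M."""
    return float(np.sqrt(np.log(2 * num_hypotheses / delta) / (2 * n)))

n, delta = 10_000, 0.05
for M in [1, 100, 10**6]:
    print(f"M = {M:>8}: radius = {uniform_radius(n, M, delta):.4f}")
# M = 1 recovers the fixed-hypothesis radius (~0.014); a million hypotheses only
# roughly doubles it, because the dependence on M is logarithmic.
```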
Once uniform convergence holds, ERM's generalization follows immediately. Let $h^* = \arg\min_{h \in \mathcal{H}} R(h)$ denote the best hypothesis in the class and $h_{\text{ERM}} = \arg\min_{h \in \mathcal{H}} \hat{R}_S(h)$ the empirical risk minimizer.
With probability at least $1 - \delta$:
$$R(h_{\text{ERM}}) \leq \hat{R}_S(h_{\text{ERM}}) + \sqrt{\frac{\ln(2|\mathcal{H}|/\delta)}{2n}}$$ $$\leq \hat{R}_S(h^*) + \sqrt{\frac{\ln(2|\mathcal{H}|/\delta)}{2n}}$$ (since ERM minimizes training error) $$\leq R(h^*) + 2\sqrt{\frac{\ln(2|\mathcal{H}|/\delta)}{2n}}$$ (applying uniform convergence to $h^*$)
The excess risk of ERM over the best in class is at most $O(\sqrt{\log|\mathcal{H}|/n})$.
For infinite hypothesis classes, the union bound fails ($|\mathcal{H}| = \infty$). We need more sophisticated measures of complexity:
VC Dimension: For classification, VC dimension $d$ implies uniform convergence with: $$\sup_{h \in \mathcal{H}} |\hat{R}_S(h) - R(h)| = O\left(\sqrt{\frac{d \log(n/d)}{n}}\right)$$
Rademacher Complexity $\mathfrak{R}_n(\mathcal{H})$: For general classes and losses: $$\sup_{h \in \mathcal{H}} |\hat{R}_S(h) - R(h)| \leq 2\mathfrak{R}_n(\mathcal{H}) + O\left(\sqrt{\frac{\log(1/\delta)}{n}}\right)$$
These complexity measures effectively "count" how many distinct behaviors $\mathcal{H}$ can exhibit on finite samples, even when $\mathcal{H}$ is infinite. We will explore them in subsequent modules.
The true risk of a learned hypothesis can be decomposed into interpretable components. Let $f^*$ denote the Bayes optimal predictor (the minimizer of $R$ over all measurable functions), $h^* = \arg\min_{h \in \mathcal{H}} R(h)$ the best hypothesis in the class, and $h_S$ the hypothesis returned by the learning algorithm trained on sample $S$.
The excess risk of $h_S$ over Bayes optimal admits the decomposition:
$$\underbrace{R(h_S) - R(f^*)}_{\text{Excess Risk}} = \underbrace{[R(h^*) - R(f^*)]}_{\text{Approximation Error}} + \underbrace{[R(h_S) - R(h^*)]}_{\text{Estimation Error}}$$
Approximation error and estimation error are in tension. A richer hypothesis class can approximate $f^*$ better (low approximation error) but is harder to learn from finite data (high estimation error). A simpler class is easier to learn but may not contain good approximations. Finding the right balance is the art and science of model selection.
Imagine the space of all possible functions, with Bayes optimal $f^*$ at the center (lowest risk).
Small $\mathcal{H}$: A small circle. The closest point in the circle to $f^*$ (which is $h^*$) may be far from the center. Large approximation error. But finding $h^*$ using finite data is easy—there isn't much to search.
Large $\mathcal{H}$: A huge circle. $h^*$ is very close to or coincides with $f^*$. Small approximation error. But identifying $h^*$ from finite data is hard—there are many hypotheses, and the empirical best may differ from the true best.
The optimal $\mathcal{H}$ is large enough that $h^*$ is near $f^*$, but small enough that we can reliably find $h^*$ from data. This optimal size typically grows with $n$—with more data, we can afford a richer class.
From uniform convergence, we derived that with high probability: $$R(h_{\text{ERM}}) \leq R(h^*) + 2\sqrt{\frac{\ln(2|\mathcal{H}|/\delta)}{2n}}$$
This directly bounds the estimation error: $$\text{Estimation Error} = R(h_{\text{ERM}}) - R(h^*) \leq 2\sqrt{\frac{\ln(2|\mathcal{H}|/\delta)}{2n}}$$
The estimation error decreases as $O(1/\sqrt{n})$ with more data, grows only logarithmically with the size of the hypothesis class, and depends only logarithmically on the confidence parameter $\delta$.
For various complexity measures, the estimation error takes the form:
| Complexity Measure | Estimation Error Bound |
|---|---|
| Finite class size $\lvert\mathcal{H}\rvert$ | $O\left(\sqrt{\frac{\log \lvert\mathcal{H}\rvert}{n}}\right)$ |
| VC dimension $d$ | $O\left(\sqrt{\frac{d \log(n/d)}{n}}\right)$ |
| Rademacher complexity $\mathfrak{R}_n$ | $O(\mathfrak{R}_n(\mathcal{H}))$ |
| Covering number $\mathcal{N}(\epsilon, \mathcal{H})$ | $O\left(\sqrt{\frac{\log \mathcal{N}(\epsilon, \mathcal{H})}{n}}\right)$ |
In all cases, the estimation error decreases with more data and increases with hypothesis class complexity. This universal pattern reflects the core truth: more complex models are harder to learn accurately.
We've focused on high-probability bounds: with probability $\geq 1 - \delta$, the error is at most...
Alternatively, we can bound expected error: $$\mathbb{E}_S[R(h_S) - R(h^*)]$$
These are related: if the error is at most $\epsilon(\delta)$ with probability $1-\delta$, and is bounded by the worst case on the remaining probability-$\delta$ event, then: $$\mathbb{E}[\text{error}] \leq \epsilon(\delta) + \delta \cdot \text{(worst case)}$$
With bounded loss $\ell \in [0,1]$ and optimizing $\delta$, expected bounds are often of the same order as high-probability bounds (sometimes with better constants).
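As a short worked example of this conversion (under the assumptions already in force: $\ell \in [0,1]$ and finite $\mathcal{H}$), combine the high-probability ERM bound with the trivial bound of 1 on the failure event:
$$\mathbb{E}_S\big[R(h_{\text{ERM}}) - R(h^*)\big] \leq 2\sqrt{\frac{\ln(2|\mathcal{H}|/\delta)}{2n}} + \delta \cdot 1$$
Choosing, say, $\delta = 1/\sqrt{n}$ gives
$$\mathbb{E}_S\big[R(h_{\text{ERM}}) - R(h^*)\big] \leq 2\sqrt{\frac{\ln(2|\mathcal{H}|\sqrt{n})}{2n}} + \frac{1}{\sqrt{n}} = O\!\left(\sqrt{\frac{\log(|\mathcal{H}|\, n)}{n}}\right),$$
the same order as the high-probability bound, up to logarithmic factors.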
These bounds justify common practices: (1) Use more data when possible (reduces estimation error). (2) Use simpler models when data is scarce (controls estimation error). (3) Use complex models only when data is abundant (can afford the estimation error). (4) Regularization effectively shrinks the hypothesis class, trading off approximation for estimation.
The generalization gap (or generalization error) is the difference between test and training performance:
$$\text{Generalization Gap} = R(h_S) - \hat{R}_S(h_S)$$
For ERM, where $h_S = h_{\text{ERM}}$, this gap measures how much better the model looks on training data compared to fresh test data.
Key insight: If ERM achieves zero training error ($\hat{R}_S(h_{\text{ERM}}) = 0$), then the generalization gap equals the true risk. Low training error means nothing if the gap is large.
From uniform convergence: $$\mathbb{P}\left[\sup_{h \in \mathcal{H}} |R(h) - \hat{R}_S(h)| \leq \epsilon\right] \geq 1 - \delta$$
This directly bounds the generalization gap for any hypothesis (including the one selected by ERM): $$|R(h_{\text{ERM}}) - \hat{R}_S(h_{\text{ERM}})| \leq \epsilon$$
with the same $\epsilon = O(\sqrt{\log|\mathcal{H}|/n})$.
The generalization gap is typically positive for ERM: $R(h_{\text{ERM}}) > \hat{R}_S(h_{\text{ERM}})$. Why?
ERM selects the hypothesis that happens to minimize training error. Among all hypotheses in $\mathcal{H}$, we pick the one that was "luckiest" on the training data. This selection bias means training error systematically underestimates test error.
Analogy: Imagine measuring the heights of 100 people and reporting only the tallest. That figure overestimates the average height—you've selected the extreme. Similarly, ERM selects the hypothesis with the most favorable training error, and that training error therefore tends to overstate the hypothesis's true quality.
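A minimal simulation of this selection effect (assuming a deliberately extreme class in which every hypothesis has true risk exactly 0.5 and hypotheses err independently) makes the optimism visible:

```python
import numpy as np

rng = np.random.default_rng(2)
n, M, trials = 200, 1_000, 500   # sample size, class size, number of repetitions

gaps = []
for _ in range(trials):
    # Assumed worst case: every hypothesis has true risk 0.5 (its 0-1 losses are
    # fair coin flips), and the M hypotheses err independently of one another.
    training_errors = rng.binomial(n, 0.5, size=M) / n   # \hat{R}_S(h) for each h
    erm_training_error = training_errors.min()           # ERM picks the "luckiest" one
    gaps.append(0.5 - erm_training_error)                 # R(h_ERM) - \hat{R}_S(h_ERM)

print(f"mean generalization gap of ERM: {np.mean(gaps):.3f}")   # clearly positive
```

Every hypothesis is useless, yet the selected one looks substantially better than chance on the training data; the gap here is pure selection bias.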
The generalization gap worsens with data snooping—using test data to inform modeling choices. If you try 100 models and select the best on a 'test' set, that set is no longer a valid test set; it has become part of training. The more you adapt to a dataset, the larger the gap between apparent and true performance. This is why proper train/validation/test splits and cross-validation are essential.
The systematic underestimation of test error by training error is called optimism:
$$\text{Optimism} = \mathbb{E}_{S}[R(h_S) - \hat{R}_S(h_S)] \geq 0$$
For linear regression with squared loss, there's an elegant formula for optimism: $$\text{Optimism} \approx \frac{2p\sigma^2}{n}$$
where $p$ is the number of parameters and $\sigma^2$ is the noise variance. This motivates the Akaike Information Criterion (AIC) and Mallows' $C_p$ for model selection: add an estimate of the optimism, roughly $2p\hat{\sigma}^2/n$, to the training error to estimate the test error.
In general, more complex models (larger $p$ or hypothesis class) have higher optimism.
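A small simulation (assuming a fixed-design linear model with Gaussian noise, the in-sample setting in which the $2p\sigma^2/n$ formula holds) checks the formula numerically:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma, trials = 200, 10, 1.0, 2_000   # samples, parameters, noise SD, repetitions

optimism = []
for _ in range(trials):
    X = rng.normal(size=(n, p))
    beta = rng.normal(size=p)
    y = X @ beta + sigma * rng.normal(size=n)
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # ordinary least squares fit
    train_mse = np.mean((y - X @ beta_hat) ** 2)
    # In-sample test error: same inputs, fresh noise on the labels.
    y_new = X @ beta + sigma * rng.normal(size=n)
    test_mse = np.mean((y_new - X @ beta_hat) ** 2)
    optimism.append(test_mse - train_mse)

print(f"simulated optimism:        {np.mean(optimism):.4f}")
print(f"theoretical 2*p*sigma^2/n: {2 * p * sigma**2 / n:.4f}")   # 0.1000
```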
The theory makes clear: training error is systematically optimistic. A model with 99% training accuracy could have 60% test accuracy if the generalization gap is large.
This is why machine learning practice emphasizes held-out evaluation: performance measured on data that played no role in fitting or selecting the model (via proper train/validation/test splits or cross-validation) is the only honest estimate of true risk.
Choosing the right model complexity requires balancing approximation and estimation error:
Too simple: the hypothesis class cannot represent good predictors, so approximation error dominates; both training and test error are high (underfitting).
Too complex: the class fits noise in the training sample, so estimation error dominates; training error is low but the generalization gap is large (overfitting).
Just right: approximation and estimation error are balanced, minimizing the total excess risk.
The optimal complexity increases with sample size. With more data, we can afford more complex models.
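The sketch below illustrates the tradeoff on a toy polynomial regression problem (the target $\sin(3x)$ plus noise is purely an assumption for this example): training error falls monotonically with degree, while test error traces the characteristic U-shape.

```python
import numpy as np

rng = np.random.default_rng(4)

def make_data(num_points):
    x = rng.uniform(-1, 1, num_points)
    y = np.sin(3 * x) + 0.3 * rng.normal(size=num_points)   # assumed target + noise
    return x, y

x_train, y_train = make_data(50)       # scarce training data
x_test, y_test = make_data(10_000)     # large test set approximates true risk

for degree in [1, 3, 5, 9, 15]:
    coeffs = np.polyfit(x_train, y_train, degree)            # ERM over degree-d polynomials
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:>2}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

Exact numbers vary with the random seed, but low degrees underfit (both errors high), high degrees overfit (training error keeps falling while test error rises), and an intermediate degree balances the two.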
Modern deep learning seems to violate classical learning theory: models with millions of parameters (very high complexity) generalize well even when they can perfectly fit training data. This 'double descent' phenomenon and the role of implicit regularization are active research areas. The classical theory isn't wrong—it's incomplete for explaining these phenomena.
We have deeply explored the relationship between what we can measure (training error) and what we care about (test error). Let us consolidate the key insights: (1) for a fixed hypothesis, empirical risk is an unbiased estimate of true risk with $O(1/\sqrt{n})$ fluctuations; (2) concentration inequalities make this precise with finite-sample, high-probability bounds; (3) because ERM selects a hypothesis using the data, we need uniform convergence, which costs an extra $\sqrt{\log|\mathcal{H}|}$ factor; (4) excess risk decomposes into approximation error plus estimation error, which are in tension; (5) training error is systematically optimistic, so held-out evaluation is essential.
We've established the fundamental relationship between training and test error. The next page introduces the Generalization Gap in greater depth, exploring how this gap varies across different learning scenarios and what conditions guarantee small gaps.
This will prepare us for the No Free Lunch theorem, which establishes fundamental limits on learning, and the bias-variance tradeoff, which formalizes the tension between approximation and estimation error.
You now understand the precise mathematical relationship between true and empirical risk. You can derive concentration bounds, recognize the sources of error in learning, and appreciate why the generalization gap is central to machine learning theory. This foundation enables rigorous analysis of any learning algorithm.