We have established that machine learning seeks to minimize true risk $R(h)$—the expected loss on new data—but can only directly minimize empirical risk $\hat{R}_S(h)$—the average loss on training data.
This raises the central question of statistical learning:
When does low training error imply low test error?
Or more precisely: What is the relationship between $R(h)$ and $\hat{R}_S(h)$, and how can we bound the difference?
This question is not merely theoretical. Every time you train a model and evaluate on a test set, you are witnessing the empirical manifestation of this relationship. Understanding it deeply enables you to predict when models will generalize, diagnose failures, and design better learning systems.
By the end of this page, you will understand concentration inequalities that bound the deviation between true and empirical risk, the decomposition of generalization error into approximation and estimation components, the sources of error in learning and how each can be addressed, and why simply looking at training error is never sufficient for assessing model quality.
For any fixed hypothesis $h$, the empirical risk is a sample mean:
$$\hat{R}_S(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i)$$
Each term $\ell(h(x_i), y_i)$ is an i.i.d. random variable (since $(x_i, y_i) \sim \mathcal{D}$ independently). The empirical risk is therefore a Monte Carlo estimate of the expectation:
$$\mathbb{E}[\hat{R}_S(h)] = \mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(h(x), y)] = R(h)$$
The empirical risk is an unbiased estimator of the true risk.
Being unbiased doesn't mean being accurate. The variance of $\hat{R}_S(h)$ is:
$$\text{Var}(\hat{R}_S(h)) = \frac{1}{n} \text{Var}(\ell(h(x), y)) = \frac{\sigma_h^2}{n}$$
where $\sigma_h^2 = \text{Var}_{(x,y) \sim \mathcal{D}}(\ell(h(x), y))$ depends on the hypothesis and distribution.
For bounded loss $\ell \in [0, 1]$, we have $\sigma_h^2 \leq 1/4$ (maximum variance for a bounded random variable).
The key observation: variance decreases as $1/n$. More data means a more precise estimate of the true risk.
For the standard deviation: $$\text{SD}(\hat{R}_S(h)) = \frac{\sigma_h}{\sqrt{n}} = O\left(\frac{1}{\sqrt{n}}\right)$$
The $1/\sqrt{n}$ convergence rate is a fundamental fact of statistics. Halving the estimation error requires quadrupling the sample size. This slow rate explains why big data is valuable: going from 10K to 1M samples reduces error by a factor of 10 (since $\sqrt{100} = 10$).
By the Law of Large Numbers, as $n \to \infty$: $$\hat{R}_S(h) \xrightarrow{a.s.} R(h)$$
The empirical risk converges almost surely to the true risk. Combined with the Central Limit Theorem, we get:
$$\sqrt{n}(\hat{R}_S(h) - R(h)) \xrightarrow{d} \mathcal{N}(0, \sigma_h^2)$$
For large $n$, the deviation between empirical and true risk is approximately Gaussian with standard deviation $\sigma_h / \sqrt{n}$.
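To make this concrete, here is a minimal simulation (assuming, purely for illustration, a fixed hypothesis whose 0-1 loss has true risk $R(h) = 0.3$) comparing the observed spread of $\hat{R}_S(h)$ across many independent training samples with the predicted $\sigma_h/\sqrt{n}$:

```python
import numpy as np

rng = np.random.default_rng(0)
true_risk = 0.3        # hypothetical true risk R(h) of a fixed hypothesis h
num_trials = 5000      # number of independent training samples S drawn per n

for n in [100, 1_000, 10_000, 100_000]:
    # With 0-1 loss, the number of errors on S is Binomial(n, R(h)),
    # so the empirical risk is that count divided by n.
    empirical_risks = rng.binomial(n, true_risk, size=num_trials) / n
    observed_sd = empirical_risks.std()
    predicted_sd = np.sqrt(true_risk * (1 - true_risk) / n)  # sigma_h / sqrt(n)
    print(f"n={n:>7}: observed SD {observed_sd:.5f}, predicted sigma_h/sqrt(n) {predicted_sd:.5f}")
```

Each tenfold increase in $n$ shrinks the spread by roughly $\sqrt{10} \approx 3.2$, exactly the $1/\sqrt{n}$ rate described above.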
These classical results are comforting but not sufficient for learning theory. They describe the behavior of a fixed hypothesis, but learning algorithms select hypotheses based on data. We need bounds that account for this selection.
Asymptotic results tell us what happens as $n \to \infty$, but we always have finite data. Concentration inequalities provide finite-sample bounds on how far $\hat{R}_S(h)$ can deviate from $R(h)$.
For bounded i.i.d. random variables $Z_1, \ldots, Z_n \in [a, b]$ with mean $\mu$:
$$\mathbb{P}\left[\left|\frac{1}{n}\sum_{i=1}^n Z_i - \mu\right| > \epsilon\right] \leq 2\exp\left(-\frac{2n\epsilon^2}{(b-a)^2}\right)$$
Application to Learning: If $\ell \in [0, 1]$, then for any fixed hypothesis $h$:
$$\mathbb{P}[|\hat{R}_S(h) - R(h)| > \epsilon] \leq 2e^{-2n\epsilon^2}$$
This is exponentially decreasing in both $n$ and $\epsilon^2$.
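As a sanity check, the sketch below (same hypothetical Bernoulli-loss setup with $R(h) = 0.3$) compares the actual frequency of deviations larger than $\epsilon$ with the Hoeffding bound:

```python
import numpy as np

rng = np.random.default_rng(1)
n, eps, true_risk, trials = 1_000, 0.03, 0.3, 50_000

# Empirical risks of a fixed hypothesis with true risk 0.3 under 0-1 loss.
empirical_risks = rng.binomial(n, true_risk, size=trials) / n
observed = np.mean(np.abs(empirical_risks - true_risk) > eps)
hoeffding_bound = 2 * np.exp(-2 * n * eps**2)

print(f"observed deviation frequency: {observed:.4f}")       # roughly 0.04 here
print(f"Hoeffding bound:              {hoeffding_bound:.4f}") # about 0.33
```

The bound is valid but loose in this case because Hoeffding ignores the variance of the loss; Bernstein-type bounds (see the table below) tighten it when the variance is small.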
| Inequality | Conditions | Bound on $\mathbb{P}[|\bar{Z} - \mu| > \epsilon]$ | Key Feature |
|---|---|---|---|
| Markov | $Z \geq 0$ | $\mu/(\mu + \epsilon)$ (one-sided) | Weakest, most general |
| Chebyshev | Finite variance | $\sigma^2 / (n\epsilon^2)$ | Uses variance info |
| Hoeffding | Bounded $[a,b]$ | $2e^{-2n\epsilon^2/(b-a)^2}$ | Exponential decay; ignores variance |
| Bernstein | Bounded + variance | $2e^{-n\epsilon^2/(2\sigma^2 + 2(b-a)\epsilon/3)}$ | Adapts to variance |
| McDiarmid | Bounded differences | $2e^{-2\epsilon^2/\sum_i c_i^2}$ | Functions of independent RVs |
The concentration inequality can be inverted to get a confidence interval. Setting $\delta = 2e^{-2n\epsilon^2}$ and solving for $\epsilon$:
$$\epsilon = \sqrt{\frac{\ln(2/\delta)}{2n}}$$
Result: With probability at least $1 - \delta$: $$|\hat{R}_S(h) - R(h)| \leq \sqrt{\frac{\ln(2/\delta)}{2n}}$$
For example, with $n = 10000$ samples and $\delta = 0.05$: $$|\hat{R}_S(h) - R(h)| \leq \sqrt{\frac{\ln(40)}{20000}} \approx 0.014$$
With 95% confidence, the empirical risk is within 1.4% of the true risk for any fixed hypothesis.
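The inversion is easy to compute directly; a minimal helper (the function name here is just an assumption for illustration) reproduces the 1.4% figure:

```python
import numpy as np

def hoeffding_radius(n: int, delta: float) -> float:
    """Confidence radius sqrt(ln(2/delta) / (2n)) for one fixed hypothesis, loss in [0, 1]."""
    return float(np.sqrt(np.log(2 / delta) / (2 * n)))

print(hoeffding_radius(10_000, 0.05))               # ~0.0136, the "within 1.4%" figure above
for n in [1_000, 10_000, 100_000]:
    print(n, round(hoeffding_radius(n, 0.05), 4))    # 0.0429, 0.0136, 0.0043
```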
This bound applies to a pre-specified hypothesis $h$, chosen before seeing the data. The hypothesis selected by ERM is data-dependent—it's the one that happened to minimize training error among all candidates. Applying the fixed-hypothesis bound to $h_{\text{ERM}}$ is invalid and leads to overconfident estimates of generalization.
To analyze ERM, we need bounds that hold uniformly over all hypotheses in $\mathcal{H}$. Specifically, we want to bound:
$$\sup_{h \in \mathcal{H}} |\hat{R}_S(h) - R(h)|$$
This supremum measures the worst-case deviation between empirical and true risk across the entire hypothesis class.
For finite $\mathcal{H}$ with $|\mathcal{H}| = M$, we apply the union bound:
$$\mathbb{P}\left[\sup_{h \in \mathcal{H}} |\hat{R}_S(h) - R(h)| > \epsilon\right] \leq \sum_{h \in \mathcal{H}} \mathbb{P}[|\hat{R}_S(h) - R(h)| > \epsilon] \leq M \cdot 2e^{-2n\epsilon^2}$$
Setting $\delta = 2M e^{-2n\epsilon^2}$ and inverting:
For finite hypothesis class $\mathcal{H}$ with bounded loss $\ell \in [0,1]$, with probability at least $1 - \delta$:
$$\sup_{h \in \mathcal{H}} |\hat{R}_S(h) - R(h)| \leq \sqrt{\frac{\ln(2|\mathcal{H}|/\delta)}{2n}}$$
Equivalently: for all $h \in \mathcal{H}$ simultaneously, the empirical risk is within $O(\sqrt{\log|\mathcal{H}|/n})$ of the true risk.
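The only change from the fixed-hypothesis radius is the extra $\ln|\mathcal{H}|$ inside the square root. A quick sketch (again using an illustrative helper) shows how mildly the radius grows with class size at $n = 10{,}000$ and $\delta = 0.05$:

```python
import numpy as np

def uniform_radius(n: int, num_hypotheses: int, delta: float) -> float:
    """Uniform-convergence radius sqrt(ln(2M/delta) / (2n)) for a finite class of size M."""
    return float(np.sqrt(np.log(2 * num_hypotheses / delta) / (2 * n)))

n, delta = 10_000, 0.05
for M in [1, 100, 10**6]:
    print(f"M = {M:>8}: radius = {uniform_radius(n, M, delta):.4f}")
# M = 1 recovers the fixed-hypothesis radius (~0.014); a million hypotheses only
# roughly doubles it, because the dependence on M is logarithmic.
```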
Once uniform convergence holds, ERM's generalization follows immediately. Let $h^* = \arg\min_{h \in \mathcal{H}} R(h)$ denote the best hypothesis in the class and $h_{\text{ERM}} = \arg\min_{h \in \mathcal{H}} \hat{R}_S(h)$ the empirical risk minimizer.
With probability at least $1 - \delta$:
$$R(h_{\text{ERM}}) \leq \hat{R}_S(h_{\text{ERM}}) + \sqrt{\frac{\ln(2|\mathcal{H}|/\delta)}{2n}}$$ $$\leq \hat{R}_S(h^*) + \sqrt{\frac{\ln(2|\mathcal{H}|/\delta)}{2n}}$$ (since ERM minimizes training error) $$\leq R(h^*) + 2\sqrt{\frac{\ln(2|\mathcal{H}|/\delta)}{2n}}$$ (applying uniform convergence to $h^*$)
The excess risk of ERM over the best in class is at most $O(\sqrt{\log|\mathcal{H}|/n})$.
For infinite hypothesis classes, the union bound fails ($|\mathcal{H}| = \infty$). We need more sophisticated measures of complexity:
VC Dimension: For classification, VC dimension $d$ implies uniform convergence with: $$\sup_{h \in \mathcal{H}} |\hat{R}_S(h) - R(h)| = O\left(\sqrt{\frac{d \log(n/d)}{n}}\right)$$
Rademacher Complexity $\mathfrak{R}_n(\mathcal{H})$: For general classes and losses: $$\sup_{h \in \mathcal{H}} |\hat{R}_S(h) - R(h)| \leq 2\mathfrak{R}_n(\mathcal{H}) + O\left(\sqrt{\frac{\log(1/\delta)}{n}}\right)$$
These complexity measures effectively "count" how many distinct behaviors $\mathcal{H}$ can exhibit on finite samples, even when $\mathcal{H}$ is infinite. We will explore them in subsequent modules.
The true risk of a learned hypothesis can be decomposed into interpretable components. Let $f^*$ denote the Bayes optimal predictor (the minimizer of $R$ over all measurable functions), $h^* = \arg\min_{h \in \mathcal{H}} R(h)$ the best hypothesis in the class, and $h_S$ the hypothesis returned by the learning algorithm trained on sample $S$.
The excess risk of $h_S$ over Bayes optimal admits the decomposition:
$$\underbrace{R(h_S) - R(f^*)}_{\text{Excess Risk}} = \underbrace{[R(h^*) - R(f^*)]}_{\text{Approximation Error}} + \underbrace{[R(h_S) - R(h^*)]}_{\text{Estimation Error}}$$
Approximation error and estimation error are in tension. A richer hypothesis class can approximate $f^*$ better (low approximation error) but is harder to learn from finite data (high estimation error). A simpler class is easier to learn but may not contain good approximations. Finding the right balance is the art and science of model selection.
Imagine the space of all possible functions, with Bayes optimal $f^*$ at the center (lowest risk).
Small $\mathcal{H}$: A small circle. The closest point in the circle to $f^*$ (which is $h^*$) may be far from the center. Large approximation error. But finding $h^*$ using finite data is easy—there isn't much to search.
Large $\mathcal{H}$: A huge circle. $h^*$ is very close to or coincides with $f^*$. Small approximation error. But identifying $h^*$ from finite data is hard—there are many hypotheses, and the empirical best may differ from the true best.
The optimal $\mathcal{H}$ is large enough that $h^*$ is near $f^*$, but small enough that we can reliably find $h^*$ from data. This optimal size typically grows with $n$—with more data, we can afford a richer class.
From uniform convergence, we derived that with high probability: $$R(h_{\text{ERM}}) \leq R(h^*) + 2\sqrt{\frac{\ln(2|\mathcal{H}|/\delta)}{2n}}$$
This directly bounds the estimation error: $$\text{Estimation Error} = R(h_{\text{ERM}}) - R(h^*) \leq 2\sqrt{\frac{\ln(2|\mathcal{H}|/\delta)}{2n}}$$
The estimation error decreases as $O(1/\sqrt{n})$ with more data, grows only logarithmically with the size of the hypothesis class, and depends only logarithmically on the confidence parameter $\delta$.
For various complexity measures, the estimation error takes the form:
| Complexity Measure | Estimation Error Bound |
|---|---|
| Finite class size $\lvert\mathcal{H}\rvert$ | $O\left(\sqrt{\frac{\log \lvert\mathcal{H}\rvert}{n}}\right)$ |
| VC dimension $d$ | $O\left(\sqrt{\frac{d \log(n/d)}{n}}\right)$ |
| Rademacher complexity $\mathfrak{R}_n$ | $O(\mathfrak{R}_n(\mathcal{H}))$ |
| Covering number $\mathcal{N}(\epsilon, \mathcal{H})$ | $O\left(\sqrt{\frac{\log \mathcal{N}(\epsilon, \mathcal{H})}{n}}\right)$ |
In all cases, the estimation error decreases with more data and increases with hypothesis class complexity. This universal pattern reflects the core truth: more complex models are harder to learn accurately.
We've focused on high-probability bounds: with probability $\geq 1 - \delta$, the error is at most...
Alternatively, we can bound expected error: $$\mathbb{E}_S[R(h_S) - R(h^*)]$$
These are related: if the error is at most $\epsilon(\delta)$ with probability $1-\delta$, and is bounded by the worst case on the remaining probability-$\delta$ event, then: $$\mathbb{E}[\text{error}] \leq \epsilon(\delta) + \delta \cdot \text{(worst case)}$$
With bounded loss $\ell \in [0,1]$ and optimizing $\delta$, expected bounds are often of the same order as high-probability bounds (sometimes with better constants).
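As a short worked example of this conversion (under the assumptions already in force: $\ell \in [0,1]$ and finite $\mathcal{H}$), combine the high-probability ERM bound with the trivial bound of 1 on the failure event:
$$\mathbb{E}_S\big[R(h_{\text{ERM}}) - R(h^*)\big] \leq 2\sqrt{\frac{\ln(2|\mathcal{H}|/\delta)}{2n}} + \delta \cdot 1$$
Choosing, say, $\delta = 1/\sqrt{n}$ gives
$$\mathbb{E}_S\big[R(h_{\text{ERM}}) - R(h^*)\big] \leq 2\sqrt{\frac{\ln(2|\mathcal{H}|\sqrt{n})}{2n}} + \frac{1}{\sqrt{n}} = O\!\left(\sqrt{\frac{\log(|\mathcal{H}|\, n)}{n}}\right),$$
the same order as the high-probability bound, up to logarithmic factors.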
These bounds justify common practices: (1) Use more data when possible (reduces estimation error). (2) Use simpler models when data is scarce (controls estimation error). (3) Use complex models only when data is abundant (can afford the estimation error). (4) Regularization effectively shrinks the hypothesis class, trading off approximation for estimation.
The generalization gap (or generalization error) is the difference between test and training performance:
$$\text{Generalization Gap} = R(h_S) - \hat{R}_S(h_S)$$
For ERM, where $h_S = h_{\text{ERM}}$, this gap measures how much better the model looks on training data compared to fresh test data.
Key insight: If ERM achieves zero training error ($\hat{R}_S(h_{\text{ERM}}) = 0$), then the generalization gap equals the true risk. Low training error means nothing if the gap is large.
From uniform convergence: $$\mathbb{P}\left[\sup_{h \in \mathcal{H}} |R(h) - \hat{R}_S(h)| \leq \epsilon\right] \geq 1 - \delta$$
This directly bounds the generalization gap for any hypothesis (including the one selected by ERM): $$|R(h_{\text{ERM}}) - \hat{R}_S(h_{\text{ERM}})| \leq \epsilon$$
with the same $\epsilon = O(\sqrt{\log|\mathcal{H}|/n})$.
The generalization gap is typically positive for ERM: $R(h_{\text{ERM}}) > \hat{R}_S(h_{\text{ERM}})$. Why?
ERM selects the hypothesis that happens to minimize training error. Among all hypotheses in $\mathcal{H}$, we pick the one that was "luckiest" on the training data. This selection bias means training error systematically underestimates test error.
Analogy: Imagine measuring the heights of 100 people and reporting only the tallest. That figure overestimates the average height—you've selected the extreme. Similarly, ERM selects the hypothesis with the most favorable training error, and that training error therefore tends to overstate the hypothesis's true quality.
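A minimal simulation of this selection effect (assuming a deliberately extreme class in which every hypothesis has true risk exactly 0.5 and hypotheses err independently) makes the optimism visible:

```python
import numpy as np

rng = np.random.default_rng(2)
n, M, trials = 200, 1_000, 500   # sample size, class size, number of repetitions

gaps = []
for _ in range(trials):
    # Assumed worst case: every hypothesis has true risk 0.5 (its 0-1 losses are
    # fair coin flips), and the M hypotheses err independently of one another.
    training_errors = rng.binomial(n, 0.5, size=M) / n   # \hat{R}_S(h) for each h
    erm_training_error = training_errors.min()           # ERM picks the "luckiest" one
    gaps.append(0.5 - erm_training_error)                 # R(h_ERM) - \hat{R}_S(h_ERM)

print(f"mean generalization gap of ERM: {np.mean(gaps):.3f}")   # clearly positive
```

Every hypothesis is useless, yet the selected one looks substantially better than chance on the training data; the gap here is pure selection bias.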
The generalization gap worsens with data snooping—using test data to inform modeling choices. If you try 100 models and select the best on a 'test' set, that set is no longer a valid test set; it has become part of training. The more you adapt to a dataset, the larger the gap between apparent and true performance. This is why proper train/validation/test splits and cross-validation are essential.
The systematic underestimation of test error by training error is called optimism:
$$\text{Optimism} = \mathbb{E}_{S}[R(h_S) - \hat{R}_S(h_S)] \geq 0$$
For linear regression with squared loss, there's an elegant formula for optimism: $$\text{Optimism} \approx \frac{2p\sigma^2}{n}$$
where $p$ is the number of parameters and $\sigma^2$ is the noise variance. This motivates the Akaike Information Criterion (AIC) and Mallows' $C_p$ for model selection: add an estimate of the optimism, roughly $2p\hat{\sigma}^2/n$, to the training error to estimate the test error.
In general, more complex models (larger $p$ or hypothesis class) have higher optimism.
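A small simulation (assuming a fixed-design linear model with Gaussian noise, the in-sample setting in which the $2p\sigma^2/n$ formula holds) checks the formula numerically:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma, trials = 200, 10, 1.0, 2_000   # samples, parameters, noise SD, repetitions

optimism = []
for _ in range(trials):
    X = rng.normal(size=(n, p))
    beta = rng.normal(size=p)
    y = X @ beta + sigma * rng.normal(size=n)
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # ordinary least squares fit
    train_mse = np.mean((y - X @ beta_hat) ** 2)
    # In-sample test error: same inputs, fresh noise on the labels.
    y_new = X @ beta + sigma * rng.normal(size=n)
    test_mse = np.mean((y_new - X @ beta_hat) ** 2)
    optimism.append(test_mse - train_mse)

print(f"simulated optimism:        {np.mean(optimism):.4f}")
print(f"theoretical 2*p*sigma^2/n: {2 * p * sigma**2 / n:.4f}")   # 0.1000
```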
The theory makes clear: training error is systematically optimistic. A model with 99% training accuracy could have 60% test accuracy if the generalization gap is large.
This is why machine learning practice emphasizes held-out evaluation: performance measured on data that played no role in fitting or selecting the model (via proper train/validation/test splits or cross-validation) is the only honest estimate of true risk.
Choosing the right model complexity requires balancing approximation and estimation error:
Too simple: the hypothesis class cannot represent good predictors, so approximation error dominates; both training and test error are high (underfitting).
Too complex: the class fits noise in the training sample, so estimation error dominates; training error is low but the generalization gap is large (overfitting).
Just right: approximation and estimation error are balanced, minimizing the total excess risk.
The optimal complexity increases with sample size. With more data, we can afford more complex models.
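The sketch below illustrates the tradeoff on a toy polynomial regression problem (the target $\sin(3x)$ plus noise is purely an assumption for this example): training error falls monotonically with degree, while test error traces the characteristic U-shape.

```python
import numpy as np

rng = np.random.default_rng(4)

def make_data(num_points):
    x = rng.uniform(-1, 1, num_points)
    y = np.sin(3 * x) + 0.3 * rng.normal(size=num_points)   # assumed target + noise
    return x, y

x_train, y_train = make_data(50)       # scarce training data
x_test, y_test = make_data(10_000)     # large test set approximates true risk

for degree in [1, 3, 5, 9, 15]:
    coeffs = np.polyfit(x_train, y_train, degree)            # ERM over degree-d polynomials
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:>2}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

Exact numbers vary with the random seed, but low degrees underfit (both errors high), high degrees overfit (training error keeps falling while test error rises), and an intermediate degree balances the two.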
Modern deep learning seems to violate classical learning theory: models with millions of parameters (very high complexity) generalize well even when they can perfectly fit training data. This 'double descent' phenomenon and the role of implicit regularization are active research areas. The classical theory isn't wrong—it's incomplete for explaining these phenomena.
We have deeply explored the relationship between what we can measure (training error) and what we care about (test error). Let us consolidate the key insights: (1) for a fixed hypothesis, empirical risk is an unbiased estimate of true risk with $O(1/\sqrt{n})$ fluctuations; (2) concentration inequalities make this precise with finite-sample, high-probability bounds; (3) because ERM selects a hypothesis using the data, we need uniform convergence, which costs an extra $\sqrt{\log|\mathcal{H}|}$ factor; (4) excess risk decomposes into approximation error plus estimation error, which are in tension; (5) training error is systematically optimistic, so held-out evaluation is essential.
We've established the fundamental relationship between training and test error. The next page introduces the Generalization Gap in greater depth, exploring how this gap varies across different learning scenarios and what conditions guarantee small gaps.
This will prepare us for the No Free Lunch theorem, which establishes fundamental limits on learning, and the bias-variance tradeoff, which formalizes the tension between approximation and estimation error.
You now understand the precise mathematical relationship between true and empirical risk. You can derive concentration bounds, recognize the sources of error in learning, and appreciate why the generalization gap is central to machine learning theory. This foundation enables rigorous analysis of any learning algorithm.