In the previous section, we established the formal learning framework: an unknown distribution $\mathcal{D}$, a hypothesis class $\mathcal{H}$, and the goal of finding $h \in \mathcal{H}$ with low true risk $R(h)$. We identified the fundamental challenge: we cannot compute $R(h)$ because $\mathcal{D}$ is unknown.
So what can we do?
The most natural idea in learning—so natural it may seem obvious—is to minimize what we can measure: the error on the training data. This principle, called Empirical Risk Minimization (ERM), underlies virtually every supervised learning algorithm.
But as we'll discover, ERM's apparent simplicity conceals deep questions. When does minimizing training error guarantee low test error? When does it catastrophically fail? What conditions must hold for ERM to succeed? These questions are at the heart of statistical learning theory.
By the end of this page, you will understand ERM as a formal principle, know its theoretical guarantees for finite hypothesis classes, appreciate why ERM can fail dramatically for complex hypothesis classes, and understand the tension between approximation error (related to bias) and estimation error (related to variance) that ERM must balance.
Recall that given a training set $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ and a loss function $\ell$, the empirical risk of a hypothesis $h$ is:
$$\hat{R}_S(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i)$$
Definition (Empirical Risk Minimization): The ERM principle selects the hypothesis that minimizes empirical risk:
$$h_{\text{ERM}} = \arg\min_{h \in \mathcal{H}} \hat{R}_S(h)$$
In words: among all hypotheses in $\mathcal{H}$, choose the one that makes the fewest mistakes (or has the lowest loss) on the training data.
The argmin may not be unique—multiple hypotheses might achieve the same minimal training error. In such cases, ERM outputs one of the minimizers (which one depends on the implementation). Some analyses consider 'any ERM hypothesis' to handle this; others specify tie-breaking rules.
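To make the principle concrete, here is a minimal sketch of ERM over a small finite class of threshold classifiers on the line, using synthetic data (the setup, names, and parameter values are illustrative assumptions, not part of the formal development):

```python
import numpy as np

# Minimal ERM sketch for a finite hypothesis class: threshold classifiers
# h_t(x) = 1 if x >= t else 0, for a fixed grid of thresholds t.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 1, size=n)
y = (x >= 0.6).astype(int)                    # true labels come from threshold 0.6
y = np.where(rng.random(n) < 0.1, 1 - y, y)   # 10% label noise

thresholds = np.linspace(0, 1, 101)           # the finite class H, |H| = 101

def empirical_risk(t, x, y):
    """Average 0-1 loss of the threshold classifier h_t on the sample."""
    preds = (x >= t).astype(int)
    return np.mean(preds != y)

# ERM: pick the hypothesis with the smallest training error.
risks = np.array([empirical_risk(t, x, y) for t in thresholds])
t_erm = thresholds[np.argmin(risks)]
print(f"ERM threshold: {t_erm:.2f}, training error: {risks.min():.3f}")
```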
The intuition behind ERM is statistical: the empirical risk $\hat{R}_S(h)$ is a sample average that estimates the true risk $R(h)$. By the law of large numbers:
$$\hat{R}_S(h) \xrightarrow{a.s.} R(h) \quad \text{as } n \to \infty$$
For any fixed hypothesis $h$, the empirical risk converges to the true risk as sample size grows. Therefore, minimizing the empirical risk seems like a reasonable proxy for minimizing the true risk.
Moreover, ERM is all we can do in a certain sense: without additional assumptions or side information, the training data is our only window into the distribution. Any learning algorithm that doesn't consider the training loss is ignoring available evidence.
Many algorithms you already know are instances of ERM:
| Algorithm | Hypothesis Class $\mathcal{H}$ | Loss Function | ERM Procedure |
|---|---|---|---|
| Least Squares Regression | Linear functions $\{x \mapsto w^\top x : w \in \mathbb{R}^d\}$ | Squared loss $(\hat{y} - y)^2$ | Solve normal equations |
| Logistic Regression | Linear classifiers (via softmax) | Log loss | Gradient descent |
| Perceptron | Linear classifiers (implicit) | Hinge-like | Iterative updates |
| Decision Trees (no pruning) | All decision trees | 0-1 loss (or impurity) | Greedy splitting |
| Neural Networks | Parameterized functions | Cross-entropy, MSE | SGD, Adam |
In each case, the algorithm searches within a hypothesis class to find the lowest training error (or a surrogate of it).
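As a concrete instance, the first row of the table can be written out directly: least-squares regression solves ERM in closed form. A small sketch with synthetic data (the names and values are illustrative choices):

```python
import numpy as np

# Least-squares regression as ERM: the class is linear functions w^T x,
# the loss is squared error, and the empirical risk minimizer has a closed form.
rng = np.random.default_rng(1)
n, d = 100, 3
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# ERM solution: w = argmin_w (1/n) * sum_i (w^T x_i - y_i)^2,
# computed here as the least-squares solution (equivalent to the normal equations).
w_erm, *_ = np.linalg.lstsq(X, y, rcond=None)
train_risk = np.mean((X @ w_erm - y) ** 2)
print(w_erm, train_risk)
```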
Let us start with the simplest case: a finite hypothesis class $\mathcal{H}$ with $|\mathcal{H}| = M < \infty$. This case provides clean, intuitive results that illuminate the core logic of learning theory.
We want to show: if ERM achieves low training error, does it also achieve low true error?
For this, we need to understand how much the empirical risk can differ from the true risk. If $\hat{R}_S(h)$ is always close to $R(h)$, then minimizing $\hat{R}_S(h)$ approximately minimizes $R(h)$.
For a single fixed hypothesis $h$, the empirical risk is an average of $n$ i.i.d. random variables: $$\hat{R}_S(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i)$$
If the loss is bounded ($\ell \in [0, 1]$), Hoeffding's inequality gives: $$\mathbb{P}\left[|\hat{R}_S(h) - R(h)| > \epsilon\right] \leq 2e^{-2n\epsilon^2}$$
This says: for any single hypothesis, the empirical risk is close to the true risk with high probability.
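A quick simulation makes the statement tangible. Assuming (purely for illustration) a fixed hypothesis with true risk $R(h) = 0.3$ under 0-1 loss, the observed deviation probability sits comfortably below the Hoeffding bound:

```python
import numpy as np

# Concentration of the empirical risk around the true risk for one fixed hypothesis:
# each per-example 0-1 loss is Bernoulli(p) with p = R(h) = 0.3 (an illustrative assumption).
rng = np.random.default_rng(2)
true_risk, n, eps, trials = 0.3, 500, 0.05, 10_000

losses = rng.random((trials, n)) < true_risk   # per-example 0-1 losses
empirical_risks = losses.mean(axis=1)          # R_hat_S(h) for each simulated sample S

deviation_prob = np.mean(np.abs(empirical_risks - true_risk) > eps)
hoeffding_bound = 2 * np.exp(-2 * n * eps**2)
print(f"observed P[|R_hat - R| > {eps}] = {deviation_prob:.4f}")
print(f"Hoeffding bound               = {hoeffding_bound:.4f}")
```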
The concentration result above holds for a fixed, pre-specified hypothesis $h$, not for the hypothesis selected by looking at the data. When we select $h_{\text{ERM}}$ by minimizing training error, we're implicitly testing all hypotheses in $\mathcal{H}$ and picking the best. This selection process introduces dependence that we must account for.
To handle the selection process, we use the union bound. If we want $|\hat{R}_S(h) - R(h)| \leq \epsilon$ to hold simultaneously for all $h \in \mathcal{H}$, we pay a factor of $|\mathcal{H}|$:
$$\mathbb{P}\left[\exists h \in \mathcal{H}: |\hat{R}_S(h) - R(h)| > \epsilon\right] \leq \sum_{h \in \mathcal{H}} \mathbb{P}\left[|\hat{R}_S(h) - R(h)| > \epsilon\right] \leq |\mathcal{H}| \cdot 2e^{-2n\epsilon^2} = 2M \cdot e^{-2n\epsilon^2}$$
With probability at least $1 - 2M \cdot e^{-2n\epsilon^2}$: $$\forall h \in \mathcal{H}: \quad |\hat{R}_S(h) - R(h)| \leq \epsilon$$
This is called uniform convergence: the empirical risk converges to the true risk uniformly over all hypotheses in $\mathcal{H}$.
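The following simulation, which reuses the synthetic threshold class from the earlier sketch (an illustrative setup, not taken from the formal development), shows the price of uniformity: the worst-case deviation over all 101 hypotheses is noticeably larger than the deviation of a single fixed hypothesis, yet it still shrinks as $n$ grows.

```python
import numpy as np

# Uniform convergence for the finite class of threshold classifiers h_t(x) = 1[x >= t]
# on Uniform(0,1) inputs with true labels 1[x >= 0.6]. In this toy setup the true risk
# of h_t is exactly |t - 0.6|, so the deviations can be computed directly.
rng = np.random.default_rng(3)
thresholds = np.linspace(0, 1, 101)      # finite class H with M = 101
true_risks = np.abs(thresholds - 0.6)    # exact true risk of each h_t
n, trials = 200, 500

max_devs, single_devs = [], []
for _ in range(trials):
    x = rng.uniform(0, 1, n)
    y = (x >= 0.6).astype(int)
    preds = (x[:, None] >= thresholds[None, :])        # n x M prediction matrix
    emp_risks = np.mean(preds != y[:, None], axis=0)   # empirical risk of each h_t
    devs = np.abs(emp_risks - true_risks)
    max_devs.append(devs.max())                        # sup over the whole class
    single_devs.append(devs[50])                       # one fixed hypothesis (t = 0.5)

print(f"mean sup_h deviation:        {np.mean(max_devs):.3f}")
print(f"mean deviation of a fixed h: {np.mean(single_devs):.3f}")
```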
Let's derive the guarantee. Setting $\delta = 2M \cdot e^{-2n\epsilon^2}$ and solving for $\epsilon$:
$$\epsilon = \sqrt{\frac{\ln(2M/\delta)}{2n}} = \sqrt{\frac{\ln(2|\mathcal{H}|/\delta)}{2n}}$$
Theorem (ERM Generalization for Finite Classes): For any finite hypothesis class $\mathcal{H}$ with bounded loss $\ell \in [0,1]$, with probability at least $1 - \delta$ over the training set $S \sim \mathcal{D}^n$:
$$R(h_{\text{ERM}}) \leq \hat{R}_S(h_{\text{ERM}}) + \sqrt{\frac{\ln(2|\mathcal{H}|/\delta)}{2n}}$$
And since $\hat{R}_S(h_{\text{ERM}}) \leq \hat{R}_S(h)$ for every $h \in \mathcal{H}$ (by definition of ERM), applying the uniform-convergence bound once more to the best-in-class hypothesis $h^*$ gives:
$$R(h_{\text{ERM}}) \leq R(h^*) + 2\sqrt{\frac{\ln(2|\mathcal{H}|/\delta)}{2n}}$$
where $h^* = \arg\min_{h \in \mathcal{H}} R(h)$ is the best hypothesis in $\mathcal{H}$.
The bound says: ERM achieves risk at most $O(\sqrt{\log|\mathcal{H}|/n})$ worse than the best hypothesis in the class. This excess error shrinks as we get more data ($n$ increases) and grows only logarithmically with the class size: going from a single hypothesis to a class of one billion hypotheses adds only $\ln(10^9) \approx 21$ inside the square root.
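A few plugged-in numbers (with arbitrary choices of $n$ and $\delta$) show how slowly the complexity term grows with the class size:

```python
import numpy as np

# Evaluate the complexity term sqrt(ln(2M/delta) / (2n)) for several class sizes M.
# The values of n and delta below are arbitrary illustrative choices.
n, delta = 10_000, 0.05
for M in [1, 10, 10**3, 10**9, 2**100]:
    eps = np.sqrt(np.log(2 * M / delta) / (2 * n))
    print(f"M = {M}: excess-risk term = {eps:.4f}")
```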
A central question in learning theory is: how much data is enough? This is captured by the notion of sample complexity.
Sample complexity is the minimum number of training examples needed to achieve a given accuracy with a given confidence.
Definition: The sample complexity $n_\mathcal{H}(\epsilon, \delta)$ of a hypothesis class $\mathcal{H}$ is the smallest integer $n$ such that:
For any distribution $\mathcal{D}$, if $S \sim \mathcal{D}^n$ and $h_S$ is the hypothesis returned by the learning algorithm (here, ERM), then: $$\mathbb{P}[R(h_S) \leq \min_{h \in \mathcal{H}} R(h) + \epsilon] \geq 1 - \delta$$
From our ERM analysis, we can derive an explicit sample complexity bound. We want:
$$\sqrt{\frac{\ln(2|\mathcal{H}|/\delta)}{2n}} \leq \epsilon$$
Solving for $n$:
$$n \geq \frac{\ln(2|\mathcal{H}|/\delta)}{2\epsilon^2}$$
Theorem (Sample Complexity of Finite ERM): For a finite hypothesis class $\mathcal{H}$ with bounded loss, the sample complexity of ERM is:
$$n_\mathcal{H}(\epsilon, \delta) = O\left(\frac{\log(|\mathcal{H}|/\delta)}{\epsilon^2}\right)$$
More precisely: $$n_\mathcal{H}(\epsilon, \delta) \leq \frac{\ln(2|\mathcal{H}|/\delta)}{2\epsilon^2}$$
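The bound can be turned into a back-of-the-envelope calculator; the parameter values below are arbitrary illustrations:

```python
import math

# Plug-in sample-complexity bound for a finite class (agnostic case):
# n >= ln(2M / delta) / (2 * eps^2).
def sample_complexity(M, eps, delta):
    return math.ceil(math.log(2 * M / delta) / (2 * eps**2))

for M in [100, 10**6, 10**9]:
    print(M, sample_complexity(M, eps=0.05, delta=0.01))
```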
| Factor | Effect on Sample Complexity | Intuition |
|---|---|---|
| $|\mathcal{H}| \uparrow$ | $n \uparrow$ (logarithmically) | More hypotheses → need more data to distinguish |
| $\epsilon \downarrow$ (more accuracy) | $n \uparrow$ (quadratically) | Finer accuracy requires more samples |
| $\delta \downarrow$ (more confidence) | $n \uparrow$ (logarithmically) | Higher confidence requires slightly more data |
The logarithmic dependence on $|\mathcal{H}|$ is remarkably favorable. A hypothesis class with $2^{100}$ elements (astronomical!) adds only $100 \ln 2 \approx 69$ to the numerator of the bound compared to a class of size 2; at accuracy $\epsilon = 0.1$, that amounts to a few thousand extra samples, not an astronomical number. This suggests that even very large hypothesis classes may be learnable with reasonable sample sizes—if we can measure their complexity appropriately.
In the realizable setting, we assume there exists $h^* \in \mathcal{H}$ with $R(h^*) = 0$ (perfect classification). Under this assumption, ERM achieves even tighter bounds.
Theorem (Realizable ERM Sample Complexity): In the realizable setting with 0-1 loss, the sample complexity of ERM is:
$$n_\mathcal{H}(\epsilon, \delta) \leq \frac{\ln(|\mathcal{H}|/\delta)}{\epsilon}$$
Notice the improvement: $1/\epsilon$ instead of $1/\epsilon^2$. This is because when the optimal risk is zero, ERM with zero training error must have bounded true risk with high probability (any hypothesis with high true risk would likely make training errors).
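A short comparison (again with arbitrary parameter values) makes the $1/\epsilon$ versus $1/\epsilon^2$ difference vivid:

```python
import math

# Compare the realizable (1/eps) and agnostic (1/eps^2) sample-size bounds
# for the same finite class, using the constants from the theorems above.
M, delta = 10**6, 0.01
for eps in [0.1, 0.01, 0.001]:
    realizable = math.ceil(math.log(M / delta) / eps)
    agnostic = math.ceil(math.log(2 * M / delta) / (2 * eps**2))
    print(f"eps={eps}: realizable ~ {realizable}, agnostic ~ {agnostic}")
```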
Despite its intuitive appeal, ERM can fail catastrophically. The failure mode is called overfitting: the ERM hypothesis fits the training data perfectly but generalizes poorly to new data.
Consider the following scenario: the input space $\mathcal{X}$ is large (for instance, all binary strings of length 100), labels are binary, and $\mathcal{H}$ is the class of all functions from $\mathcal{X}$ to $\{0, 1\}$.
With this hypothesis class, we can always find an $h$ that exactly memorizes the training data: $$h(x) = \begin{cases} y_i & \text{if } x = x_i \text{ for some } i \\ 0 & \text{otherwise} \end{cases}$$
This $h$ achieves $\hat{R}_S(h) = 0$ (zero training error). But its predictions on new points are essentially random—it has no knowledge of the underlying pattern.
The empirical risk is zero, but the true risk could be arbitrarily high.
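A tiny experiment makes this failure concrete. The memorizing hypothesis below achieves zero training error on a perfectly learnable pattern (the label equals the first coordinate) yet predicts at roughly chance level on fresh points; the data, dimensions, and names are illustrative choices:

```python
import numpy as np

# A "memorizing" hypothesis: return the stored label on training points, 0 everywhere else.
rng = np.random.default_rng(4)
d = 20
X_train = rng.integers(0, 2, size=(100, d))
y_train = X_train[:, 0]                 # true pattern: label equals the first coordinate
X_test = rng.integers(0, 2, size=(1000, d))
y_test = X_test[:, 0]

memory = {tuple(x): y for x, y in zip(X_train, y_train)}

def h(x):
    return memory.get(tuple(x), 0)      # predicts 0 on every unseen point

train_err = np.mean([h(x) != y for x, y in zip(X_train, y_train)])
test_err = np.mean([h(x) != y for x, y in zip(X_test, y_test)])
print(f"training error: {train_err:.2f}, test error: {test_err:.2f}")  # ~0.00 vs ~0.50
```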
Overfitting occurs when a learning algorithm achieves low training error but high test error. The hypothesis has 'memorized' the training data rather than 'learning' the underlying pattern. Overfitting is possible whenever the hypothesis class is rich enough to fit arbitrary training sets.
The mathematical explanation is clear from our earlier analysis. The generalization bound for finite classes was:
$$R(h_{\text{ERM}}) \leq \hat{R}_S(h_{\text{ERM}}) + \sqrt{\frac{\ln(2|\mathcal{H}|/\delta)}{2n}}$$
When $|\mathcal{H}|$ is very large (or infinite), the second term—the complexity penalty—becomes uncontrollable. For the class of all functions on a large input space, $\ln|\mathcal{H}|$ is astronomical (or infinite), so the bound is vacuous: it guarantees nothing.
The core issue: ERM treats all hypotheses in $\mathcal{H}$ equally. If $\mathcal{H}$ is too rich, there will always be "lucky" hypotheses that accidentally fit the training data despite being wrong about the true distribution.
We now see the core tradeoff in machine learning: a richer hypothesis class $\mathcal{H}$ can contain hypotheses closer to the truth (lower approximation error), but it also makes the empirical risk a less reliable proxy for the true risk (higher estimation error).
This is the bias-variance tradeoff in hypothesis class terms: approximation error plays the role of bias, and estimation error plays the role of variance.
We will formalize this decomposition in a later section of this module.
Inductive bias refers to the set of assumptions a learning algorithm makes to generalize beyond the training data. For ERM, the primary inductive bias is the choice of hypothesis class $\mathcal{H}$.
When we say "use a linear classifier," we're imposing the inductive bias that decision boundaries should be hyperplanes. When we use "a neural network with 3 layers," we're biasing toward a specific family of functions.
Without inductive bias, learning is impossible.
This isn't a practical limitation—it's a theorem, known as the No Free Lunch theorem, which we'll cover later. For now, understand that inductive bias is not an evil to be minimized but a necessary ingredient for generalization.
One classical inductive bias is Occam's Razor: prefer simpler hypotheses. In learning theory terms, this translates to: given multiple hypotheses with similar training error, prefer the one from a simpler class. This is the foundation of regularization and model selection procedures.
ERM's inductive bias operates at multiple levels:
1. Hypothesis Class Selection The broadest bias: choosing what family of functions to consider. Linear vs. nonlinear, parametric vs. nonparametric, etc.
2. Loss Function Selection The loss encodes what we care about. Squared loss vs. absolute loss vs. 0-1 loss vs. hinge loss yield different ERM solutions.
3. Optimization Procedure For complex hypothesis classes, ERM can't be solved exactly. The optimization algorithm (gradient descent, for instance) introduces implicit bias toward certain solutions. This is an active research area: deep learning's success may partly stem from implicit regularization in SGD.
4. Regularization (if added) Many practical algorithms add explicit regularization, departing from pure ERM. This modifies the optimization objective: $$h = \arg\min_{h \in \mathcal{H}} \left[ \hat{R}_S(h) + \lambda \cdot \Omega(h) \right]$$
where $\Omega(h)$ penalizes complexity (e.g., norm of weights).
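As a sketch of regularized ERM, ridge regression adds the penalty $\Omega(w) = \|w\|^2$ to the empirical squared loss and retains a closed-form solution; the synthetic data and $\lambda$ values below are illustrative choices:

```python
import numpy as np

# Regularized ERM: minimize (1/n)||Xw - y||^2 + lam * ||w||^2 (ridge regression).
rng = np.random.default_rng(5)
n, d = 50, 30                                  # few samples, many features
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + 0.1 * rng.normal(size=n)

def ridge(X, y, lam):
    # Closed form: (X^T X / n + lam * I) w = X^T y / n.
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

for lam in [0.0, 0.01, 1.0]:
    w = ridge(X, y, lam)
    print(f"lambda={lam}: ||w|| = {np.linalg.norm(w):.2f}")  # larger lambda shrinks w
```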
Consider two scenarios:
Scenario A: True function is linear, hypothesis class is linear
Scenario B: True function is linear, hypothesis class is degree-100 polynomials
In Scenario B, we made $\mathcal{H}$ unnecessarily large. The extra capacity doesn't help (the true function is already linear) but hurts (overfitting risk increases).
Moral: The best hypothesis class is one that is large enough to contain the truth but small enough for reliable estimation. This Goldilocks zone depends on the domain, data quantity, and prior knowledge.
The theoretical analysis of ERM assumes we can perfectly minimize empirical risk over $\mathcal{H}$. In practice, this minimization is often computationally intractable.
For many hypothesis classes, ERM is NP-hard:
| Hypothesis Class | ERM Complexity | Notes |
|---|---|---|
| Linear classifiers (0-1 loss) | NP-hard | Finding min-error hyperplane is hard |
| Decision trees (optimal) | NP-hard | Finding smallest tree is hard |
| Neural networks (arbitrary architecture) | NP-hard | Non-convex optimization landscape |
| Linear classifiers (convex surrogate) | Polynomial | Convex optimization |
| Linear regression (squared loss) | $O(nd^2 + d^3)$ | Closed-form solution |
When ERM is NP-hard, we use heuristics (e.g., greedy algorithms, local search) or optimize a convex surrogate of the loss function.
Theoretical ERM analysis assumes exact minimization of empirical risk over $\mathcal{H}$. Practical algorithms often (1) use surrogate losses, (2) employ approximate optimization, and (3) add regularization. The gap between theoretical ERM and practical algorithms is an area of ongoing research.
To make ERM computationally tractable, we often replace the true loss with a surrogate that is easier to optimize.
For classification, 0-1 loss is non-convex and discontinuous. Common surrogates include:
Hinge Loss (SVM): $$\ell_{\text{hinge}}(f(x), y) = \max(0, 1 - y \cdot f(x))$$
Logistic Loss: $$\ell_{\text{logistic}}(f(x), y) = \log(1 + e^{-y \cdot f(x)})$$
Exponential Loss (AdaBoost): $$\ell_{\text{exp}}(f(x), y) = e^{-y \cdot f(x)}$$
These are convex in $f(x)$, enabling efficient optimization. Crucially, they are calibrated: minimizing the surrogate risk also minimizes the 0-1 risk (under mild conditions).
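Writing the losses as functions of the margin $m = y \cdot f(x)$ makes their shapes easy to compare; the snippet below evaluates them at a few arbitrarily chosen margins alongside the 0-1 loss:

```python
import numpy as np

# The surrogate losses above as functions of the margin m = y * f(x),
# compared with the 0-1 loss; each surrogate is convex in the margin.
def zero_one(m):    return (m <= 0).astype(float)
def hinge(m):       return np.maximum(0.0, 1.0 - m)
def logistic(m):    return np.log(1.0 + np.exp(-m))
def exponential(m): return np.exp(-m)

margins = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for name, loss in [("0-1", zero_one), ("hinge", hinge),
                   ("logistic", logistic), ("exp", exponential)]:
    print(f"{name:>8}: {np.round(loss(margins), 3)}")
```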
Definition (Calibration): A surrogate $\phi$ is calibrated if minimizing $\mathbb{E}[\phi(f(x), y)]$ over all measurable $f$ also yields the Bayes optimal classifier for 0-1 loss.
Even with convex surrogates, complex hypothesis classes (like neural networks) require iterative optimization. The choice of optimizer can affect the final hypothesis:
Gradient Descent: Tends to find solutions that interpolate data "smoothly" (minimum norm solutions in certain settings)
Stochastic Gradient Descent: Introduces noise that can aid generalization (acts like implicit regularization)
Adaptive Methods (Adam, RMSprop): May converge to different solutions than vanilla gradient descent
This implicit bias of optimization is particularly important for deep learning, where the hypothesis class is so large that many hypotheses achieve zero training error. The optimizer's trajectory selects among these, and some selections generalize better than others.
Key Insight: In modern deep learning, we're not just doing ERM—we're doing ERM + implicit regularization from optimization + explicit regularization + architecture bias. Disentangling these effects is an active research frontier.
The theoretical foundation for ERM rests on the concept of uniform convergence—the idea that empirical risk approximates true risk uniformly over all hypotheses in $\mathcal{H}$.
Definition: A hypothesis class $\mathcal{H}$ has the uniform convergence property with respect to a distribution $\mathcal{D}$ if:
$$\sup_{h \in \mathcal{H}} |\hat{R}_S(h) - R(h)| \xrightarrow{p} 0 \quad \text{as } n \to \infty$$
That is, the maximum deviation between empirical and true risk (over all hypotheses) goes to zero in probability.
Equivalently, for any $\epsilon, \delta > 0$, there exists $n_0$ such that for $n \geq n_0$: $$\mathbb{P}\left[\sup_{h \in \mathcal{H}} |\hat{R}_S(h) - R(h)| > \epsilon\right] < \delta$$
Theorem: If $\mathcal{H}$ has the uniform convergence property, then ERM is a consistent learner for $\mathcal{H}$. Specifically:
Let $h^* = \arg\min_{h \in \mathcal{H}} R(h)$ and $h_{\text{ERM}} = \arg\min_{h \in \mathcal{H}} \hat{R}_S(h)$. If uniform convergence holds: $$R(h_{\text{ERM}}) \to R(h^*) \quad \text{as } n \to \infty$$
Proof Sketch: On the event $\sup_{h \in \mathcal{H}} |\hat{R}_S(h) - R(h)| \leq \epsilon$, we have $$R(h_{\text{ERM}}) \leq \hat{R}_S(h_{\text{ERM}}) + \epsilon \leq \hat{R}_S(h^*) + \epsilon \leq R(h^*) + 2\epsilon.$$ Uniform convergence says this event has probability approaching one for every fixed $\epsilon > 0$, so $R(h_{\text{ERM}}) \to R(h^*)$ in probability.
A deep result (which we'll encounter when studying VC dimension) states that for binary classification with 0-1 loss, uniform convergence is equivalent to learnability. That is, a class is learnable (has finite sample complexity) if and only if it has the uniform convergence property. This is called the Fundamental Theorem of Statistical Learning.
Which hypothesis classes have the uniform convergence property? This is precisely what complexity measures like VC dimension and Rademacher complexity characterize: roughly, uniform convergence holds when the VC dimension is finite (for binary classification) or when the Rademacher complexity vanishes as $n \to \infty$.
These complexity measures will be the subject of subsequent sections. They provide the "right" notion of hypothesis class size—richer than simply counting hypotheses.
We have examined the principle of Empirical Risk Minimization from multiple angles—its definition, guarantees, limitations, and practical considerations. The key insight: ERM is sound exactly when the empirical risk tracks the true risk uniformly over $\mathcal{H}$, which requires the complexity of $\mathcal{H}$ to be controlled relative to the sample size.
ERM tells us what to optimize; now we need to understand the gap between what we optimize and what we care about.
The next page explores the relationship between true risk and empirical risk in depth. We'll formalize the generalization gap, understand its sources, and see how it decomposes into interpretable components. This will set the stage for the bias-variance tradeoff and the central complexity measures (VC dimension, Rademacher complexity) that characterize learnability.
You now understand Empirical Risk Minimization as a formal learning principle. You can state when ERM provides guarantees (finite classes, uniform convergence), why it can fail (overfitting with complex classes), and how inductive bias is essential for learning. You're equipped to analyze ERM-based algorithms and understand their theoretical foundations.