In the previous section, we established the formal learning framework: an unknown distribution $\mathcal{D}$, a hypothesis class $\mathcal{H}$, and the goal of finding $h \in \mathcal{H}$ with low true risk $R(h)$. We identified the fundamental challenge: we cannot compute $R(h)$ because $\mathcal{D}$ is unknown.
So what can we do?
The most natural idea in learning—so natural it may seem obvious—is to minimize what we can measure: the error on the training data. This principle, called Empirical Risk Minimization (ERM), underlies virtually every supervised learning algorithm.
But as we'll discover, ERM's apparent simplicity conceals deep questions. When does minimizing training error guarantee low test error? When does it catastrophically fail? What conditions must hold for ERM to succeed? These questions are at the heart of statistical learning theory.
By the end of this page, you will understand ERM as a formal principle, know its theoretical guarantees for finite hypothesis classes, appreciate why ERM can fail dramatically for complex hypothesis classes, and understand the tension between approximation error (related to bias) and estimation error (related to variance) that ERM must balance.
Recall that given a training set $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ and a loss function $\ell$, the empirical risk of a hypothesis $h$ is:
$$\hat{R}_S(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i)$$
Definition (Empirical Risk Minimization): The ERM principle selects the hypothesis that minimizes empirical risk:
$$h_{\text{ERM}} = \arg\min_{h \in \mathcal{H}} \hat{R}_S(h)$$
In words: among all hypotheses in $\mathcal{H}$, choose the one that makes the fewest mistakes (or has the lowest loss) on the training data.
The argmin may not be unique—multiple hypotheses might achieve the same minimal training error. In such cases, ERM outputs one of the minimizers (which one depends on the implementation). Some analyses consider 'any ERM hypothesis' to handle this; others specify tie-breaking rules.
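To make the principle concrete, here is a minimal sketch of ERM over a small finite class of threshold classifiers on the line, using synthetic data (the setup, names, and parameter values are illustrative assumptions, not part of the formal development):

```python
import numpy as np

# Minimal ERM sketch for a finite hypothesis class: threshold classifiers
# h_t(x) = 1 if x >= t else 0, for a fixed grid of thresholds t.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 1, size=n)
y = (x >= 0.6).astype(int)                    # true labels come from threshold 0.6
y = np.where(rng.random(n) < 0.1, 1 - y, y)   # 10% label noise

thresholds = np.linspace(0, 1, 101)           # the finite class H, |H| = 101

def empirical_risk(t, x, y):
    """Average 0-1 loss of the threshold classifier h_t on the sample."""
    preds = (x >= t).astype(int)
    return np.mean(preds != y)

# ERM: pick the hypothesis with the smallest training error.
risks = np.array([empirical_risk(t, x, y) for t in thresholds])
t_erm = thresholds[np.argmin(risks)]
print(f"ERM threshold: {t_erm:.2f}, training error: {risks.min():.3f}")
```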
The intuition behind ERM is statistical: the empirical risk $\hat{R}_S(h)$ is a sample average that estimates the true risk $R(h)$. By the law of large numbers:
$$\hat{R}_S(h) \xrightarrow{a.s.} R(h) \quad \text{as } n \to \infty$$
For any fixed hypothesis $h$, the empirical risk converges to the true risk as sample size grows. Therefore, minimizing the empirical risk seems like a reasonable proxy for minimizing the true risk.
Moreover, ERM is all we can do in a certain sense: without additional assumptions or side information, the training data is our only window into the distribution. Any learning algorithm that doesn't consider the training loss is ignoring available evidence.
Many algorithms you already know are instances of ERM:
| Algorithm | Hypothesis Class $\mathcal{H}$ | Loss Function | ERM Procedure |
|---|---|---|---|
| Least Squares Regression | Linear functions $\{x \mapsto w^\top x : w \in \mathbb{R}^d\}$ | Squared loss $(\hat{y} - y)^2$ | Solve normal equations |
| Logistic Regression | Linear classifiers (via softmax) | Log loss | Gradient descent |
| Perceptron | Linear classifiers (implicit) | Hinge-like | Iterative updates |
| Decision Trees (no pruning) | All decision trees | 0-1 loss (or impurity) | Greedy splitting |
| Neural Networks | Parameterized functions | Cross-entropy, MSE | SGD, Adam |
In each case, the algorithm searches within a hypothesis class to find the lowest training error (or a surrogate of it).
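As a concrete instance, the first row of the table can be written out directly: least-squares regression solves ERM in closed form. A small sketch with synthetic data (the names and values are illustrative choices):

```python
import numpy as np

# Least-squares regression as ERM: the class is linear functions w^T x,
# the loss is squared error, and the empirical risk minimizer has a closed form.
rng = np.random.default_rng(1)
n, d = 100, 3
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# ERM solution: w = argmin_w (1/n) * sum_i (w^T x_i - y_i)^2,
# computed here as the least-squares solution (equivalent to the normal equations).
w_erm, *_ = np.linalg.lstsq(X, y, rcond=None)
train_risk = np.mean((X @ w_erm - y) ** 2)
print(w_erm, train_risk)
```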
Let us start with the simplest case: a finite hypothesis class $\mathcal{H}$ with $|\mathcal{H}| = M < \infty$. This case provides clean, intuitive results that illuminate the core logic of learning theory.
We want to show: if ERM achieves low training error, does it also achieve low true error?
For this, we need to understand how much the empirical risk can differ from the true risk. If $\hat{R}_S(h)$ is always close to $R(h)$, then minimizing $\hat{R}_S(h)$ approximately minimizes $R(h)$.
For a single fixed hypothesis $h$, the empirical risk is an average of $n$ i.i.d. random variables: $$\hat{R}_S(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i)$$
If the loss is bounded ($\ell \in [0, 1]$), Hoeffding's inequality gives: $$\mathbb{P}\left[|\hat{R}_S(h) - R(h)| > \epsilon\right] \leq 2e^{-2n\epsilon^2}$$
This says: for any single hypothesis, the empirical risk is close to the true risk with high probability.
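A quick simulation makes the statement tangible. Assuming (purely for illustration) a fixed hypothesis with true risk $R(h) = 0.3$ under 0-1 loss, the observed deviation probability sits comfortably below the Hoeffding bound:

```python
import numpy as np

# Concentration of the empirical risk around the true risk for one fixed hypothesis:
# each per-example 0-1 loss is Bernoulli(p) with p = R(h) = 0.3 (an illustrative assumption).
rng = np.random.default_rng(2)
true_risk, n, eps, trials = 0.3, 500, 0.05, 10_000

losses = rng.random((trials, n)) < true_risk   # per-example 0-1 losses
empirical_risks = losses.mean(axis=1)          # R_hat_S(h) for each simulated sample S

deviation_prob = np.mean(np.abs(empirical_risks - true_risk) > eps)
hoeffding_bound = 2 * np.exp(-2 * n * eps**2)
print(f"observed P[|R_hat - R| > {eps}] = {deviation_prob:.4f}")
print(f"Hoeffding bound               = {hoeffding_bound:.4f}")
```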
The concentration result above holds for a fixed, pre-specified hypothesis $h$, not for the hypothesis selected by looking at the data. When we select $h_{\text{ERM}}$ by minimizing training error, we're implicitly testing all hypotheses in $\mathcal{H}$ and picking the best. This selection process introduces dependence that we must account for.
To handle the selection process, we use the union bound. If we want $|\hat{R}_S(h) - R(h)| \leq \epsilon$ to hold simultaneously for all $h \in \mathcal{H}$, we pay a factor of $|\mathcal{H}|$:
$$\mathbb{P}\left[\exists h \in \mathcal{H}: |\hat{R}_S(h) - R(h)| > \epsilon\right] \leq \sum_{h \in \mathcal{H}} \mathbb{P}\left[|\hat{R}_S(h) - R(h)| > \epsilon\right] \leq |\mathcal{H}| \cdot 2e^{-2n\epsilon^2} = 2M \cdot e^{-2n\epsilon^2}$$
With probability at least $1 - 2M \cdot e^{-2n\epsilon^2}$: $$\forall h \in \mathcal{H}: \quad |\hat{R}_S(h) - R(h)| \leq \epsilon$$
This is called uniform convergence: the empirical risk converges to the true risk uniformly over all hypotheses in $\mathcal{H}$.
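The following simulation, which reuses the synthetic threshold class from the earlier sketch (an illustrative setup, not taken from the formal development), shows the price of uniformity: the worst-case deviation over all 101 hypotheses is noticeably larger than the deviation of a single fixed hypothesis, yet it still shrinks as $n$ grows.

```python
import numpy as np

# Uniform convergence for the finite class of threshold classifiers h_t(x) = 1[x >= t]
# on Uniform(0,1) inputs with true labels 1[x >= 0.6]. In this toy setup the true risk
# of h_t is exactly |t - 0.6|, so the deviations can be computed directly.
rng = np.random.default_rng(3)
thresholds = np.linspace(0, 1, 101)      # finite class H with M = 101
true_risks = np.abs(thresholds - 0.6)    # exact true risk of each h_t
n, trials = 200, 500

max_devs, single_devs = [], []
for _ in range(trials):
    x = rng.uniform(0, 1, n)
    y = (x >= 0.6).astype(int)
    preds = (x[:, None] >= thresholds[None, :])        # n x M prediction matrix
    emp_risks = np.mean(preds != y[:, None], axis=0)   # empirical risk of each h_t
    devs = np.abs(emp_risks - true_risks)
    max_devs.append(devs.max())                        # sup over the whole class
    single_devs.append(devs[50])                       # one fixed hypothesis (t = 0.5)

print(f"mean sup_h deviation:        {np.mean(max_devs):.3f}")
print(f"mean deviation of a fixed h: {np.mean(single_devs):.3f}")
```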
Let's derive the guarantee. Setting $\delta = 2M \cdot e^{-2n\epsilon^2}$ and solving for $\epsilon$:
$$\epsilon = \sqrt{\frac{\ln(2M/\delta)}{2n}} = \sqrt{\frac{\ln(2|\mathcal{H}|/\delta)}{2n}}$$
Theorem (ERM Generalization for Finite Classes): For any finite hypothesis class $\mathcal{H}$ with bounded loss $\ell \in [0,1]$, with probability at least $1 - \delta$ over the training set $S \sim \mathcal{D}^n$:
$$R(h_{\text{ERM}}) \leq \hat{R}_S(h_{\text{ERM}}) + \sqrt{\frac{\ln(2|\mathcal{H}|/\delta)}{2n}}$$
And since $\hat{R}_S(h_{\text{ERM}}) \leq \hat{R}_S(h)$ for every $h \in \mathcal{H}$ (by definition of ERM), applying the uniform-convergence bound once more to the best-in-class hypothesis $h^*$ gives:
$$R(h_{\text{ERM}}) \leq R(h^*) + 2\sqrt{\frac{\ln(2|\mathcal{H}|/\delta)}{2n}}$$
where $h^* = \arg\min_{h \in \mathcal{H}} R(h)$ is the best hypothesis in $\mathcal{H}$.
The bound says: ERM achieves risk at most $O(\sqrt{\log|\mathcal{H}|/n})$ worse than the best hypothesis in the class. This excess error shrinks as we get more data ($n$ increases) and grows only logarithmically with the class size: going from a single hypothesis to a class of one billion hypotheses adds only $\ln(10^9) \approx 21$ inside the square root.
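A few plugged-in numbers (with arbitrary choices of $n$ and $\delta$) show how slowly the complexity term grows with the class size:

```python
import numpy as np

# Evaluate the complexity term sqrt(ln(2M/delta) / (2n)) for several class sizes M.
# The values of n and delta below are arbitrary illustrative choices.
n, delta = 10_000, 0.05
for M in [1, 10, 10**3, 10**9, 2**100]:
    eps = np.sqrt(np.log(2 * M / delta) / (2 * n))
    print(f"M = {M}: excess-risk term = {eps:.4f}")
```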
A central question in learning theory is: how much data is enough? This is captured by the notion of sample complexity.
Sample complexity is the minimum number of training examples needed to achieve a given accuracy with a given confidence.
Definition: The sample complexity $n_\mathcal{H}(\epsilon, \delta)$ of a hypothesis class $\mathcal{H}$ is the smallest integer $n$ such that:
For any distribution $\mathcal{D}$, if $S \sim \mathcal{D}^n$ and $h_S$ is the hypothesis returned by the learning algorithm (here, ERM), then: $$\mathbb{P}[R(h_S) \leq \min_{h \in \mathcal{H}} R(h) + \epsilon] \geq 1 - \delta$$
From our ERM analysis, we can derive an explicit sample complexity bound. We want:
$$\sqrt{\frac{\ln(2|\mathcal{H}|/\delta)}{2n}} \leq \epsilon$$
Solving for $n$:
$$n \geq \frac{\ln(2|\mathcal{H}|/\delta)}{2\epsilon^2}$$
Theorem (Sample Complexity of Finite ERM): For a finite hypothesis class $\mathcal{H}$ with bounded loss, the sample complexity of ERM is:
$$n_\mathcal{H}(\epsilon, \delta) = O\left(\frac{\log(|\mathcal{H}|/\delta)}{\epsilon^2}\right)$$
More precisely: $$n_\mathcal{H}(\epsilon, \delta) \leq \frac{\ln(2|\mathcal{H}|/\delta)}{2\epsilon^2}$$
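The bound can be turned into a back-of-the-envelope calculator; the parameter values below are arbitrary illustrations:

```python
import math

# Plug-in sample-complexity bound for a finite class (agnostic case):
# n >= ln(2M / delta) / (2 * eps^2).
def sample_complexity(M, eps, delta):
    return math.ceil(math.log(2 * M / delta) / (2 * eps**2))

for M in [100, 10**6, 10**9]:
    print(M, sample_complexity(M, eps=0.05, delta=0.01))
```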
| Factor | Effect on Sample Complexity | Intuition |
|---|---|---|
| $|\mathcal{H}| \uparrow$ | $n \uparrow$ (logarithmically) | More hypotheses → need more data to distinguish |
| $\epsilon \downarrow$ (more accuracy) | $n \uparrow$ (quadratically) | Finer accuracy requires more samples |
| $\delta \downarrow$ (more confidence) | $n \uparrow$ (logarithmically) | Higher confidence requires slightly more data |
The logarithmic dependence on $|\mathcal{H}|$ is remarkably favorable. A hypothesis class with $2^{100}$ elements (astronomical!) adds only $100 \ln 2 \approx 69$ to the numerator of the bound compared to a class of size 2; at accuracy $\epsilon = 0.1$, that amounts to a few thousand extra samples, not an astronomical number. This suggests that even very large hypothesis classes may be learnable with reasonable sample sizes—if we can measure their complexity appropriately.
In the realizable setting, we assume there exists $h^* \in \mathcal{H}$ with $R(h^*) = 0$ (perfect classification). Under this assumption, ERM achieves even tighter bounds.
Theorem (Realizable ERM Sample Complexity): In the realizable setting with 0-1 loss, the sample complexity of ERM is:
$$n_\mathcal{H}(\epsilon, \delta) \leq \frac{\ln(|\mathcal{H}|/\delta)}{\epsilon}$$
Notice the improvement: $1/\epsilon$ instead of $1/\epsilon^2$. This is because when the optimal risk is zero, ERM with zero training error must have bounded true risk with high probability (any hypothesis with high true risk would likely make training errors).
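A short comparison (again with arbitrary parameter values) makes the $1/\epsilon$ versus $1/\epsilon^2$ difference vivid:

```python
import math

# Compare the realizable (1/eps) and agnostic (1/eps^2) sample-size bounds
# for the same finite class, using the constants from the theorems above.
M, delta = 10**6, 0.01
for eps in [0.1, 0.01, 0.001]:
    realizable = math.ceil(math.log(M / delta) / eps)
    agnostic = math.ceil(math.log(2 * M / delta) / (2 * eps**2))
    print(f"eps={eps}: realizable ~ {realizable}, agnostic ~ {agnostic}")
```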
Despite its intuitive appeal, ERM can fail catastrophically. The failure mode is called overfitting: the ERM hypothesis fits the training data perfectly but generalizes poorly to new data.
Consider the following scenario: the input space $\mathcal{X}$ is large (for instance, all binary strings of length 100), labels are binary, and $\mathcal{H}$ is the class of all functions from $\mathcal{X}$ to $\{0, 1\}$.
With this hypothesis class, we can always find an $h$ that exactly memorizes the training data: $$h(x) = \begin{cases} y_i & \text{if } x = x_i \text{ for some } i \\ 0 & \text{otherwise} \end{cases}$$
This $h$ achieves $\hat{R}_S(h) = 0$ (zero training error). But its predictions on new points are essentially random—it has no knowledge of the underlying pattern.
The empirical risk is zero, but the true risk could be arbitrarily high.
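A tiny experiment makes this failure concrete. The memorizing hypothesis below achieves zero training error on a perfectly learnable pattern (the label equals the first coordinate) yet predicts at roughly chance level on fresh points; the data, dimensions, and names are illustrative choices:

```python
import numpy as np

# A "memorizing" hypothesis: return the stored label on training points, 0 everywhere else.
rng = np.random.default_rng(4)
d = 20
X_train = rng.integers(0, 2, size=(100, d))
y_train = X_train[:, 0]                 # true pattern: label equals the first coordinate
X_test = rng.integers(0, 2, size=(1000, d))
y_test = X_test[:, 0]

memory = {tuple(x): y for x, y in zip(X_train, y_train)}

def h(x):
    return memory.get(tuple(x), 0)      # predicts 0 on every unseen point

train_err = np.mean([h(x) != y for x, y in zip(X_train, y_train)])
test_err = np.mean([h(x) != y for x, y in zip(X_test, y_test)])
print(f"training error: {train_err:.2f}, test error: {test_err:.2f}")  # ~0.00 vs ~0.50
```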
Overfitting occurs when a learning algorithm achieves low training error but high test error. The hypothesis has 'memorized' the training data rather than 'learning' the underlying pattern. Overfitting is possible whenever the hypothesis class is rich enough to fit arbitrary training sets.
The mathematical explanation is clear from our earlier analysis. The generalization bound for finite classes was:
$$R(h_{\text{ERM}}) \leq \hat{R}_S(h_{\text{ERM}}) + \sqrt{\frac{\ln(2|\mathcal{H}|/\delta)}{2n}}$$
When $|\mathcal{H}|$ is very large (or infinite), the second term—the complexity penalty—becomes uncontrollable. For the class of all functions on a large input space, $\ln|\mathcal{H}|$ is astronomical (or infinite), so the bound is vacuous: it guarantees nothing.
The core issue: ERM treats all hypotheses in $\mathcal{H}$ equally. If $\mathcal{H}$ is too rich, there will always be "lucky" hypotheses that accidentally fit the training data despite being wrong about the true distribution.
We now see the core tradeoff in machine learning: a richer hypothesis class $\mathcal{H}$ can contain hypotheses closer to the truth (lower approximation error), but it also makes the empirical risk a less reliable proxy for the true risk (higher estimation error).
This is the bias-variance tradeoff in hypothesis class terms: approximation error plays the role of bias, and estimation error plays the role of variance.
We will formalize this decomposition in a later section of this module.
Inductive bias refers to the set of assumptions a learning algorithm makes to generalize beyond the training data. For ERM, the primary inductive bias is the choice of hypothesis class $\mathcal{H}$.
When we say "use a linear classifier," we're imposing the inductive bias that decision boundaries should be hyperplanes. When we use "a neural network with 3 layers," we're biasing toward a specific family of functions.
Without inductive bias, learning is impossible.
This isn't a practical limitation—it's a theorem, known as the No Free Lunch theorem, which we'll cover later. For now, understand that inductive bias is not an evil to be minimized but a necessary ingredient for generalization.
One classical inductive bias is Occam's Razor: prefer simpler hypotheses. In learning theory terms, this translates to: given multiple hypotheses with similar training error, prefer the one from a simpler class. This is the foundation of regularization and model selection procedures.
ERM's inductive bias operates at multiple levels:
1. Hypothesis Class Selection The broadest bias: choosing what family of functions to consider. Linear vs. nonlinear, parametric vs. nonparametric, etc.
2. Loss Function Selection The loss encodes what we care about. Squared loss vs. absolute loss vs. 0-1 loss vs. hinge loss yield different ERM solutions.
3. Optimization Procedure For complex hypothesis classes, ERM can't be solved exactly. The optimization algorithm (gradient descent, for instance) introduces implicit bias toward certain solutions. This is an active research area: deep learning's success may partly stem from implicit regularization in SGD.
4. Regularization (if added) Many practical algorithms add explicit regularization, departing from pure ERM. This modifies the optimization objective: $$h = \arg\min_{h \in \mathcal{H}} \left[ \hat{R}_S(h) + \lambda \cdot \Omega(h) \right]$$
where $\Omega(h)$ penalizes complexity (e.g., norm of weights).
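As a sketch of regularized ERM, ridge regression adds the penalty $\Omega(w) = \|w\|^2$ to the empirical squared loss and retains a closed-form solution; the synthetic data and $\lambda$ values below are illustrative choices:

```python
import numpy as np

# Regularized ERM: minimize (1/n)||Xw - y||^2 + lam * ||w||^2 (ridge regression).
rng = np.random.default_rng(5)
n, d = 50, 30                                  # few samples, many features
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + 0.1 * rng.normal(size=n)

def ridge(X, y, lam):
    # Closed form: (X^T X / n + lam * I) w = X^T y / n.
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

for lam in [0.0, 0.01, 1.0]:
    w = ridge(X, y, lam)
    print(f"lambda={lam}: ||w|| = {np.linalg.norm(w):.2f}")  # larger lambda shrinks w
```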
Consider two scenarios:
Scenario A: True function is linear, hypothesis class is linear
Scenario B: True function is linear, hypothesis class is degree-100 polynomials
In Scenario B, we made $\mathcal{H}$ unnecessarily large. The extra capacity doesn't help (the true function is already linear) but hurts (overfitting risk increases).
Moral: The best hypothesis class is one that is large enough to contain the truth but small enough for reliable estimation. This Goldilocks zone depends on the domain, data quantity, and prior knowledge.
The theoretical analysis of ERM assumes we can perfectly minimize empirical risk over $\mathcal{H}$. In practice, this minimization is often computationally intractable.
For many hypothesis classes, ERM is NP-hard:
| Hypothesis Class | ERM Complexity | Notes |
|---|---|---|
| Linear classifiers (0-1 loss) | NP-hard | Finding min-error hyperplane is hard |
| Decision trees (optimal) | NP-hard | Finding smallest tree is hard |
| Neural networks (arbitrary architecture) | NP-hard | Non-convex optimization landscape |
| Linear classifiers (convex surrogate) | Polynomial | Convex optimization |
| Linear regression (squared loss) | $O(nd^2 + d^3)$ | Closed-form solution |
When ERM is NP-hard, we use heuristics (e.g., greedy algorithms, local search) or optimize a convex surrogate of the loss function.
Theoretical ERM analysis assumes exact minimization of empirical risk over $\mathcal{H}$. Practical algorithms often (1) use surrogate losses, (2) employ approximate optimization, and (3) add regularization. The gap between theoretical ERM and practical algorithms is an area of ongoing research.
To make ERM computationally tractable, we often replace the true loss with a surrogate that is easier to optimize.
For classification, 0-1 loss is non-convex and discontinuous. Common surrogates include:
Hinge Loss (SVM): $$\ell_{\text{hinge}}(f(x), y) = \max(0, 1 - y \cdot f(x))$$
Logistic Loss: $$\ell_{\text{logistic}}(f(x), y) = \log(1 + e^{-y \cdot f(x)})$$
Exponential Loss (AdaBoost): $$\ell_{\text{exp}}(f(x), y) = e^{-y \cdot f(x)}$$
These are convex in $f(x)$, enabling efficient optimization. Crucially, they are calibrated: minimizing the surrogate risk also minimizes the 0-1 risk (under mild conditions).
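Writing the losses as functions of the margin $m = y \cdot f(x)$ makes their shapes easy to compare; the snippet below evaluates them at a few arbitrarily chosen margins alongside the 0-1 loss:

```python
import numpy as np

# The surrogate losses above as functions of the margin m = y * f(x),
# compared with the 0-1 loss; each surrogate is convex in the margin.
def zero_one(m):    return (m <= 0).astype(float)
def hinge(m):       return np.maximum(0.0, 1.0 - m)
def logistic(m):    return np.log(1.0 + np.exp(-m))
def exponential(m): return np.exp(-m)

margins = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for name, loss in [("0-1", zero_one), ("hinge", hinge),
                   ("logistic", logistic), ("exp", exponential)]:
    print(f"{name:>8}: {np.round(loss(margins), 3)}")
```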
Definition (Calibration): A surrogate $\phi$ is calibrated if minimizing $\mathbb{E}[\phi(f(x), y)]$ over all measurable $f$ also yields the Bayes optimal classifier for 0-1 loss.
Even with convex surrogates, complex hypothesis classes (like neural networks) require iterative optimization. The choice of optimizer can affect the final hypothesis:
Gradient Descent: Tends to find solutions that interpolate data "smoothly" (minimum norm solutions in certain settings)
Stochastic Gradient Descent: Introduces noise that can aid generalization (acts like implicit regularization)
Adaptive Methods (Adam, RMSprop): May converge to different solutions than vanilla gradient descent
This implicit bias of optimization is particularly important for deep learning, where the hypothesis class is so large that many hypotheses achieve zero training error. The optimizer's trajectory selects among these, and some selections generalize better than others.
Key Insight: In modern deep learning, we're not just doing ERM—we're doing ERM + implicit regularization from optimization + explicit regularization + architecture bias. Disentangling these effects is an active research frontier.
The theoretical foundation for ERM rests on the concept of uniform convergence—the idea that empirical risk approximates true risk uniformly over all hypotheses in $\mathcal{H}$.
Definition: A hypothesis class $\mathcal{H}$ has the uniform convergence property with respect to a distribution $\mathcal{D}$ if:
$$\sup_{h \in \mathcal{H}} |\hat{R}_S(h) - R(h)| \xrightarrow{p} 0 \quad \text{as } n \to \infty$$
That is, the maximum deviation between empirical and true risk (over all hypotheses) goes to zero in probability.
Equivalently, for any $\epsilon, \delta > 0$, there exists $n_0$ such that for $n \geq n_0$: $$\mathbb{P}\left[\sup_{h \in \mathcal{H}} |\hat{R}_S(h) - R(h)| > \epsilon\right] < \delta$$
Theorem: If $\mathcal{H}$ has the uniform convergence property, then ERM is a consistent learner for $\mathcal{H}$. Specifically:
Let $h^* = \arg\min_{h \in \mathcal{H}} R(h)$ and $h_{\text{ERM}} = \arg\min_{h \in \mathcal{H}} \hat{R}_S(h)$. If uniform convergence holds: $$R(h_{\text{ERM}}) \to R(h^*) \quad \text{as } n \to \infty$$
Proof Sketch: On the event $\sup_{h \in \mathcal{H}} |\hat{R}_S(h) - R(h)| \leq \epsilon$, we have $$R(h_{\text{ERM}}) \leq \hat{R}_S(h_{\text{ERM}}) + \epsilon \leq \hat{R}_S(h^*) + \epsilon \leq R(h^*) + 2\epsilon.$$ Uniform convergence says this event has probability approaching one for every fixed $\epsilon > 0$, so $R(h_{\text{ERM}}) \to R(h^*)$ in probability.
A deep result (which we'll encounter when studying VC dimension) states that for binary classification with 0-1 loss, uniform convergence is equivalent to learnability. That is, a class is learnable (has finite sample complexity) if and only if it has the uniform convergence property. This is called the Fundamental Theorem of Statistical Learning.
Which hypothesis classes have the uniform convergence property? This is precisely what complexity measures like VC dimension and Rademacher complexity characterize: roughly, uniform convergence holds when the VC dimension is finite (for binary classification) or when the Rademacher complexity vanishes as $n \to \infty$.
These complexity measures will be the subject of subsequent sections. They provide the "right" notion of hypothesis class size—richer than simply counting hypotheses.
We have examined the principle of Empirical Risk Minimization from multiple angles—its definition, guarantees, limitations, and practical considerations. The key insight: ERM is sound exactly when the empirical risk tracks the true risk uniformly over $\mathcal{H}$, which requires the complexity of $\mathcal{H}$ to be controlled relative to the sample size.
ERM tells us what to optimize; now we need to understand the gap between what we optimize and what we care about.
The next page explores the relationship between true risk and empirical risk in depth. We'll formalize the generalization gap, understand its sources, and see how it decomposes into interpretable components. This will set the stage for the bias-variance tradeoff and the central complexity measures (VC dimension, Rademacher complexity) that characterize learnability.
You now understand Empirical Risk Minimization as a formal learning principle. You can state when ERM provides guarantees (finite classes, uniform convergence), why it can fail (overfitting with complex classes), and how inductive bias is essential for learning. You're equipped to analyze ERM-based algorithms and understand their theoretical foundations.