At the heart of machine learning lies a profound challenge: how can we guarantee that a model learned from a finite sample of data will perform well on unseen examples? This question, seemingly philosophical, has precise mathematical answers—and those answers begin with one of the simplest yet most powerful tools in probability theory: the union bound.
The union bound, also known as Boole's inequality, may appear deceptively simple. It states that the probability of at least one of several events occurring is at most the sum of their individual probabilities. Yet this elementary observation is the cornerstone upon which the entire edifice of statistical learning theory is constructed.
On this page, we embark on a rigorous exploration of the union bound. We will derive it from first principles, understand its geometric and probabilistic intuitions, and see how it enables us to transform statements about individual hypotheses into statements about entire hypothesis classes. This transition—from reasoning about a single function to reasoning about families of functions—is precisely what makes learning theory possible.
By the end of this page, you will: (1) Understand the union bound and its proof from foundational probability axioms; (2) Grasp the probabilistic intuition behind 'bad event' analysis in learning theory; (3) See how the union bound enables generalization guarantees for finite hypothesis classes; (4) Appreciate why the union bound is both fundamental and insufficient, motivating more sophisticated tools; (5) Apply the union bound to derive your first sample complexity result.
Let us begin with the precise mathematical statement of the union bound, building from the foundational axioms of probability theory.
Definition (Union Bound / Boole's Inequality): Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, and let $A_1, A_2, \ldots, A_n$ be a finite collection of events in $\mathcal{F}$. Then:
$$\mathbb{P}\left(\bigcup_{i=1}^{n} A_i\right) \leq \sum_{i=1}^{n} \mathbb{P}(A_i)$$
For countably infinite collections $\{A_i\}_{i=1}^{\infty}$:
$$\mathbb{P}\left(\bigcup_{i=1}^{\infty} A_i\right) \leq \sum_{i=1}^{\infty} \mathbb{P}(A_i)$$
This inequality holds with equality precisely when the events are pairwise disjoint up to null sets (i.e., $\mathbb{P}(A_i \cap A_j) = 0$ for all $i \neq j$); for mutually exclusive events it reduces to the additivity guaranteed by the third Kolmogorov axiom of probability.
The inequality arises because events may overlap. When we sum individual probabilities, we count overlapping regions multiple times. The union bound essentially tells us that naive counting (ignoring overlaps) gives an upper bound. This 'pessimistic' counting is precisely what makes the bound useful—it's always safe, never optimistic.
Equivalent Formulations:
The union bound admits several equivalent formulations that prove useful in different contexts:
1. Complementary Form: For events $A_1, \ldots, A_n$, define the complement events $B_i = A_i^c$. Then:
$$\mathbb{P}\left(\bigcap_{i=1}^{n} B_i\right) \geq 1 - \sum_{i=1}^{n} \mathbb{P}(A_i)$$
This form is crucial in learning theory: if each $A_i$ represents a 'bad event' (e.g., hypothesis $h_i$ has large generalization error), then $\bigcap_i B_i$ represents 'all good events occur simultaneously.'
2. Failure Probability Form: If $\mathbb{P}(A_i) \leq \delta_i$ for each $i$, then:
$$\mathbb{P}\left(\bigcup_{i=1}^{n} A_i\right) \leq \sum_{i=1}^{n} \delta_i$$
Setting $\delta_i = \delta/n$ for all $i$ ensures the total failure probability is at most $\delta$. This 'budget allocation' is fundamental to uniform convergence arguments.
| Formulation | Mathematical Form | Learning Theory Meaning |
|---|---|---|
| Standard | $\mathbb{P}(\cup A_i) \leq \sum \mathbb{P}(A_i)$ | Probability of any bad event is bounded by sum |
| Complementary | $\mathbb{P}(\cap B_i) \geq 1 - \sum \mathbb{P}(A_i)$ | Probability all good events occur simultaneously |
| Budget Allocation | $\delta_i = \delta/n \Rightarrow \mathbb{P}(\cup A_i) \leq \delta$ | Divide failure budget equally among events |
| Contrapositive | If $\sum \mathbb{P}(A_i) < 1$, then $\mathbb{P}(\cap B_i) > 0$ | Existence proofs: good configurations exist |
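To make these formulations concrete, here is a small Monte Carlo sanity check (a minimal sketch; the event probabilities, sample count, and seed are illustrative choices, not from the text). It estimates the standard, complementary, and budget-allocation forms for randomly generated independent events:

```python
import numpy as np

rng = np.random.default_rng(0)
n_events, n_samples = 5, 200_000

# Five events with modest probabilities; independence makes overlaps generic.
p = rng.uniform(0.01, 0.10, size=n_events)
occurs = rng.uniform(size=(n_samples, n_events)) < p   # each row: one draw of omega

p_union = occurs.any(axis=1).mean()        # Monte Carlo estimate of P(union A_i)
sum_p = occurs.mean(axis=0).sum()          # sum of estimated P(A_i)
print(f"P(union) ≈ {p_union:.4f} <= sum P(A_i) ≈ {sum_p:.4f}")

# Complementary form: all 'good' events hold with probability >= 1 - sum P(A_i).
p_all_good = (~occurs).all(axis=1).mean()
print(f"P(all good) ≈ {p_all_good:.4f} >= {1 - sum_p:.4f}")

# Budget allocation: cap each P(A_i) at delta/n to keep total failure <= delta.
delta = 0.05
bad = rng.uniform(size=(n_samples, n_events)) < delta / n_events
print(f"P(any bad) ≈ {bad.any(axis=1).mean():.4f} <= delta = {delta}")
```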
The union bound follows directly from the axioms of probability. We present two proofs: one by induction for finite collections, and one using indicator functions for deeper insight.
Proof 1: By Mathematical Induction
Base Case (n = 1): $\mathbb{P}(A_1) \leq \mathbb{P}(A_1)$ trivially.
Base Case (n = 2): Using the inclusion-exclusion principle: $$\mathbb{P}(A_1 \cup A_2) = \mathbb{P}(A_1) + \mathbb{P}(A_2) - \mathbb{P}(A_1 \cap A_2)$$
Since $\mathbb{P}(A_1 \cap A_2) \geq 0$, we have: $$\mathbb{P}(A_1 \cup A_2) \leq \mathbb{P}(A_1) + \mathbb{P}(A_2)$$
Inductive Step: Assume the bound holds for $n$ events. For $n+1$ events: $$\mathbb{P}\left(\bigcup_{i=1}^{n+1} A_i\right) = \mathbb{P}\left(\left(\bigcup_{i=1}^{n} A_i\right) \cup A_{n+1}\right)$$
Applying the $n=2$ case: $$\leq \mathbb{P}\left(\bigcup_{i=1}^{n} A_i\right) + \mathbb{P}(A_{n+1})$$
Applying the inductive hypothesis: $$\leq \sum_{i=1}^{n} \mathbb{P}(A_i) + \mathbb{P}(A_{n+1}) = \sum_{i=1}^{n+1} \mathbb{P}(A_i)$$
$\square$
This induction is not merely a proof technique—it reflects how we extend guarantees. If we can bound the failure probability for each hypothesis individually, we can extend guarantees to the entire hypothesis class by paying the 'cost' of considering multiple hypotheses (the sum).
Proof 2: Via Indicator Functions
This proof provides deeper insight into why the bound works and when it is tight.
For any event $A$, define the indicator random variable $\mathbf{1}_A$ as: $$\mathbf{1}_A(\omega) = \begin{cases} 1 & \text{if } \omega \in A \\ 0 & \text{if } \omega \notin A \end{cases}$$
Note that $\mathbb{P}(A) = \mathbb{E}[\mathbf{1}_A]$.
Key Observation: For any collection of events: $$\mathbf{1}_{\cup_i A_i} \leq \sum_i \mathbf{1}_{A_i}$$
This inequality holds pointwise: the left side is 1 if any $A_i$ occurs (and 0 otherwise), while the right side counts how many events occur. Since at least one occurring means the count is at least 1, the inequality holds.
Completing the Proof: $$\mathbb{P}\left(\bigcup_i A_i\right) = \mathbb{E}\left[\mathbf{1}_{\cup_i A_i}\right] \leq \mathbb{E}\left[\sum_i \mathbf{1}_{A_i}\right] = \sum_i \mathbb{E}[\mathbf{1}_{A_i}] = \sum_i \mathbb{P}(A_i)$$
$\square$
Tightness Analysis:
The indicator function proof reveals when the union bound is tight. The bound is tight (equality holds) when: $$\mathbf{1}_{\cup_i A_i} = \sum_i \mathbf{1}_{A_i}$$
This occurs if and only if at most one event can occur at any point $\omega \in \Omega$—i.e., the events are mutually exclusive.
When is the Bound Loose?
The bound is loose when events have significant overlap. Consider $n$ identical events, $A_i = A$ for all $i$, each with probability $1/2$. Then $\mathbb{P}\left(\bigcup_i A_i\right) = \mathbb{P}(A) = 1/2$, while the union bound gives $n/2$.
The bound overestimates by a factor of $n$! This looseness when hypotheses are 'similar' motivates the development of more sophisticated bounds (Rademacher complexity, covering numbers) that account for the structure of hypothesis classes.
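These tightness and looseness regimes can be verified numerically. The sketch below (with made-up interval events on $[0, 1]$; the endpoints are arbitrary) checks the pointwise indicator inequality from the proof, then compares overlapping events (loose) against disjoint events (tight):

```python
import numpy as np

rng = np.random.default_rng(1)
omega = rng.uniform(size=100_000)          # sample points in [0, 1]

# Overlapping interval events: they share probability mass.
A = [(0.0, 0.5), (0.3, 0.8), (0.6, 1.0)]
ind = np.array([(lo <= omega) & (omega < hi) for lo, hi in A])  # rows: 1_{A_i}

lhs = ind.any(axis=0).astype(int)          # 1_{union A_i}
rhs = ind.sum(axis=0)                      # sum_i 1_{A_i}
assert np.all(lhs <= rhs)                  # the pointwise inequality holds

print(f"overlapping: P(union) ≈ {lhs.mean():.3f}, sum P(A_i) ≈ {ind.mean(axis=1).sum():.3f}")

# Disjoint interval events: the two sides agree, so the bound is tight.
B = [(0.0, 0.2), (0.4, 0.6), (0.8, 1.0)]
ind_B = np.array([(lo <= omega) & (omega < hi) for lo, hi in B])
print(f"disjoint:    P(union) ≈ {ind_B.any(axis=0).mean():.3f}, "
      f"sum P(B_i) ≈ {ind_B.mean(axis=1).sum():.3f}")
```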
To truly internalize the union bound, we develop both geometric and probabilistic intuitions that illuminate its behavior and limitations.
Geometric Interpretation:
Consider the sample space $\Omega$ as a unit square equipped with Lebesgue measure (area). Each event $A_i$ corresponds to a region within this square, and the probability $\mathbb{P}(A_i)$ is the area of region $A_i$.
The Venn Diagram Perspective:
For two events: $$\mathbb{P}(A \cup B) = \mathbb{P}(A) + \mathbb{P}(B) - \mathbb{P}(A \cap B)$$
The union bound drops the correction term $\mathbb{P}(A \cap B)$, accepting an overestimate. For many events, inclusion-exclusion becomes: $$\mathbb{P}\left(\bigcup_i A_i\right) = \sum_i \mathbb{P}(A_i) - \sum_{i<j} \mathbb{P}(A_i \cap A_j) + \ldots$$
The union bound ignores all correction terms (which alternate in sign), yielding a simpler but coarser bound.
In learning theory, we typically analyze 'bad events' (large generalization error). Overestimating the probability of bad events is conservative but safe—we never underestimate risk. This philosophical stance pervades statistical learning: we derive bounds that are always valid, even if sometimes loose.
Probabilistic Intuition: The 'Any Bad Event' Perspective
Imagine you're designing a system with $n$ components, each of which fails independently with probability $p$. What's the probability that at least one component fails?
Exact Answer (for independent events): $$\mathbb{P}(\text{at least one failure}) = 1 - (1-p)^n$$
Union Bound Approximation: $$\mathbb{P}(\text{at least one failure}) \leq np$$
Comparison:
For small $p$ and moderate $n$, $(1-p)^n \approx e^{-np} \approx 1 - np$ (first-order Taylor expansion), so: $$1 - (1-p)^n \approx np$$
The union bound is nearly tight when events are unlikely and roughly independent! This is precisely the regime of learning theory: each 'bad event' (a specific hypothesis having large generalization error) has small probability, and we consider many such events.
| Number of Events $n$ (with $p = 0.01$) | Exact $1 - (1-p)^n$ | Union Bound $np$ | Relative Error |
|---|---|---|---|
| 10 | 0.0956 | 0.10 | 4.6% |
| 50 | 0.395 | 0.50 | 26.6% |
| 100 | 0.634 | 1.00 (vacuous) | 57.8% |
| 200 | 0.866 | 2.00 (vacuous) | 131% |
The table reveals a critical insight: the union bound works well when $np \ll 1$ (few expected failures) but degrades as $np$ approaches or exceeds 1. In learning theory, this translates to: the union bound is effective for moderate hypothesis classes with sufficiently large samples, but degrades for very large (or infinite) hypothesis classes.
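The table's values can be reproduced in a few lines, assuming $p = 0.01$ as in the table:

```python
import math

p = 0.01
for n in [10, 50, 100, 200]:
    exact = 1 - (1 - p) ** n      # exact failure probability for independent events
    bound = n * p                 # union bound (may exceed 1, becoming vacuous)
    rel_err = (bound - exact) / exact
    print(f"n={n:4d}: exact={exact:.4f}, union bound={bound:.2f}, "
          f"relative error={rel_err:.1%}")
```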
This limitation motivates the development of tools that go beyond raw counting, such as VC dimension, Rademacher complexity, and covering numbers, which measure the effective size of a hypothesis class.
We now apply the union bound to derive the foundational generalization guarantee for finite hypothesis classes—the starting point for all of PAC learning.
The Learning Setup:
Let $\mathcal{H} = \{h_1, h_2, \ldots, h_M\}$ be a finite hypothesis class with $|\mathcal{H}| = M$ hypotheses. Data $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ is drawn i.i.d. from an unknown distribution $\mathcal{D}$.
For each hypothesis $h \in \mathcal{H}$, define the true risk $R(h)$ and the empirical risk $\hat{R}(h)$. For binary classification with 0-1 loss: $$R(h) = \mathbb{P}_{(x,y) \sim \mathcal{D}}[h(x) \neq y], \qquad \hat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}[h(x_i) \neq y_i]$$
The Fundamental Question:
We want to bound the probability that the empirical risk $\hat{R}(h)$ deviates significantly from the true risk $R(h)$ for any hypothesis $h \in \mathcal{H}$.
Step 1: Define Bad Events
For each hypothesis $h_i$ and tolerance $\epsilon > 0$, define the 'bad event': $$A_i = \{|\hat{R}(h_i) - R(h_i)| > \epsilon\}$$
This is the event that hypothesis $h_i$ has empirical risk that deviates from its true risk by more than $\epsilon$.
Step 2: Apply the Union Bound
The event 'some hypothesis has large deviation' is: $$\bigcup_{i=1}^{M} A_i = \{\exists h \in \mathcal{H}: |\hat{R}(h) - R(h)| > \epsilon\}$$
By the union bound: $$\mathbb{P}\left(\exists h \in \mathcal{H}: |\hat{R}(h) - R(h)| > \epsilon\right) \leq \sum_{i=1}^{M} \mathbb{P}(|\hat{R}(h_i) - R(h_i)| > \epsilon)$$
Notice the magic here: we've converted a statement about 'any hypothesis in the class' (which depends on the entire class structure) into a sum of individual statements. Each individual term can be bounded using concentration inequalities (like Hoeffding's, which we'll cover next page). The union bound is the bridge between individual and uniform guarantees.
Step 3: Bound Individual Terms
For now, assume we have a bound on individual deviations (we'll derive this rigorously with Hoeffding's inequality): $$\mathbb{P}(|\hat{R}(h_i) - R(h_i)| > \epsilon) \leq 2e^{-2m\epsilon^2}$$
This is Hoeffding's bound for the average of $m$ i.i.d. bounded random variables.
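As a preview of the next page, this assumed bound can be checked by simulation. The sketch below (the true risk of 0.3, the sample sizes, and the trial count are illustrative choices) estimates the deviation probability of an empirical mean of 0-1 losses and compares it with $2e^{-2m\epsilon^2}$:

```python
import numpy as np

rng = np.random.default_rng(2)
true_risk, epsilon, n_trials = 0.3, 0.1, 50_000

for m in [50, 100, 200]:
    # Each row: m i.i.d. 0/1 losses; the row mean is one empirical risk.
    losses = rng.random((n_trials, m)) < true_risk
    deviations = np.abs(losses.mean(axis=1) - true_risk)
    empirical = (deviations > epsilon).mean()
    hoeffding = 2 * np.exp(-2 * m * epsilon**2)
    print(f"m={m:4d}: P(|R_hat - R| > {epsilon}) ≈ {empirical:.4f} "
          f"<= Hoeffding bound {hoeffding:.4f}")
```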
Step 4: Combine via Union Bound
$$\mathbb{P}\left(\exists h \in \mathcal{H}: |\hat{R}(h) - R(h)| > \epsilon\right) \leq M \cdot 2e^{-2m\epsilon^2} = 2Me^{-2m\epsilon^2}$$
Step 5: Solve for Sample Complexity
We want this probability to be at most $\delta$. Setting $2Me^{-2m\epsilon^2} \leq \delta$ and solving for $m$:
$$e^{-2m\epsilon^2} \leq \frac{\delta}{2M}$$ $$-2m\epsilon^2 \leq \ln\left(\frac{\delta}{2M}\right)$$ $$m \geq \frac{1}{2\epsilon^2} \ln\left(\frac{2M}{\delta}\right)$$
Theorem: For a finite hypothesis class $\mathcal{H}$ with $|\mathcal{H}| = M$, with probability at least $1 - \delta$, simultaneously for all $h \in \mathcal{H}$: $$|\hat{R}(h) - R(h)| \leq \sqrt{\frac{\ln(2M/\delta)}{2m}}$$
Equivalently, $m = O\left(\frac{1}{\epsilon^2}\ln\frac{M}{\delta}\right)$ samples suffice for $\epsilon$-uniform convergence.
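The theorem translates directly into a sample-size calculator (a minimal sketch; the function names are ours):

```python
import math

def uniform_convergence_sample_size(epsilon: float, delta: float, M: int) -> int:
    """m such that 2*M*exp(-2*m*eps^2) <= delta, i.e. m >= ln(2M/delta)/(2 eps^2)."""
    return math.ceil(math.log(2 * M / delta) / (2 * epsilon**2))

def uniform_deviation(m: int, delta: float, M: int) -> float:
    """The epsilon guaranteed by m samples at confidence level 1 - delta."""
    return math.sqrt(math.log(2 * M / delta) / (2 * m))

# Example: M = 1000, eps = 0.1, delta = 0.05 -> 530 samples.
print(uniform_convergence_sample_size(0.1, 0.05, M=1000))

# High confidence is cheap: each 10x shrink in delta adds only ln(10)/(2 eps^2) samples.
for delta in [0.05, 0.005, 0.0005]:
    print(delta, uniform_convergence_sample_size(0.1, delta, M=1000))
```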
The sample complexity result $m = O\left(\frac{1}{\epsilon^2}\ln\frac{M}{\delta}\right)$ encapsulates fundamental insights about learning. Let's dissect each component.
The $1/\epsilon^2$ Dependence:
The required sample size scales inversely with the square of the accuracy $\epsilon$. This is not an artifact of our proof technique; it is fundamental.
The quadratic dependence arises from the central limit theorem: the standard error of an empirical mean decreases as $1/\sqrt{m}$, so achieving $\epsilon$ accuracy requires $m \propto 1/\epsilon^2$.
The $\ln(M)$ Dependence:
The sample complexity grows only logarithmically with the hypothesis class size! This is remarkable: doubling the class adds only $\ln(2)/(2\epsilon^2)$ additional samples, and even squaring it merely doubles the $\ln M$ term.
This logarithmic dependence is the union bound's gift to learning theory: we can consider exponentially large hypothesis classes with only linear growth in sample requirements.
| Hypothesis Class Size $M$ | $\ln(2M/\delta)$ | Required Samples $m$ (at $\epsilon = 0.1$, $\delta = 0.05$) |
|---|---|---|
| 10 | 5.99 | 300 |
| 100 | 8.29 | 415 |
| 1,000 | 10.60 | 530 |
| 10,000 | 12.90 | 645 |
| $10^6$ | 17.52 | 876 |
| $10^{12}$ | 31.32 | 1,566 |
| $10^{100}$ | 233.95 | 11,698 |
The $\ln(1/\delta)$ Dependence:
The dependence on the confidence level $\delta$ is also logarithmic: shrinking $\delta$ from $0.05$ to $0.0005$ adds only $\ln(100) \approx 4.6$ to the numerator of the bound.
High confidence is 'cheap'—we can increase confidence dramatically with modest sample increases.
While the union bound provides our first generalization guarantee, it has significant limitations that motivate the sophisticated machinery of modern statistical learning theory.
Limitation 1: Requires Finite Hypothesis Classes
The bound $2Me^{-2m\epsilon^2}$ is only meaningful when $M$ is finite. But many important hypothesis classes are infinite: linear classifiers with real-valued weights, neural networks with continuous parameters, and decision stumps with a continuous threshold all contain uncountably many hypotheses.
For infinite $M$, the union bound yields the vacuous bound $\infty$. We need a way to measure the effective size of an infinite class.
The union bound over an infinite class is useless: $\sum_{i=1}^{\infty} p_i$ diverges unless $p_i \to 0$ fast enough. Even for very small individual probabilities, an infinite sum can diverge. VC theory addresses this by showing that infinite classes often 'behave like' finite classes of bounded effective size.
Limitation 2: Ignores Hypothesis Similarity
The union bound treats all hypotheses as independent entities. But in reality, similar hypotheses have correlated errors: two decision stumps whose thresholds differ only slightly misclassify nearly the same points, so their bad events nearly coincide.
The union bound doesn't 'notice' this correlation and charges the full $\mathbb{P}(A_1) + \mathbb{P}(A_2)$ even when $A_1 \approx A_2$.
Approaches that exploit this structure, such as Rademacher complexity and covering numbers, charge only for the number of genuinely distinct behaviors in the class.
Limitation 3: Worst-Case Over Distributions
The bound holds uniformly over all possible data distributions $\mathcal{D}$. But for specific 'nice' distributions (e.g., low noise or large margins), tighter bounds, and sometimes faster rates, are possible.
Limitation 4: Multiplicative Overhead
The union bound introduces a multiplicative factor of $M$ (equivalently, an additive $\ln M$ term in the exponent). For very large but finite classes, this can be substantial: while logarithmic growth is remarkable, the $\ln M$ term can still dominate the bound for highly expressive classes.
The Path Forward:
These limitations set the agenda for the remainder of this module: concentration inequalities for individual deviations (next page), VC dimension for infinite classes, Rademacher complexity for data-dependent bounds, margin-based analysis for geometric structure, and PAC-Bayes for algorithm-specific guarantees.
Let's apply the union bound to a concrete hypothesis class: decision stumps. This example illustrates the full machinery in action.
Setup:
Consider binary classification over $\mathcal{X} = [0, 1]$ (unit interval). A decision stump is a classifier of the form:
$$h_{\theta,s}(x) = \begin{cases} +1 & \text{if } s \cdot x \geq s \cdot \theta \\ -1 & \text{otherwise} \end{cases}$$
where $\theta \in [0, 1]$ is a threshold and $s \in \{+1, -1\}$ is the direction.
Problem: The continuous threshold makes the class infinite. How do we apply the union bound?
Solution: Discretization
For a training set of $m$ points $x_1 < x_2 < \ldots < x_m$, the empirical risk of a stump only changes when the threshold crosses a data point. Thus we need only consider one representative threshold from each of the $m + 1$ intervals determined by the sorted data, together with the two directions $s \in \{+1, -1\}$.
This gives $M = 2(m + 1) \leq 4m$ distinct empirical behaviors.
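This count is easy to sanity-check by enumeration (a minimal sketch with randomly drawn points; the observed count can fall slightly below $2(m+1)$ because the all-positive and all-negative labelings arise from both directions). The sample complexity computation that follows then plugs this $M$ into the union-bound recipe:

```python
import numpy as np

rng = np.random.default_rng(3)
m = 8
x = np.sort(rng.uniform(size=m))

# Candidate thresholds: one below all points, one in each gap, one above all.
gaps = np.concatenate(([x[0] - 1], (x[:-1] + x[1:]) / 2, [x[-1] + 1]))

labelings = set()
for s in (+1, -1):
    for theta in gaps:
        labelings.add(tuple(np.where(s * x >= s * theta, 1, -1)))

print(f"m = {m}: distinct labelings = {len(labelings)}, "
      f"bound 2(m+1) = {2 * (m + 1)}")
```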
```python
import numpy as np

def decision_stump_sample_complexity(epsilon: float, delta: float, n_thresholds: int) -> int:
    """
    Compute sample complexity for decision stumps using the union bound.

    For decision stumps on 1D data with n_thresholds discretization points,
    the effective hypothesis class size is M = 2 * (n_thresholds + 1).

    Args:
        epsilon: Desired accuracy (|empirical - true risk| <= epsilon)
        delta: Failure probability
        n_thresholds: Number of threshold positions considered

    Returns:
        Required number of samples m

    Note:
        This is a simplified bound. In practice, the threshold positions
        depend on the data, introducing subtle dependencies.
    """
    M = 2 * (n_thresholds + 1)  # 2 directions × (n+1) positions
    # From: 2 * M * exp(-2 * m * epsilon^2) <= delta
    # Solving: m >= ln(2M / delta) / (2 * epsilon^2)
    m = np.log(2 * M / delta) / (2 * epsilon**2)
    return int(np.ceil(m))

# Example calculations
print("Decision Stump Sample Complexity (ε=0.1, δ=0.05)")
print("-" * 50)

for n_thresh in [10, 100, 1000, 10000]:
    m = decision_stump_sample_complexity(0.1, 0.05, n_thresh)
    print(f"n_thresholds = {n_thresh:,}: m = {m:,} samples")

print()
print("Note: As we refine the discretization, sample complexity")
print("grows only logarithmically in the number of thresholds!")
```

Detailed Calculation:
Let's work through the mathematics for $m = 1000$ samples, $\epsilon = 0.05$, $\delta = 0.01$.
Effective Hypothesis Class Size: $$M = 2(m + 1) = 2(1001) = 2002$$
Required Sample Complexity: $$m \geq \frac{\ln(2M/\delta)}{2\epsilon^2} = \frac{\ln(2 \cdot 2002 / 0.01)}{2 \cdot 0.05^2} = \frac{\ln(400400)}{0.005} = \frac{12.90}{0.005} = 2580$$
Interpretation: Wait—we assumed $m = 1000$ but the bound requires $m \geq 2580$! This reveals the circular nature of the problem: $M$ depends on $m$, but we need $m$ to bound $M$.
Resolving the Circularity: Since $M = 2(m+1) \leq 3m$ for $m \geq 2$, we need: $$m \geq \frac{\ln(6m/\delta)}{2\epsilon^2}$$
Solving iteratively or via the Lambert W function, we find the fixed point. For $\epsilon = 0.05$, $\delta = 0.01$: $$m \approx 2900$$
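A simple fixed-point iteration makes this concrete (a minimal sketch; the stopping tolerance is an arbitrary choice, and convergence is fast because the right-hand side grows only logarithmically in $m$):

```python
import math

def fixed_point_m(epsilon: float, delta: float, tol: float = 0.5) -> int:
    """Solve m = ln(6m / delta) / (2 eps^2) by fixed-point iteration."""
    m = 1.0
    while True:
        m_next = math.log(6 * m / delta) / (2 * epsilon**2)
        if abs(m_next - m) < tol:
            return math.ceil(m_next)
        m = m_next

print(fixed_point_m(0.05, 0.01))  # -> 2872, consistent with m ≈ 2900 above
```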
This example illustrates a subtle point: for decision stumps, the 'effective' hypothesis class size grows with the sample size. The VC dimension framework (covered in Module 3) handles this elegantly by showing that decision stumps have VC dimension 2, giving sample complexity $O\left(\frac{1}{\epsilon^2}\log\frac{1}{\delta}\right)$ with no circular dependence on the sample size itself.
We have established the union bound as the foundational tool of statistical learning theory. Let us consolidate the key insights from this page.
What's Next:
The union bound tells us how to combine individual probability bounds, but we need tools to derive those individual bounds. The next page introduces Hoeffding's inequality—the workhorse concentration inequality that bounds how empirical averages deviate from their expectations. Together, the union bound + Hoeffding's inequality form the complete foundation for finite-class PAC learning.
From there, we'll develop the sophisticated machinery needed for infinite hypothesis classes: Rademacher complexity for data-dependent bounds, margin-based analysis for geometric structure, and PAC-Bayes for algorithm-specific guarantees.
You now understand the union bound—its statement, proof, geometric intuition, and application to learning theory. You've seen how it enables the first generalization guarantees for finite hypothesis classes, and you understand its limitations that motivate more sophisticated techniques. Next, we develop Hoeffding's inequality to complete the foundation.