At the heart of machine learning lies a profound challenge: how can we guarantee that a model learned from a finite sample of data will perform well on unseen examples? This question, seemingly philosophical, has precise mathematical answers—and those answers begin with one of the simplest yet most powerful tools in probability theory: the union bound.
The union bound, also known as Boole's inequality, may appear deceptively simple. It states that the probability of at least one of several events occurring is at most the sum of their individual probabilities. Yet this elementary observation is the cornerstone upon which the entire edifice of statistical learning theory is constructed.
On this page, we embark on a rigorous exploration of the union bound. We will derive it from first principles, understand its geometric and probabilistic intuitions, and see how it enables us to transform statements about individual hypotheses into statements about entire hypothesis classes. This transition—from reasoning about a single function to reasoning about families of functions—is precisely what makes learning theory possible.
By the end of this page, you will: (1) Understand the union bound and its proof from foundational probability axioms; (2) Grasp the probabilistic intuition behind 'bad event' analysis in learning theory; (3) See how the union bound enables generalization guarantees for finite hypothesis classes; (4) Appreciate why the union bound is both fundamental and insufficient, motivating more sophisticated tools; (5) Apply the union bound to derive your first sample complexity result.
Let us begin with the precise mathematical statement of the union bound, building from the foundational axioms of probability theory.
Definition (Union Bound / Boole's Inequality): Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, and let $A_1, A_2, \ldots, A_n$ be a finite collection of events in $\mathcal{F}$. Then:
$$\mathbb{P}\left(\bigcup_{i=1}^{n} A_i\right) \leq \sum_{i=1}^{n} \mathbb{P}(A_i)$$
For countably infinite collections $\{A_i\}_{i=1}^{\infty}$:
$$\mathbb{P}\left(\bigcup_{i=1}^{\infty} A_i\right) \leq \sum_{i=1}^{\infty} \mathbb{P}(A_i)$$
This inequality holds with equality precisely when the events are pairwise disjoint up to null sets (i.e., $\mathbb{P}(A_i \cap A_j) = 0$ for all $i \neq j$); for mutually exclusive events it reduces to the additivity guaranteed by the third Kolmogorov axiom of probability.
The inequality arises because events may overlap. When we sum individual probabilities, we count overlapping regions multiple times. The union bound essentially tells us that naive counting (ignoring overlaps) gives an upper bound. This 'pessimistic' counting is precisely what makes the bound useful—it's always safe, never optimistic.
Equivalent Formulations:
The union bound admits several equivalent formulations that prove useful in different contexts:
1. Complementary Form: For events $A_1, \ldots, A_n$, define the complement events $B_i = A_i^c$. Then:
$$\mathbb{P}\left(\bigcap_{i=1}^{n} B_i\right) \geq 1 - \sum_{i=1}^{n} \mathbb{P}(A_i)$$
This form is crucial in learning theory: if each $A_i$ represents a 'bad event' (e.g., hypothesis $h_i$ has large generalization error), then $\bigcap_i B_i$ represents 'all good events occur simultaneously.'
2. Failure Probability Form: If $\mathbb{P}(A_i) \leq \delta_i$ for each $i$, then:
$$\mathbb{P}\left(\bigcup_{i=1}^{n} A_i\right) \leq \sum_{i=1}^{n} \delta_i$$
Setting $\delta_i = \delta/n$ for all $i$ ensures the total failure probability is at most $\delta$. This 'budget allocation' is fundamental to uniform convergence arguments.
| Formulation | Mathematical Form | Learning Theory Meaning |
|---|---|---|
| Standard | $\mathbb{P}(\cup A_i) \leq \sum \mathbb{P}(A_i)$ | Probability of any bad event is bounded by sum |
| Complementary | $\mathbb{P}(\cap B_i) \geq 1 - \sum \mathbb{P}(A_i)$ | Probability all good events occur simultaneously |
| Budget Allocation | $\delta_i = \delta/n \Rightarrow \mathbb{P}(\cup A_i) \leq \delta$ | Divide failure budget equally among events |
| Contrapositive | If $\sum \mathbb{P}(A_i) < 1$, then $\mathbb{P}(\cap B_i) > 0$ | Existence proofs: good configurations exist |
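To make these formulations concrete, here is a small Monte Carlo sanity check (a minimal sketch; the event probabilities, sample count, and seed are illustrative choices, not from the text). It estimates the standard, complementary, and budget-allocation forms for randomly generated independent events:

```python
import numpy as np

rng = np.random.default_rng(0)
n_events, n_samples = 5, 200_000

# Five events with modest probabilities; independence makes overlaps generic.
p = rng.uniform(0.01, 0.10, size=n_events)
occurs = rng.uniform(size=(n_samples, n_events)) < p   # each row: one draw of omega

p_union = occurs.any(axis=1).mean()        # Monte Carlo estimate of P(union A_i)
sum_p = occurs.mean(axis=0).sum()          # sum of estimated P(A_i)
print(f"P(union) ≈ {p_union:.4f} <= sum P(A_i) ≈ {sum_p:.4f}")

# Complementary form: all 'good' events hold with probability >= 1 - sum P(A_i).
p_all_good = (~occurs).all(axis=1).mean()
print(f"P(all good) ≈ {p_all_good:.4f} >= {1 - sum_p:.4f}")

# Budget allocation: cap each P(A_i) at delta/n to keep total failure <= delta.
delta = 0.05
bad = rng.uniform(size=(n_samples, n_events)) < delta / n_events
print(f"P(any bad) ≈ {bad.any(axis=1).mean():.4f} <= delta = {delta}")
```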
The union bound follows directly from the axioms of probability. We present two proofs: one by induction for finite collections, and one using indicator functions for deeper insight.
Proof 1: By Mathematical Induction
Base Case (n = 1): $\mathbb{P}(A_1) \leq \mathbb{P}(A_1)$ trivially.
Base Case (n = 2): Using the inclusion-exclusion principle: $$\mathbb{P}(A_1 \cup A_2) = \mathbb{P}(A_1) + \mathbb{P}(A_2) - \mathbb{P}(A_1 \cap A_2)$$
Since $\mathbb{P}(A_1 \cap A_2) \geq 0$, we have: $$\mathbb{P}(A_1 \cup A_2) \leq \mathbb{P}(A_1) + \mathbb{P}(A_2)$$
Inductive Step: Assume the bound holds for $n$ events. For $n+1$ events: $$\mathbb{P}\left(\bigcup_{i=1}^{n+1} A_i\right) = \mathbb{P}\left(\left(\bigcup_{i=1}^{n} A_i\right) \cup A_{n+1}\right)$$
Applying the $n=2$ case: $$\leq \mathbb{P}\left(\bigcup_{i=1}^{n} A_i\right) + \mathbb{P}(A_{n+1})$$
Applying the inductive hypothesis: $$\leq \sum_{i=1}^{n} \mathbb{P}(A_i) + \mathbb{P}(A_{n+1}) = \sum_{i=1}^{n+1} \mathbb{P}(A_i)$$
$\square$
This induction is not merely a proof technique—it reflects how we extend guarantees. If we can bound the failure probability for each hypothesis individually, we can extend guarantees to the entire hypothesis class by paying the 'cost' of considering multiple hypotheses (the sum).
Proof 2: Via Indicator Functions
This proof provides deeper insight into why the bound works and when it is tight.
For any event $A$, define the indicator random variable $\mathbf{1}_A$ as: $$\mathbf{1}_A(\omega) = \begin{cases} 1 & \text{if } \omega \in A \\ 0 & \text{if } \omega \notin A \end{cases}$$
Note that $\mathbb{P}(A) = \mathbb{E}[\mathbf{1}_A]$.
Key Observation: For any collection of events: $$\mathbf{1}_{\cup_i A_i} \leq \sum_i \mathbf{1}_{A_i}$$
This inequality holds pointwise: the left side is 1 if any $A_i$ occurs (and 0 otherwise), while the right side counts how many events occur. Since at least one occurring means the count is at least 1, the inequality holds.
Completing the Proof: $$\mathbb{P}\left(\bigcup_i A_i\right) = \mathbb{E}\left[\mathbf{1}_{\cup_i A_i}\right] \leq \mathbb{E}\left[\sum_i \mathbf{1}_{A_i}\right] = \sum_i \mathbb{E}[\mathbf{1}_{A_i}] = \sum_i \mathbb{P}(A_i)$$
$\square$
Tightness Analysis:
The indicator function proof reveals when the union bound is tight. The bound is tight (equality holds) when: $$\mathbf{1}_{\cup_i A_i} = \sum_i \mathbf{1}_{A_i}$$
This occurs if and only if at most one event can occur at any point $\omega \in \Omega$—i.e., the events are mutually exclusive.
When is the Bound Loose?
The bound is loose when events have significant overlap. Consider $n$ identical events, $A_i = A$ for all $i$, each with probability $1/2$. Then $\mathbb{P}\left(\bigcup_i A_i\right) = \mathbb{P}(A) = 1/2$, while the union bound gives $n/2$.
The bound overestimates by a factor of $n$! This looseness when hypotheses are 'similar' motivates the development of more sophisticated bounds (Rademacher complexity, covering numbers) that account for the structure of hypothesis classes.
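These tightness and looseness regimes can be verified numerically. The sketch below (with made-up interval events on $[0, 1]$; the endpoints are arbitrary) checks the pointwise indicator inequality from the proof, then compares overlapping events (loose) against disjoint events (tight):

```python
import numpy as np

rng = np.random.default_rng(1)
omega = rng.uniform(size=100_000)          # sample points in [0, 1]

# Overlapping interval events: they share probability mass.
A = [(0.0, 0.5), (0.3, 0.8), (0.6, 1.0)]
ind = np.array([(lo <= omega) & (omega < hi) for lo, hi in A])  # rows: 1_{A_i}

lhs = ind.any(axis=0).astype(int)          # 1_{union A_i}
rhs = ind.sum(axis=0)                      # sum_i 1_{A_i}
assert np.all(lhs <= rhs)                  # the pointwise inequality holds

print(f"overlapping: P(union) ≈ {lhs.mean():.3f}, sum P(A_i) ≈ {ind.mean(axis=1).sum():.3f}")

# Disjoint interval events: the two sides agree, so the bound is tight.
B = [(0.0, 0.2), (0.4, 0.6), (0.8, 1.0)]
ind_B = np.array([(lo <= omega) & (omega < hi) for lo, hi in B])
print(f"disjoint:    P(union) ≈ {ind_B.any(axis=0).mean():.3f}, "
      f"sum P(B_i) ≈ {ind_B.mean(axis=1).sum():.3f}")
```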
To truly internalize the union bound, we develop both geometric and probabilistic intuitions that illuminate its behavior and limitations.
Geometric Interpretation:
Consider the sample space $\Omega$ as a unit square equipped with Lebesgue measure (area). Each event $A_i$ corresponds to a region within this square, and the probability $\mathbb{P}(A_i)$ is the area of region $A_i$.
The Venn Diagram Perspective:
For two events: $$\mathbb{P}(A \cup B) = \mathbb{P}(A) + \mathbb{P}(B) - \mathbb{P}(A \cap B)$$
The union bound drops the correction term $\mathbb{P}(A \cap B)$, accepting an overestimate. For many events, inclusion-exclusion becomes: $$\mathbb{P}\left(\bigcup_i A_i\right) = \sum_i \mathbb{P}(A_i) - \sum_{i<j} \mathbb{P}(A_i \cap A_j) + \ldots$$
The union bound ignores all correction terms (which alternate in sign), yielding a simpler but coarser bound.
In learning theory, we typically analyze 'bad events' (large generalization error). Overestimating the probability of bad events is conservative but safe—we never underestimate risk. This philosophical stance pervades statistical learning: we derive bounds that are always valid, even if sometimes loose.
Probabilistic Intuition: The 'Any Bad Event' Perspective
Imagine you're designing a system with $n$ components, each of which fails independently with probability $p$. What's the probability that at least one component fails?
Exact Answer (for independent events): $$\mathbb{P}(\text{at least one failure}) = 1 - (1-p)^n$$
Union Bound Approximation: $$\mathbb{P}(\text{at least one failure}) \leq np$$
Comparison:
For small $p$ and moderate $n$, $(1-p)^n \approx e^{-np} \approx 1 - np$ (first-order Taylor expansion), so: $$1 - (1-p)^n \approx np$$
The union bound is nearly tight when events are unlikely and roughly independent! This is precisely the regime of learning theory: each 'bad event' (a specific hypothesis having large generalization error) has small probability, and we consider many such events.
| Number of Events $n$ (with $p = 0.01$) | Exact $1 - (1-p)^n$ | Union Bound $np$ | Relative Error |
|---|---|---|---|
| 10 | 0.0956 | 0.10 | 4.6% |
| 50 | 0.395 | 0.50 | 26.6% |
| 100 | 0.634 | 1.00 (vacuous) | 57.8% |
| 200 | 0.866 | 2.00 (vacuous) | 131% |
The table reveals a critical insight: the union bound works well when $np \ll 1$ (few expected failures) but degrades as $np$ approaches or exceeds 1. In learning theory, this translates to: the union bound is effective for moderate hypothesis classes with sufficiently large samples, but degrades for very large (or infinite) hypothesis classes.
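The table's values can be reproduced in a few lines, assuming $p = 0.01$ as in the table:

```python
import math

p = 0.01
for n in [10, 50, 100, 200]:
    exact = 1 - (1 - p) ** n      # exact failure probability for independent events
    bound = n * p                 # union bound (may exceed 1, becoming vacuous)
    rel_err = (bound - exact) / exact
    print(f"n={n:4d}: exact={exact:.4f}, union bound={bound:.2f}, "
          f"relative error={rel_err:.1%}")
```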
This limitation motivates the development of tools that go beyond raw counting, such as VC dimension, Rademacher complexity, and covering numbers, which measure the effective size of a hypothesis class.
We now apply the union bound to derive the foundational generalization guarantee for finite hypothesis classes—the starting point for all of PAC learning.
The Learning Setup:
Let $\mathcal{H} = \{h_1, h_2, \ldots, h_M\}$ be a finite hypothesis class with $|\mathcal{H}| = M$ hypotheses. Data $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ is drawn i.i.d. from an unknown distribution $\mathcal{D}$.
For each hypothesis $h \in \mathcal{H}$, define the true risk $R(h)$ and the empirical risk $\hat{R}(h)$. For binary classification with 0-1 loss: $$R(h) = \mathbb{P}_{(x,y) \sim \mathcal{D}}[h(x) \neq y], \qquad \hat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}[h(x_i) \neq y_i]$$
The Fundamental Question:
We want to bound the probability that the empirical risk $\hat{R}(h)$ deviates significantly from the true risk $R(h)$ for any hypothesis $h \in \mathcal{H}$.
Step 1: Define Bad Events
For each hypothesis $h_i$ and tolerance $\epsilon > 0$, define the 'bad event': $$A_i = \{|\hat{R}(h_i) - R(h_i)| > \epsilon\}$$
This is the event that hypothesis $h_i$ has empirical risk that deviates from its true risk by more than $\epsilon$.
Step 2: Apply the Union Bound
The event 'some hypothesis has large deviation' is: $$\bigcup_{i=1}^{M} A_i = \{\exists h \in \mathcal{H}: |\hat{R}(h) - R(h)| > \epsilon\}$$
By the union bound: $$\mathbb{P}\left(\exists h \in \mathcal{H}: |\hat{R}(h) - R(h)| > \epsilon\right) \leq \sum_{i=1}^{M} \mathbb{P}(|\hat{R}(h_i) - R(h_i)| > \epsilon)$$
Notice the magic here: we've converted a statement about 'any hypothesis in the class' (which depends on the entire class structure) into a sum of individual statements. Each individual term can be bounded using concentration inequalities (like Hoeffding's, which we'll cover next page). The union bound is the bridge between individual and uniform guarantees.
Step 3: Bound Individual Terms
For now, assume we have a bound on individual deviations (we'll derive this rigorously with Hoeffding's inequality): $$\mathbb{P}(|\hat{R}(h_i) - R(h_i)| > \epsilon) \leq 2e^{-2m\epsilon^2}$$
This is Hoeffding's bound for the average of $m$ i.i.d. bounded random variables.
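As a preview of the next page, this assumed bound can be checked by simulation. The sketch below (the true risk of 0.3, the sample sizes, and the trial count are illustrative choices) estimates the deviation probability of an empirical mean of 0-1 losses and compares it with $2e^{-2m\epsilon^2}$:

```python
import numpy as np

rng = np.random.default_rng(2)
true_risk, epsilon, n_trials = 0.3, 0.1, 50_000

for m in [50, 100, 200]:
    # Each row: m i.i.d. 0/1 losses; the row mean is one empirical risk.
    losses = rng.random((n_trials, m)) < true_risk
    deviations = np.abs(losses.mean(axis=1) - true_risk)
    empirical = (deviations > epsilon).mean()
    hoeffding = 2 * np.exp(-2 * m * epsilon**2)
    print(f"m={m:4d}: P(|R_hat - R| > {epsilon}) ≈ {empirical:.4f} "
          f"<= Hoeffding bound {hoeffding:.4f}")
```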
Step 4: Combine via Union Bound
$$\mathbb{P}\left(\exists h \in \mathcal{H}: |\hat{R}(h) - R(h)| > \epsilon\right) \leq M \cdot 2e^{-2m\epsilon^2} = 2Me^{-2m\epsilon^2}$$
Step 5: Solve for Sample Complexity
We want this probability to be at most $\delta$. Setting $2Me^{-2m\epsilon^2} \leq \delta$ and solving for $m$:
$$e^{-2m\epsilon^2} \leq \frac{\delta}{2M}$$ $$-2m\epsilon^2 \leq \ln\left(\frac{\delta}{2M}\right)$$ $$m \geq \frac{1}{2\epsilon^2} \ln\left(\frac{2M}{\delta}\right)$$
Theorem: For a finite hypothesis class $\mathcal{H}$ with $|\mathcal{H}| = M$, with probability at least $1 - \delta$, simultaneously for all $h \in \mathcal{H}$: $$|\hat{R}(h) - R(h)| \leq \sqrt{\frac{\ln(2M/\delta)}{2m}}$$
Equivalently, $m = O\left(\frac{1}{\epsilon^2}\ln\frac{M}{\delta}\right)$ samples suffice for $\epsilon$-uniform convergence.
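The theorem translates directly into a sample-size calculator (a minimal sketch; the function names are ours):

```python
import math

def uniform_convergence_sample_size(epsilon: float, delta: float, M: int) -> int:
    """m such that 2*M*exp(-2*m*eps^2) <= delta, i.e. m >= ln(2M/delta)/(2 eps^2)."""
    return math.ceil(math.log(2 * M / delta) / (2 * epsilon**2))

def uniform_deviation(m: int, delta: float, M: int) -> float:
    """The epsilon guaranteed by m samples at confidence level 1 - delta."""
    return math.sqrt(math.log(2 * M / delta) / (2 * m))

# Example: M = 1000, eps = 0.1, delta = 0.05 -> 530 samples.
print(uniform_convergence_sample_size(0.1, 0.05, M=1000))

# High confidence is cheap: each 10x shrink in delta adds only ln(10)/(2 eps^2) samples.
for delta in [0.05, 0.005, 0.0005]:
    print(delta, uniform_convergence_sample_size(0.1, delta, M=1000))
```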
The sample complexity result $m = O\left(\frac{1}{\epsilon^2}\ln\frac{M}{\delta}\right)$ encapsulates fundamental insights about learning. Let's dissect each component.
The $1/\epsilon^2$ Dependence:
The required sample size scales inversely with the square of the accuracy $\epsilon$. This is not an artifact of our proof technique; it is fundamental.
The quadratic dependence arises from the central limit theorem: the standard error of an empirical mean decreases as $1/\sqrt{m}$, so achieving $\epsilon$ accuracy requires $m \propto 1/\epsilon^2$.
The $\ln(M)$ Dependence:
The sample complexity grows only logarithmically with the hypothesis class size! This is remarkable: doubling the class adds only $\ln(2)/(2\epsilon^2)$ additional samples, and even squaring it merely doubles the $\ln M$ term.
This logarithmic dependence is the union bound's gift to learning theory: we can consider exponentially large hypothesis classes with only linear growth in sample requirements.
| Hypothesis Class Size $M$ | $\ln(2M/\delta)$ | Required Samples $m$ (at $\epsilon = 0.1$, $\delta = 0.05$) |
|---|---|---|
| 10 | 5.99 | 300 |
| 100 | 8.29 | 415 |
| 1,000 | 10.60 | 530 |
| 10,000 | 12.90 | 645 |
| $10^6$ | 17.52 | 876 |
| $10^{12}$ | 31.32 | 1,566 |
| $10^{100}$ | 233.95 | 11,698 |
The $\ln(1/\delta)$ Dependence:
The dependence on the confidence level $\delta$ is also logarithmic: shrinking $\delta$ from $0.05$ to $0.0005$ adds only $\ln(100) \approx 4.6$ to the numerator of the bound.
High confidence is 'cheap'—we can increase confidence dramatically with modest sample increases.
While the union bound provides our first generalization guarantee, it has significant limitations that motivate the sophisticated machinery of modern statistical learning theory.
Limitation 1: Requires Finite Hypothesis Classes
The bound $2Me^{-2m\epsilon^2}$ is only meaningful when $M$ is finite. But many important hypothesis classes are infinite: linear classifiers with real-valued weights, neural networks with continuous parameters, and decision stumps with a continuous threshold all contain uncountably many hypotheses.
For infinite $M$, the union bound yields the vacuous bound $\infty$. We need a way to measure the effective size of an infinite class.
The union bound over an infinite class is useless: $\sum_{i=1}^{\infty} p_i$ diverges unless $p_i \to 0$ fast enough. Even for very small individual probabilities, an infinite sum can diverge. VC theory addresses this by showing that infinite classes often 'behave like' finite classes of bounded effective size.
Limitation 2: Ignores Hypothesis Similarity
The union bound treats all hypotheses as independent entities. But in reality, similar hypotheses have correlated errors: two decision stumps whose thresholds differ only slightly misclassify nearly the same points, so their bad events nearly coincide.
The union bound doesn't 'notice' this correlation and charges the full $\mathbb{P}(A_1) + \mathbb{P}(A_2)$ even when $A_1 \approx A_2$.
Approaches that exploit this structure, such as Rademacher complexity and covering numbers, charge only for the number of genuinely distinct behaviors in the class.
Limitation 3: Worst-Case Over Distributions
The bound holds uniformly over all possible data distributions $\mathcal{D}$. But for specific 'nice' distributions (e.g., low noise or large margins), tighter bounds, and sometimes faster rates, are possible.
Limitation 4: Multiplicative Overhead
The union bound introduces a multiplicative factor of $M$ (equivalently, an additive $\ln M$ term in the exponent). For very large but finite classes, this can be substantial: while logarithmic growth is remarkable, the $\ln M$ term can still dominate the bound for highly expressive classes.
The Path Forward:
These limitations set the agenda for the remainder of this module: concentration inequalities for individual deviations (next page), VC dimension for infinite classes, Rademacher complexity for data-dependent bounds, margin-based analysis for geometric structure, and PAC-Bayes for algorithm-specific guarantees.
Let's apply the union bound to a concrete hypothesis class: decision stumps. This example illustrates the full machinery in action.
Setup:
Consider binary classification over $\mathcal{X} = [0, 1]$ (unit interval). A decision stump is a classifier of the form:
$$h_{\theta,s}(x) = \begin{cases} +1 & \text{if } s \cdot x \geq s \cdot \theta \\ -1 & \text{otherwise} \end{cases}$$
where $\theta \in [0, 1]$ is a threshold and $s \in \{+1, -1\}$ is the direction.
Problem: The continuous threshold makes the class infinite. How do we apply the union bound?
Solution: Discretization
For a training set of $m$ points $x_1 < x_2 < \ldots < x_m$, the empirical risk of a stump only changes when the threshold crosses a data point. Thus we need only consider one representative threshold from each of the $m + 1$ intervals determined by the sorted data, together with the two directions $s \in \{+1, -1\}$.
This gives $M = 2(m + 1) \leq 4m$ distinct empirical behaviors.
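This count is easy to sanity-check by enumeration (a minimal sketch with randomly drawn points; the observed count can fall slightly below $2(m+1)$ because the all-positive and all-negative labelings arise from both directions). The sample complexity computation that follows then plugs this $M$ into the union-bound recipe:

```python
import numpy as np

rng = np.random.default_rng(3)
m = 8
x = np.sort(rng.uniform(size=m))

# Candidate thresholds: one below all points, one in each gap, one above all.
gaps = np.concatenate(([x[0] - 1], (x[:-1] + x[1:]) / 2, [x[-1] + 1]))

labelings = set()
for s in (+1, -1):
    for theta in gaps:
        labelings.add(tuple(np.where(s * x >= s * theta, 1, -1)))

print(f"m = {m}: distinct labelings = {len(labelings)}, "
      f"bound 2(m+1) = {2 * (m + 1)}")
```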
```python
import numpy as np

def decision_stump_sample_complexity(epsilon: float, delta: float, n_thresholds: int) -> int:
    """
    Compute sample complexity for decision stumps using the union bound.

    For decision stumps on 1D data with n_thresholds discretization points,
    the effective hypothesis class size is M = 2 * (n_thresholds + 1).

    Args:
        epsilon: Desired accuracy (|empirical - true risk| <= epsilon)
        delta: Failure probability
        n_thresholds: Number of threshold positions considered

    Returns:
        Required number of samples m

    Note:
        This is a simplified bound. In practice, the threshold positions
        depend on the data, introducing subtle dependencies.
    """
    M = 2 * (n_thresholds + 1)  # 2 directions × (n+1) positions
    # From: 2 * M * exp(-2 * m * epsilon^2) <= delta
    # Solving: m >= ln(2M / delta) / (2 * epsilon^2)
    m = np.log(2 * M / delta) / (2 * epsilon**2)
    return int(np.ceil(m))

# Example calculations
print("Decision Stump Sample Complexity (ε=0.1, δ=0.05)")
print("-" * 50)

for n_thresh in [10, 100, 1000, 10000]:
    m = decision_stump_sample_complexity(0.1, 0.05, n_thresh)
    print(f"n_thresholds = {n_thresh:,}: m = {m:,} samples")

print()
print("Note: As we refine the discretization, sample complexity")
print("grows only logarithmically in the number of thresholds!")
```

Detailed Calculation:
Let's work through the mathematics for $m = 1000$ samples, $\epsilon = 0.05$, $\delta = 0.01$.
Effective Hypothesis Class Size: $$M = 2(m + 1) = 2(1001) = 2002$$
Required Sample Complexity: $$m \geq \frac{\ln(2M/\delta)}{2\epsilon^2} = \frac{\ln(2 \cdot 2002 / 0.01)}{2 \cdot 0.05^2} = \frac{\ln(400400)}{0.005} = \frac{12.90}{0.005} = 2580$$
Interpretation: Wait—we assumed $m = 1000$ but the bound requires $m \geq 2580$! This reveals the circular nature of the problem: $M$ depends on $m$, but we need $m$ to bound $M$.
Resolving the Circularity: Since $M = 2(m+1) \leq 3m$ for $m \geq 2$, we need: $$m \geq \frac{\ln(6m/\delta)}{2\epsilon^2}$$
Solving iteratively or via the Lambert W function, we find the fixed point. For $\epsilon = 0.05$, $\delta = 0.01$: $$m \approx 2900$$
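A simple fixed-point iteration makes this concrete (a minimal sketch; the stopping tolerance is an arbitrary choice, and convergence is fast because the right-hand side grows only logarithmically in $m$):

```python
import math

def fixed_point_m(epsilon: float, delta: float, tol: float = 0.5) -> int:
    """Solve m = ln(6m / delta) / (2 eps^2) by fixed-point iteration."""
    m = 1.0
    while True:
        m_next = math.log(6 * m / delta) / (2 * epsilon**2)
        if abs(m_next - m) < tol:
            return math.ceil(m_next)
        m = m_next

print(fixed_point_m(0.05, 0.01))  # -> 2872, consistent with m ≈ 2900 above
```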
This example illustrates a subtle point: for decision stumps, the 'effective' hypothesis class size grows with the sample size. The VC dimension framework (covered in Module 3) handles this elegantly by showing that decision stumps have VC dimension 2, giving sample complexity $O\left(\frac{1}{\epsilon^2}\log\frac{1}{\delta}\right)$ with no circular dependence on the sample size itself.
We have established the union bound as the foundational tool of statistical learning theory. Let us consolidate the key insights from this page.
What's Next:
The union bound tells us how to combine individual probability bounds, but we need tools to derive those individual bounds. The next page introduces Hoeffding's inequality—the workhorse concentration inequality that bounds how empirical averages deviate from their expectations. Together, the union bound + Hoeffding's inequality form the complete foundation for finite-class PAC learning.
From there, we'll develop the sophisticated machinery needed for infinite hypothesis classes: Rademacher complexity for data-dependent bounds, margin-based analysis for geometric structure, and PAC-Bayes for algorithm-specific guarantees.
You now understand the union bound—its statement, proof, geometric intuition, and application to learning theory. You've seen how it enables the first generalization guarantees for finite hypothesis classes, and you understand its limitations that motivate more sophisticated techniques. Next, we develop Hoeffding's inequality to complete the foundation.