In the previous page, we established the union bound as the mechanism for extending guarantees from individual hypotheses to entire hypothesis classes. But the union bound merely combines probability bounds—it doesn't create them. We need a tool that bounds how far an empirical average can deviate from its expectation.
Enter Hoeffding's inequality, one of the most important results in probability theory and the mathematical engine that powers PAC learning bounds. Named after the Finnish-born American statistician Wassily Hoeffding, this inequality establishes that empirical averages of bounded random variables concentrate exponentially fast around their expectations.
The implications are profound: with only $m$ samples, the empirical risk of a hypothesis converges to its true risk at a rate of $1/\sqrt{m}$, with exponentially small probability of large deviations. This exponential decay is what makes learning from finite samples mathematically tractable.
In this page, we develop Hoeffding's inequality from first principles, building through Markov's inequality, Chernoff bounds, and the moment generating function technique. We then apply it to derive rigorous generalization bounds that complete the foundation of PAC learning.
By the end of this page, you will: (1) Derive Hoeffding's inequality from foundational probability tools; (2) Understand why bounded random variables exhibit exponential concentration; (3) Apply Hoeffding to bound the deviation between empirical and true risk; (4) Combine Hoeffding with the union bound for complete generalization guarantees; (5) Appreciate the assumptions and limitations of Hoeffding-based bounds.
Hoeffding's inequality belongs to a family of probability bounds that exploit exponential tail decay. To appreciate its power, we trace the development from the basic Markov inequality through progressively stronger bounds.
Markov's Inequality (The Foundation):
For any non-negative random variable $X$ and $a > 0$:
$$\mathbb{P}(X \geq a) \leq \frac{\mathbb{E}[X]}{a}$$
Proof: Let $\mathbf{1}_{\{X \geq a\}}$ be the indicator of the event $\{X \geq a\}$. Since $X \geq 0$, we have $X \geq a \cdot \mathbf{1}_{\{X \geq a\}}$ pointwise, so: $$\mathbb{E}[X] \geq \mathbb{E}[a \cdot \mathbf{1}_{\{X \geq a\}}] = a \cdot \mathbb{P}(X \geq a)$$
$\square$
Markov's inequality is weak (linear decay) but applies to any non-negative random variable with finite expectation. Its power comes from what we choose to plug in for $X$.
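A quick numerical sanity check makes the weakness concrete. The sketch below uses the Exponential(1) distribution as an illustrative choice (not from the text): it is non-negative with mean 1, so Markov bounds its tail by $1/a$, while the exact tail is $e^{-a}$.

```python
import math

# Illustrative check: X ~ Exponential(1) has E[X] = 1 and exact tail e^{-a},
# while Markov only guarantees P(X >= a) <= E[X] / a = 1 / a.
for a in [1, 2, 5, 10]:
    exact_tail = math.exp(-a)
    markov_bound = 1.0 / a  # E[X] / a with E[X] = 1
    assert exact_tail <= markov_bound
    print(f"a = {a:2d}: exact = {exact_tail:.5f}, Markov = {markov_bound:.5f}")
```

The gap widens rapidly: at $a = 10$ the exact tail is about $4.5 \times 10^{-5}$ while Markov only gives $0.1$, illustrating the linear decay the text describes.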
Chebyshev's Inequality (Using Variance):
Applying Markov to $(X - \mu)^2$ where $\mu = \mathbb{E}[X]$:
$$\mathbb{P}(|X - \mu| \geq a) = \mathbb{P}((X - \mu)^2 \geq a^2) \leq \frac{\mathbb{E}[(X-\mu)^2]}{a^2} = \frac{\text{Var}(X)}{a^2}$$
Chebyshev provides $1/a^2$ (quadratic) decay, an improvement over Markov's linear decay. But for sums of many random variables, we can do exponentially better.
The Chernoff Bound Technique (Exponential Decay):
The key insight: instead of applying Markov to $X$ directly, apply it to $e^{\lambda X}$ for suitable $\lambda > 0$.
For any $\lambda > 0$: $$\mathbb{P}(X \geq a) = \mathbb{P}(e^{\lambda X} \geq e^{\lambda a}) \leq \frac{\mathbb{E}[e^{\lambda X}]}{e^{\lambda a}}$$
The quantity $M_X(\lambda) = \mathbb{E}[e^{\lambda X}]$ is the moment generating function (MGF) of $X$. The Chernoff technique minimizes the bound over $\lambda > 0$:
$$\mathbb{P}(X \geq a) \leq \inf_{\lambda > 0} e^{-\lambda a} M_X(\lambda)$$
The exponential transformation is the key to exponential concentration. Intuitively: while $\mathbb{E}[X]$ captures the 'typical' behavior, $\mathbb{E}[e^{\lambda X}]$ heavily weights the tail. For well-behaved random variables, this weighted expectation is controlled, yielding exponentially decaying tail bounds.
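The following sketch makes the hierarchy concrete (the Binomial example and all parameters are illustrative assumptions): it minimizes the Chernoff bound over a grid of $\lambda$ values, working in log space to avoid overflow, and compares the result with Markov and Chebyshev for the same tail event.

```python
import numpy as np

m, mu = 100, 50.0  # S_m = sum of 100 fair coin flips, E[S_m] = 50
a = 70             # threshold well above the mean

# Chernoff: minimize e^{-lambda a} M(lambda) with M(lambda) = ((1 + e^lambda)/2)^m,
# the MGF of Binomial(100, 1/2); done in log space for numerical stability.
lambdas = np.linspace(1e-3, 5, 5000)
log_bound = -lambdas * a + m * np.log((1 + np.exp(lambdas)) / 2)
chernoff = np.exp(np.min(log_bound))

markov = mu / a                          # valid since S_m >= 0
chebyshev = (m * 0.25) / (a - mu) ** 2   # Var(S_m) = m/4, via the two-sided event

print(f"Markov:    {markov:.4f}")
print(f"Chebyshev: {chebyshev:.4f}")
print(f"Chernoff:  {chernoff:.2e}")
```

Markov gives roughly $0.71$, Chebyshev $0.0625$, while the optimized Chernoff bound is on the order of $10^{-4}$: exponential decay in action.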
| Inequality | Decay Rate | Requirements | Use Case |
|---|---|---|---|
| Markov | $O(1/a)$ | Non-negative, finite mean | Weak but universal |
| Chebyshev | $O(1/a^2)$ | Finite variance | Known variance |
| Chernoff/Hoeffding | $e^{-\Omega(a^2)}$ | Bounded or sub-Gaussian | Sums of independent RVs |
| Bennett | $e^{-\Omega(a^2/v)}$ | Bounded variance $v$ | Variance-aware bounds |
To apply the Chernoff technique to bounded random variables, we need to control the moment generating function. Hoeffding's lemma provides this control.
Hoeffding's Lemma: Let $X$ be a random variable with $\mathbb{E}[X] = 0$ and $a \leq X \leq b$ almost surely. Then for all $\lambda \in \mathbb{R}$:
$$\mathbb{E}[e^{\lambda X}] \leq \exp\left(\frac{\lambda^2 (b-a)^2}{8}\right)$$
This bound is remarkable: the MGF of any zero-mean bounded random variable is controlled solely by the range $b - a$, regardless of the distribution's shape within that range.
Proof of Hoeffding's Lemma:
We present a rigorous proof exploiting the convexity of the exponential function.
Step 1: By convexity of $e^{\lambda x}$, for any $x \in [a, b]$: $$e^{\lambda x} \leq \frac{b - x}{b - a} e^{\lambda a} + \frac{x - a}{b - a} e^{\lambda b}$$
This expresses $e^{\lambda x}$ as a linear interpolation between the endpoint values.
Step 2: Taking expectations (using $\mathbb{E}[X] = 0$): $$\mathbb{E}[e^{\lambda X}] \leq \frac{b - \mathbb{E}[X]}{b - a} e^{\lambda a} + \frac{\mathbb{E}[X] - a}{b - a} e^{\lambda b} = \frac{b}{b-a} e^{\lambda a} + \frac{-a}{b-a} e^{\lambda b}$$
Step 3: Let $p = -a/(b-a)$, so $1-p = b/(b-a)$. Then: $$\mathbb{E}[e^{\lambda X}] \leq (1-p) e^{\lambda a} + p \cdot e^{\lambda b}$$
Step 4: Define $h = \lambda(b-a)$ and $L(h) = -p \cdot h + \ln((1-p) + p \cdot e^h)$. Then: $$\mathbb{E}[e^{\lambda X}] \leq e^{L(h)}$$
Step 5: We bound $L(h) \leq h^2/8$ using Taylor expansion.
$L(0) = 0$, and $L'(h) = -p + \frac{p e^h}{(1-p) + p e^h}$, so $L'(0) = -p + p = 0$. Furthermore: $$L''(h) = \frac{p(1-p)e^h}{((1-p) + pe^h)^2} \leq \frac{1}{4}$$
The last inequality uses $uv \leq (u+v)^2/4$ for $u, v \geq 0$, applied with $u = 1-p$ and $v = pe^h$.
By Taylor's theorem with Lagrange remainder: $$L(h) = L(0) + L'(0) \cdot h + \frac{L''(\xi)}{2} h^2 \leq \frac{h^2}{8}$$
Substituting $h = \lambda(b-a)$: $$\mathbb{E}[e^{\lambda X}] \leq \exp\left(\frac{\lambda^2(b-a)^2}{8}\right)$$
$\square$
The constant 8 in the denominator cannot be improved: it is approached (in the small-$\lambda$ limit) by a symmetric random variable taking the values $\{a, b\}$ with equal probability, where $a = -b$. Other distributions satisfy the inequality with room to spare, but Hoeffding's lemma provides a universal bound valid for all bounded distributions.
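The lemma is easy to probe numerically. A minimal sketch, assuming an illustrative skewed two-point distribution (mass $p$ at $b$ and $1-p$ at $a$, with $p$ chosen so the mean is zero, matching Step 3 of the proof):

```python
import numpy as np

# Two-point distribution: X = b w.p. p, X = a w.p. 1 - p, with E[X] = 0.
a, b = -0.1, 0.9
p = -a / (b - a)  # p = 0.1 makes the mean zero; range b - a = 1

lambdas = np.linspace(-4, 4, 81)
mgf = p * np.exp(lambdas * b) + (1 - p) * np.exp(lambdas * a)   # exact MGF
lemma = np.exp(lambdas**2 * (b - a)**2 / 8)                     # Hoeffding's lemma

assert np.all(mgf <= lemma)
print("max MGF / bound ratio:", (mgf / lemma).max())
```

The ratio stays below 1 for every $\lambda$, as the lemma guarantees; for this skewed distribution there is visible slack.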
We now state and prove Hoeffding's inequality in its full generality, then specialize to the forms most useful for machine learning.
Theorem (Hoeffding's Inequality): Let $X_1, \ldots, X_m$ be independent random variables with $a_i \leq X_i \leq b_i$ almost surely. Let $S_m = \sum_{i=1}^m X_i$ and $\mu = \mathbb{E}[S_m] = \sum_{i=1}^m \mathbb{E}[X_i]$. Then for any $t > 0$:
$$\mathbb{P}(S_m - \mu \geq t) \leq \exp\left(-\frac{2t^2}{\sum_{i=1}^m (b_i - a_i)^2}\right)$$
$$\mathbb{P}(\mu - S_m \geq t) \leq \exp\left(-\frac{2t^2}{\sum_{i=1}^m (b_i - a_i)^2}\right)$$
Combining these (union bound): $$\mathbb{P}(|S_m - \mu| \geq t) \leq 2\exp\left(-\frac{2t^2}{\sum_{i=1}^m (b_i - a_i)^2}\right)$$
Proof:
We prove the upper tail bound; the lower tail follows by symmetry (applying the same argument to $-X_i$).
Step 1 (Chernoff Technique): For any $\lambda > 0$: $$\mathbb{P}(S_m - \mu \geq t) = \mathbb{P}(e^{\lambda(S_m - \mu)} \geq e^{\lambda t}) \leq e^{-\lambda t} \mathbb{E}[e^{\lambda(S_m - \mu)}]$$
Step 2 (Independence): Since the $X_i$ are independent: $$\mathbb{E}[e^{\lambda(S_m - \mu)}] = \mathbb{E}\left[\prod_{i=1}^m e^{\lambda(X_i - \mathbb{E}[X_i])}\right] = \prod_{i=1}^m \mathbb{E}[e^{\lambda(X_i - \mathbb{E}[X_i])}]$$
Step 3 (Apply Hoeffding's Lemma): Let $Y_i = X_i - \mathbb{E}[X_i]$. Then $\mathbb{E}[Y_i] = 0$ and $Y_i \in [a_i - \mathbb{E}[X_i], b_i - \mathbb{E}[X_i]]$, which has length $b_i - a_i$.
By Hoeffding's lemma: $$\mathbb{E}[e^{\lambda Y_i}] \leq \exp\left(\frac{\lambda^2(b_i - a_i)^2}{8}\right)$$
Step 4 (Combine): $$\mathbb{P}(S_m - \mu \geq t) \leq e^{-\lambda t} \prod_{i=1}^m \exp\left(\frac{\lambda^2(b_i - a_i)^2}{8}\right) = \exp\left(-\lambda t + \frac{\lambda^2}{8} \sum_{i=1}^m (b_i - a_i)^2\right)$$
Step 5 (Optimize over λ): Let $c = \sum_{i=1}^m (b_i - a_i)^2$. Minimize $f(\lambda) = -\lambda t + \lambda^2 c/8$: $$f'(\lambda) = -t + \frac{\lambda c}{4} = 0 \implies \lambda^* = \frac{4t}{c}$$
Substituting: $$f(\lambda^*) = -\frac{4t^2}{c} + \frac{16t^2}{8c} = -\frac{4t^2}{c} + \frac{2t^2}{c} = -\frac{2t^2}{c}$$
Thus: $$\mathbb{P}(S_m - \mu \geq t) \leq \exp\left(-\frac{2t^2}{\sum_{i=1}^m (b_i - a_i)^2}\right)$$
$\square$
Corollary: For i.i.d. random variables $X_1, \ldots, X_m$ with $X_i \in [a, b]$ and mean $\mu = \mathbb{E}[X_i]$: $$\mathbb{P}\left(\left|\frac{1}{m}\sum_{i=1}^m X_i - \mu\right| \geq \epsilon\right) \leq 2\exp(-2m\epsilon^2 / (b-a)^2)$$
For $[0, 1]$-bounded variables: $\mathbb{P}(|\hat{\mu} - \mu| \geq \epsilon) \leq 2e^{-2m\epsilon^2}$
Let us develop deep intuition for what Hoeffding's inequality tells us and why it has the form it does.
The Exponential Decay:
The bound $2e^{-2m\epsilon^2}$ decays exponentially in $m\epsilon^2$: doubling the sample size squares the factor $e^{-2m\epsilon^2}$, so once the failure probability drops below 1 it collapses rapidly toward zero.
Rate of Concentration:
To achieve deviation at most $\epsilon$ with probability at least $1 - \delta$: $$2e^{-2m\epsilon^2} \leq \delta \implies m \geq \frac{\ln(2/\delta)}{2\epsilon^2}$$
Alternatively, with $m$ samples, with probability $\geq 1 - \delta$: $$|\hat{\mu} - \mu| \leq \sqrt{\frac{\ln(2/\delta)}{2m}}$$
| Accuracy $\epsilon$ | Confidence $1-\delta$ | Required Samples $m$ |
|---|---|---|
| 0.1 (10%) | 0.95 | 185 |
| 0.1 (10%) | 0.99 | 265 |
| 0.05 (5%) | 0.95 | 738 |
| 0.05 (5%) | 0.99 | 1,060 |
| 0.01 (1%) | 0.95 | 18,445 |
| 0.01 (1%) | 0.99 | 26,492 |
Why $1/\epsilon^2$ and Not $1/\epsilon$?
The $1/\epsilon^2$ dependence mirrors the Central Limit Theorem. For i.i.d. random variables, the standard deviation of the empirical mean scales as $\sigma/\sqrt{m}$, so resolving a deviation of size $\epsilon$ requires $\sigma/\sqrt{m} \lesssim \epsilon$, that is, $m \gtrsim \sigma^2/\epsilon^2$.
Hoeffding's inequality captures this CLT-like behavior with explicit, non-asymptotic bounds.
Comparison to CLT:
The CLT tells us $\sqrt{m}(\hat{\mu}_m - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$, implying: $$\mathbb{P}(|\hat{\mu}_m - \mu| \geq \epsilon) \approx 2\Phi(-\epsilon\sqrt{m}/\sigma)$$
The CLT approximation, however, is asymptotic and carries no finite-sample guarantee. Hoeffding holds for every $m$, but it can be loose:
Hoeffding uses only the range $[a, b]$, ignoring the actual variance. For a random variable concentrated near its mean (low variance), Hoeffding is pessimistic. Bennett's inequality and Bernstein's inequality provide tighter bounds by incorporating variance information.
We now apply Hoeffding's inequality to the core concern of statistical learning theory: bounding the generalization gap between empirical and true risk.
Setting:
Let $h$ be a hypothesis and $\ell$ a loss function taking values in $[0, 1]$ (e.g., 0-1 loss for classification). Given i.i.d. samples $(x_1, y_1), \ldots, (x_m, y_m) \sim \mathcal{D}$, define the true risk $R(h) = \mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(h(x), y)]$ and the empirical risk $\hat{R}(h) = \frac{1}{m}\sum_{i=1}^m \ell(h(x_i), y_i)$.
The random variables $Z_i = \ell(h(x_i), y_i)$ are i.i.d. with $Z_i \in [0, 1]$ and $\mathbb{E}[Z_i] = R(h)$.
Applying Hoeffding:
For a fixed hypothesis $h$: $$\mathbb{P}(|\hat{R}(h) - R(h)| \geq \epsilon) \leq 2e^{-2m\epsilon^2}$$
With probability at least $1 - \delta$: $$|\hat{R}(h) - R(h)| \leq \sqrt{\frac{\ln(2/\delta)}{2m}}$$
This bound holds for a hypothesis chosen before seeing the data. If we choose $h$ based on the data (e.g., ERM), the bound doesn't directly apply—$h$ is now data-dependent. The union bound resolves this by applying Hoeffding to every hypothesis, then taking a union.
Combining with Union Bound:
For a finite hypothesis class $\mathcal{H} = \{h_1, \ldots, h_M\}$, apply the union bound:
$$\mathbb{P}(\exists h \in \mathcal{H}: |\hat{R}(h) - R(h)| > \epsilon) \leq \sum_{i=1}^M \mathbb{P}(|\hat{R}(h_i) - R(h_i)| > \epsilon) \leq 2Me^{-2m\epsilon^2}$$
Theorem (Uniform Convergence for Finite Hypothesis Classes):
With probability at least $1 - \delta$, simultaneously for all $h \in \mathcal{H}$:
$$|\hat{R}(h) - R(h)| \leq \sqrt{\frac{\ln(2|\mathcal{H}|/\delta)}{2m}}$$
Proof: Set $2|\mathcal{H}|e^{-2m\epsilon^2} = \delta$ and solve for $\epsilon$: $$\epsilon = \sqrt{\frac{\ln(2|\mathcal{H}|/\delta)}{2m}}$$
$\square$
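A small helper makes the theorem tangible (a sketch; the function name is ours). Because $|\mathcal{H}|$ enters only through a logarithm, even enormous classes cost little:

```python
import math

def uniform_convergence_eps(m: int, H_size: int, delta: float) -> float:
    """Radius epsilon with sup_h |R_hat(h) - R(h)| <= epsilon w.p. >= 1 - delta."""
    return math.sqrt(math.log(2 * H_size / delta) / (2 * m))

# Growing the class from 10 to a million hypotheses barely widens the bound.
for H_size in [10, 1_000, 1_000_000]:
    eps = uniform_convergence_eps(m=10_000, H_size=H_size, delta=0.05)
    print(f"|H| = {H_size:>9,}: eps = {eps:.4f}")
```

With $m = 10{,}000$ samples, a $100{,}000$-fold increase in class size less than doubles the uniform deviation radius, which is the logarithmic dependence at work.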
Generalization Bound for ERM:
Let $\hat{h}$ be the empirical risk minimizer over $\mathcal{H}$: $$\hat{h} = \arg\min_{h \in \mathcal{H}} \hat{R}(h)$$
Let $h^* = \arg\min_{h \in \mathcal{H}} R(h)$ be the optimal hypothesis in $\mathcal{H}$.
With probability at least $1 - \delta$: $$R(\hat{h}) \leq R(h^*) + 2\sqrt{\frac{\ln(2|\mathcal{H}|/\delta)}{2m}}$$
Proof: $$\begin{aligned} R(\hat{h}) &\leq \hat{R}(\hat{h}) + \epsilon \quad \text{(uniform convergence)} \\ &\leq \hat{R}(h^*) + \epsilon \quad \text{(definition of ERM)} \\ &\leq R(h^*) + 2\epsilon \quad \text{(uniform convergence)} \end{aligned}$$
$\square$
This is the foundational PAC learning guarantee: ERM over a finite hypothesis class generalizes, with excess risk scaling as $O(\sqrt{(\ln|\mathcal{H}|)/m})$.
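A small simulation can illustrate the guarantee. The sketch below is ours and simplified: all parameters are illustrative, and it treats each hypothesis's empirical risk as an independent Binomial draw rather than computing all risks on one shared sample. It checks how often ERM's excess risk exceeds the $2\epsilon$ bound:

```python
import numpy as np

rng = np.random.default_rng(0)
M, m, delta = 50, 2000, 0.05
true_risks = rng.uniform(0.1, 0.5, size=M)      # hypothetical true risks R(h_i)

eps = np.sqrt(np.log(2 * M / delta) / (2 * m))  # uniform-convergence radius
bound = 2 * eps

failures, trials = 0, 200
for _ in range(trials):
    emp = rng.binomial(m, true_risks) / m       # empirical risks from m 0-1 losses
    erm = np.argmin(emp)                        # empirical risk minimizer
    excess = true_risks[erm] - true_risks.min() # R(h_hat) - R(h*)
    failures += excess > bound
print(f"excess-risk bound 2*eps = {bound:.4f}, violations: {failures}/{trials}")
```

Violations should occur in well under a $\delta$ fraction of trials, reflecting how conservative the distribution-free bound is.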
Hoeffding's inequality, while powerful, admits several refinements that provide tighter bounds in special cases.
Bennett's Inequality (Variance-Aware):
For independent random variables with $X_i - \mathbb{E}[X_i] \leq b$ and variance $\sigma_i^2$:
$$\mathbb{P}(S_m - \mu \geq t) \leq \exp\left(-\frac{v}{b^2} \cdot h\left(\frac{bt}{v}\right)\right)$$
where $v = \sum_i \sigma_i^2$ and $h(u) = (1+u)\ln(1+u) - u$.
Bennett's bound is tighter when variance is small relative to the range.
Bernstein's Inequality (Simplified):
A simpler approximation to Bennett: $$\mathbb{P}(S_m - \mu \geq t) \leq \exp\left(-\frac{t^2/2}{v + bt/3}\right)$$
For small $t$ (small deviations), this is approximately $\exp(-t^2/(2v))$—a Gaussian-like bound with the actual variance.
For large $t$ (large deviations), this is approximately $\exp(-3t/(2b))$—exponential in $t$ with rate depending on the bound $b$.
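A short comparison sketch (parameters illustrative) implements the two tail formulas above for a sum of $m$ variables in $[0, 1]$ and shows where Bernstein wins:

```python
import math

def hoeffding_tail(m: int, t: float) -> float:
    # Sum of m variables in [0, 1]: sum of (b_i - a_i)^2 is m.
    return math.exp(-2 * t**2 / m)

def bernstein_tail(m: int, t: float, sigma2: float, b: float = 1.0) -> float:
    v = m * sigma2  # total variance
    return math.exp(-(t**2 / 2) / (v + b * t / 3))

m, t = 10_000, 200.0  # deviation of the *sum*, not the mean
for sigma2 in [0.25, 0.01]:  # worst-case vs low per-sample variance
    print(f"sigma^2 = {sigma2}: Hoeffding = {hoeffding_tail(m, t):.2e}, "
          f"Bernstein = {bernstein_tail(m, t, sigma2):.2e}")
```

With worst-case variance ($\sigma^2 = 1/4$), Bernstein's constants make it slightly weaker than Hoeffding; with $\sigma^2 = 0.01$ it is tighter by dozens of orders of magnitude, which is exactly the variance-awareness the text describes.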
| Inequality | Bound Form | Best For | Information Used |
|---|---|---|---|
| Hoeffding | $\exp(-2t^2/(\sum(b_i-a_i)^2))$ | Universal, bounded RVs | Only range |
| Bennett | $\exp(-v h(bt/v)/b^2)$ | Low variance, bounded | Range + variance |
| Bernstein | $\exp(-t^2/(2v + 2bt/3))$ | Trade-off regimes | Range + variance |
| McDiarmid | $\exp(-2t^2/(\sum c_i^2))$ | Functions of many RVs | Bounded differences |
McDiarmid's Inequality (Bounded Differences):
A powerful generalization for functions of independent random variables.
Theorem: Let $X_1, \ldots, X_m$ be independent random variables taking values in sets $\mathcal{X}_1, \ldots, \mathcal{X}_m$. Let $f: \mathcal{X}_1 \times \ldots \times \mathcal{X}_m \to \mathbb{R}$ satisfy the bounded differences property:
$$\sup_{x_1,\ldots,x_m, x'_i} |f(x_1,\ldots,x_i,\ldots,x_m) - f(x_1,\ldots,x'_i,\ldots,x_m)| \leq c_i$$
Then: $$\mathbb{P}(f(X_1,\ldots,X_m) - \mathbb{E}[f] \geq t) \leq \exp\left(-\frac{2t^2}{\sum_{i=1}^m c_i^2}\right)$$
Application to Learning Theory:
Consider the empirical risk $\hat{R}(h) = \frac{1}{m}\sum_{i=1}^m \ell(h(x_i), y_i)$.
Changing a single sample $(x_i, y_i)$ changes $\hat{R}(h)$ by at most $1/m$ (for $[0,1]$-bounded loss).
By McDiarmid: $$\mathbb{P}(|\hat{R}(h) - \mathbb{E}[\hat{R}(h)]| \geq t) \leq 2\exp\left(-\frac{2t^2}{m \cdot (1/m)^2}\right) = 2\exp(-2mt^2)$$
This recovers Hoeffding for empirical averages!
Let us implement Hoeffding's inequality and visualize how concentration improves with sample size.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats


def hoeffding_bound(m: int, epsilon: float, a: float = 0.0, b: float = 1.0) -> float:
    """
    Compute Hoeffding bound for P(|empirical_mean - true_mean| >= epsilon).

    Args:
        m: Number of samples
        epsilon: Deviation threshold
        a, b: Range of random variables [a, b]

    Returns:
        Upper bound on deviation probability
    """
    return 2 * np.exp(-2 * m * epsilon**2 / (b - a)**2)


def hoeffding_confidence_interval(m: int, delta: float, a: float = 0.0, b: float = 1.0) -> float:
    """
    Compute epsilon such that P(|empirical - true| >= epsilon) <= delta.

    Args:
        m: Number of samples
        delta: Failure probability
        a, b: Range of random variables

    Returns:
        Width of confidence interval
    """
    return (b - a) * np.sqrt(np.log(2 / delta) / (2 * m))


def sample_complexity(epsilon: float, delta: float, a: float = 0.0, b: float = 1.0) -> int:
    """
    Compute minimum samples needed for epsilon accuracy with prob 1-delta.
    """
    return int(np.ceil((b - a)**2 * np.log(2 / delta) / (2 * epsilon**2)))


# Example: Estimating a Bernoulli probability
def simulate_hoeffding_tightness(true_p: float, m: int, n_trials: int = 10000) -> dict:
    """
    Compare empirical deviation frequency to Hoeffding bound.
    """
    samples = np.random.binomial(1, true_p, size=(n_trials, m))
    empirical_means = samples.mean(axis=1)

    epsilons = np.linspace(0.01, 0.3, 30)
    empirical_probs = []
    hoeffding_probs = []

    for eps in epsilons:
        emp_prob = np.mean(np.abs(empirical_means - true_p) >= eps)
        hoeff_prob = hoeffding_bound(m, eps)
        empirical_probs.append(emp_prob)
        hoeffding_probs.append(min(hoeff_prob, 1.0))

    return {
        'epsilons': epsilons,
        'empirical': empirical_probs,
        'hoeffding': hoeffding_probs
    }


# Demonstration
print("Hoeffding's Inequality - Sample Complexity Table")
print("=" * 55)
print(f"{'Accuracy (ε)':>15} {'Confidence':>12} {'Samples (m)':>15}")
print("-" * 55)

for eps in [0.1, 0.05, 0.01]:
    for delta in [0.1, 0.05, 0.01]:
        m = sample_complexity(eps, delta)
        print(f"{eps:>15.2f} {1-delta:>12.1%} {m:>15,}")

print()
print("Confidence Interval Width for Various Sample Sizes")
print("=" * 50)
print(f"{'Samples':>10} {'95% CI':>12} {'99% CI':>12}")
print("-" * 50)

for m in [100, 500, 1000, 5000, 10000]:
    ci_95 = hoeffding_confidence_interval(m, 0.05)
    ci_99 = hoeffding_confidence_interval(m, 0.01)
    print(f"{m:>10,} {ci_95:>12.4f} {ci_99:>12.4f}")
```

In practice, the Hoeffding bound is often conservative (larger than the true probability). This is the cost of a distribution-free bound. For specific distributions with low variance, empirical deviation probabilities can be significantly smaller than Hoeffding predicts.
We have developed Hoeffding's inequality—the mathematical engine powering concentration bounds in statistical learning theory.
What's Next:
With the union bound + Hoeffding's inequality, we can bound generalization for finite hypothesis classes. But what about infinite classes like linear classifiers or neural networks? The next page introduces Rademacher complexity—a data-dependent complexity measure that provides generalization bounds for infinite hypothesis classes by measuring how well the class can fit random noise.
You now command Hoeffding's inequality—from its derivation through Markov and Chernoff bounds, to its application in generalization theory. Combined with the union bound from the previous page, you have the complete toolkit for PAC learning with finite hypothesis classes. Next, we tackle the challenge of infinite classes with Rademacher complexity.