The generalization bounds we've developed—from the union bound through Rademacher complexity to margin-based analysis—share a common limitation: they provide guarantees for the worst-case hypothesis that could be selected. The union bound says: 'for any hypothesis you might choose, the gap is bounded.' But learning algorithms don't choose hypotheses arbitrarily; they follow specific procedures that favor certain hypotheses over others.
PAC-Bayes bounds represent a paradigm shift: instead of bounding generalization for any single hypothesis, they bound the expected generalization error of a distribution over hypotheses. This subtle change has profound implications, which we develop throughout this page.
The name 'PAC-Bayes' reflects the synthesis: PAC (Probably Approximately Correct) learning framework meets Bayesian reasoning about hypothesis distributions.
By the end of this page, you will: (1) Understand the PAC-Bayes framework and its relationship to Bayesian learning; (2) Derive the fundamental PAC-Bayes inequality; (3) Connect PAC-Bayes to KL divergence and information theory; (4) Apply PAC-Bayes to analyze practical learning algorithms; (5) Appreciate how PAC-Bayes provides non-vacuous bounds for deep networks.
PAC-Bayes bounds consider distributions over hypotheses rather than single hypotheses. We develop the framework carefully.
The Key Objects:
1. Hypothesis Space: $\mathcal{H}$ is a (potentially infinite) set of hypotheses $h: \mathcal{X} \to \mathcal{Y}$
2. Prior Distribution: $P$ is a distribution over $\mathcal{H}$ chosen before seeing data. This encodes prior beliefs about which hypotheses are plausible.
3. Posterior Distribution: $Q$ is a distribution over $\mathcal{H}$ chosen after seeing data. This is the output of learning—not a single hypothesis, but a distribution.
4. Stochastic Classifier: The learned predictor is randomized: to classify $x$, sample $h \sim Q$ and output $h(x)$. The expected error is: $$R(Q) = \mathbb{E}_{h \sim Q}[R(h)] = \mathbb{E}_{h \sim Q}\,\mathbb{E}_{(x,y) \sim \mathcal{D}}[\mathbf{1}[h(x) \neq y]]$$
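To make the stochastic (Gibbs) classifier concrete, here is a minimal sketch that estimates its error by Monte Carlo. The data, the posterior $\mathcal{N}(w^\star, \sigma_Q^2 I)$ over linear classifiers, and all numeric values are illustrative assumptions, not from the source.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification data (illustrative only)
X = rng.normal(size=(1000, 5))
w_true = rng.normal(size=5)
y = np.sign(X @ w_true + 0.3 * rng.normal(size=1000))

# Posterior Q: Gaussian around a "learned" weight vector w_star (assumption)
w_star = w_true + 0.1 * rng.normal(size=5)
sigma_q = 0.2

def gibbs_error(X, y, w_star, sigma_q, n_samples=200):
    """Monte Carlo estimate of the Gibbs error: average 0-1 error over h ~ Q."""
    errs = []
    for _ in range(n_samples):
        w = w_star + sigma_q * rng.normal(size=w_star.shape)  # sample h ~ Q
        errs.append(np.mean(np.sign(X @ w) != y))
    return float(np.mean(errs))

print(f"Estimated Gibbs error on this sample: {gibbs_error(X, y, w_star, sigma_q):.3f}")
```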
Why Distributions Over Hypotheses?
Interpretation 1: Bayesian Learning
In Bayesian learning, we start with a prior $P$ and update to a posterior $Q$ given data. PAC-Bayes provides frequentist guarantees for this Bayesian procedure.
Interpretation 2: Randomized Predictors
The distribution $Q$ represents randomized prediction: we average predictions over multiple hypotheses, often improving robustness.
Interpretation 3: Algorithm Analysis
A learning algorithm's output can be viewed as more than a single hypothesis: the learned parameters together with the training randomness define a distribution over hypotheses. For neural networks, $Q$ might be a Gaussian around the learned weights.
Interpretation 4: Compression
A 'tight' distribution $Q$ (concentrated on few hypotheses) indicates that the algorithm has found a simple, compressible solution. PAC-Bayes rewards compression.
The prior $P$ must be chosen before seeing data—this is crucial for validity. A common choice is a Gaussian prior: $P = \mathcal{N}(0, \sigma_P^2 I)$ on weights. The prior encodes 'simplicity bias'—preference for hypotheses with small norms or near a reference point.
| Aspect | Standard PAC | PAC-Bayes |
|---|---|---|
| Output | Single hypothesis $h$ | Distribution $Q$ over $\mathcal{H}$ |
| Guarantee | $R(h) \leq \hat{R}(h) + \text{complexity}$ | $R(Q) \leq \hat{R}(Q) + \sqrt{\text{KL}(Q||P)/m}$ (up to log factors) |
| Complexity Measure | VC-dim, Rademacher | KL divergence from prior |
| Prior Knowledge | Implicit in $\mathcal{H}$ | Explicit prior $P$ |
| Tightness for DL | Often vacuous | Can be non-vacuous |
We now state and prove the fundamental PAC-Bayes inequality, the cornerstone of the theory.
Theorem (McAllester's PAC-Bayes Bound): Let $P$ be any prior distribution over $\mathcal{H}$ chosen before seeing data. For any $\delta > 0$, with probability at least $1 - \delta$ over the draw of $S \sim \mathcal{D}^m$, simultaneously for all distributions $Q$ over $\mathcal{H}$:
$$\mathbb{E}_{h \sim Q}[R(h)] \leq \mathbb{E}_{h \sim Q}[\hat{R}(h)] + \sqrt{\frac{D_{\text{KL}}(Q \| P) + \ln(2\sqrt{m}/\delta)}{2m}}$$
where $D_{\text{KL}}(Q \| P) = \mathbb{E}_{h \sim Q}\left[\ln \frac{Q(h)}{P(h)}\right]$ is the KL divergence between the posterior $Q$ and the prior $P$.
Interpretation:
The bound says the generalization gap of a posterior $Q$ is controlled by the KL divergence $D_{\text{KL}}(Q \| P)$ plus a logarithmic confidence term, both divided by the sample size $m$.
The KL divergence plays the role of 'complexity'—it measures how much information about the data is encoded in $Q$.
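As a quick numerical illustration with assumed values (not from the source): take $m = 10{,}000$, $D_{\text{KL}}(Q \| P) = 100$, and $\delta = 0.05$. The complexity term is then
$$\sqrt{\frac{100 + \ln(2\sqrt{10000}/0.05)}{2 \cdot 10000}} = \sqrt{\frac{100 + \ln 4000}{20000}} \approx \sqrt{\frac{108.3}{20000}} \approx 0.074,$$
so the expected true error of the posterior exceeds its expected training error by at most about 7.4 percentage points, with probability at least 0.95.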
Proof Outline:
Step 1: Change of Measure
For any function $\phi(h)$: $$\mathbb{E}_{h \sim Q}[\phi(h)] = \mathbb{E}_{h \sim P}\left[\frac{Q(h)}{P(h)} \phi(h)\right]$$
This converts expectations under $Q$ to expectations under $P$ (which is data-independent).
Step 2: Donsker-Varadhan Variational Formula
For any random variable $X(h)$ under $P$: $$\ln \mathbb{E}_{h \sim P}[e^{X(h)}] = \sup_Q \left\{\mathbb{E}_{h \sim Q}[X(h)] - D_{\text{KL}}(Q \| P)\right\}$$
Rearranging, for any $Q$: $$\mathbb{E}_{h \sim Q}[X(h)] \leq D_{\text{KL}}(Q \| P) + \ln \mathbb{E}_{h \sim P}[e^{X(h)}]$$
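A minimal numerical check of the variational formula on a finite hypothesis set (all values are illustrative): the supremum is attained by the Gibbs distribution $Q^* \propto P\, e^{X}$, and any other $Q$ gives a smaller value.

```python
import numpy as np

rng = np.random.default_rng(1)

# Finite hypothesis set: uniform prior P over 5 hypotheses,
# X(h) is an arbitrary function of the hypothesis (illustrative values).
P = np.full(5, 0.2)
X = rng.normal(size=5)

def kl(Q, P):
    return np.sum(Q * np.log(Q / P))

lhs = np.log(np.sum(P * np.exp(X)))  # ln E_P[e^X]

# The supremum is attained by the Gibbs distribution Q* proportional to P * e^X
Q_star = P * np.exp(X) / np.sum(P * np.exp(X))
rhs_at_optimum = np.sum(Q_star * X) - kl(Q_star, P)

# Any other Q satisfies E_Q[X] - KL(Q||P) <= ln E_P[e^X]
Q_other = rng.dirichlet(np.ones(5))
rhs_other = np.sum(Q_other * X) - kl(Q_other, P)

print(f"ln E_P[e^X]          = {lhs:.6f}")
print(f"value at optimal Q*  = {rhs_at_optimum:.6f}")  # matches lhs
print(f"value at arbitrary Q = {rhs_other:.6f}")       # <= lhs
```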
Step 3: Apply to Generalization Gap
Let $X(h) = 2m (R(h) - \hat{R}(h))^2$. A standard concentration lemma gives $\mathbb{E}_S\,\mathbb{E}_{h \sim P}[e^{X(h)}] \leq 2\sqrt{m}$, so by Markov's inequality, with probability at least $1 - \delta$, $\ln \mathbb{E}_{h \sim P}[e^{X(h)}] \leq \ln(2\sqrt{m}/\delta)$. Plugging into the variational formula: $$\mathbb{E}_{h \sim Q}[(R(h) - \hat{R}(h))^2] \leq \frac{D_{\text{KL}}(Q \| P) + \ln(2\sqrt{m}/\delta)}{2m}$$
Jensen's inequality, $\big(\mathbb{E}_{h \sim Q}[R(h) - \hat{R}(h)]\big)^2 \leq \mathbb{E}_{h \sim Q}[(R(h) - \hat{R}(h))^2]$, and taking square roots complete the proof. $\square$
The PAC-Bayes bound holds simultaneously for all posteriors $Q$—there's no union bound over hypothesis classes! The cost of flexibility is paid through KL divergence: extreme posteriors (very different from the prior) pay a higher complexity penalty. This enables analyzing infinite-dimensional hypothesis classes like neural networks.
McAllester's bound is foundational but admits several refinements for practical applications.
Catoni's Bound (Tighter, Non-Square-Root Form):
With probability $\geq 1 - \delta$, simultaneously for all distributions $Q$:
$$D_{\text{kl}}\left(\mathbb{E}_{h \sim Q}[\hat{R}(h)] \,\Big\|\, \mathbb{E}_{h \sim Q}[R(h)]\right) \leq \frac{D_{\text{KL}}(Q \| P) + \ln(2\sqrt{m}/\delta)}{m}$$
where $D_{\text{kl}}(q \| p) = q \ln(q/p) + (1-q)\ln((1-q)/(1-p))$ is the binary KL divergence.
Solving for $R(Q)$ gives tighter bounds than McAllester for small training errors.
Seeger's Bound (Direct Binary KL Inversion):
For bounded losses in $[0, 1]$, with probability $\geq 1 - \delta$:
$$D_{\text{kl}}\left(\hat{R}(Q) \,\|\, R(Q)\right) \leq \frac{D_{\text{KL}}(Q \| P) + \ln(2\sqrt{m}/\delta)}{m}$$
Inverting this (numerically) gives the tightest known PAC-Bayes bounds.
Oracle PAC-Bayes (Data-Dependent Priors):
A subtle but powerful extension: split the sample, use the first part to construct the prior (for example, to choose its center), and apply the PAC-Bayes bound using only the second part. Because the prior never depends on the data used in the bound, validity is preserved, and the posterior typically stays much closer to this data-informed prior, shrinking the KL term (a sketch follows below).
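Here is a minimal sketch of the splitting idea, using ordinary least squares as a stand-in training procedure on synthetic data (everything here is an illustrative assumption, not the method of any particular paper): the prior is centered on weights fit to the first chunk, and the KL term is measured against that center rather than against zero.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic regression data; least squares keeps the sketch self-contained.
X = rng.normal(size=(2000, 20))
w_true = rng.normal(size=20)
y = X @ w_true + 0.5 * rng.normal(size=2000)

# Split: the first chunk builds the prior, the second chunk is used for the bound.
X1, y1 = X[:1000], y[:1000]
X2, y2 = X[1000:], y[1000:]

w_prior = np.linalg.lstsq(X1, y1, rcond=None)[0]  # prior center from chunk 1
w_post = np.linalg.lstsq(X, y, rcond=None)[0]     # posterior center from all data

sigma = 0.5  # shared isotropic std for prior and posterior (assumption)

# With equal variances, KL(N(w_post, s^2 I) || N(w_prior, s^2 I)) = ||w_post - w_prior||^2 / (2 s^2)
kl_data_dependent = np.sum((w_post - w_prior) ** 2) / (2 * sigma**2)
kl_zero_centered = np.sum(w_post ** 2) / (2 * sigma**2)

m2 = len(y2)  # only the second chunk counts toward m in the bound
print(f"KL with data-dependent prior: {kl_data_dependent:.2f}")
print(f"KL with zero-centered prior:  {kl_zero_centered:.2f}")
print(f"Effective sample size m:      {m2}")
```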
PAC-Bayes with Unbounded Losses:
Extensions handle regression (squared loss) and other unbounded losses by truncating or using sub-Gaussian assumptions.
Comparison of Bounds:
| Bound | Form | Best For |
|---|---|---|
| McAllester | $R(Q) \leq \hat{R}(Q) + \sqrt{\frac{\text{KL} + O(\log m)}{2m}}$ | Simplicity, interpretability |
| Catoni | Implicit via binary KL | Small training error, tight |
| Seeger | Binary KL inversion | Tightest theoretical bounds |
| Oracle | Data-dependent prior | Practical deep learning |
| Langford-Caruana | Holdout-validated | When validation available |
For the tightest bounds, one inverts the binary KL inequality numerically. Given $\hat{R}$ and $\text{KL}/m$, find the largest $R$ such that $D_{\text{kl}}(\hat{R} \,\|\, R) \leq \text{KL}/m$. This is a simple root-finding problem yielding much tighter bounds than the closed-form square-root version.
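As a rough numerical comparison with assumed values: suppose $\hat{R}(Q) = 0.01$ and $(D_{\text{KL}}(Q \| P) + \ln(2\sqrt{m}/\delta))/m = 0.05$. Numerically, the largest $R$ with $D_{\text{kl}}(0.01 \,\|\, R) \leq 0.05$ is about $0.078$, whereas the square-root form gives roughly
$$0.01 + \sqrt{0.05/2} \approx 0.17,$$
more than twice as large.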
The KL divergence $D_{\text{KL}}(Q | P)$ is the central quantity in PAC-Bayes. Understanding its role provides deep insight.
Information-Theoretic Interpretation:
KL divergence measures the 'information gained' when updating from prior $P$ to posterior $Q$: $$D_{\text{KL}}(Q | P) = \mathbb{E}_{h \sim Q}\left[\ln \frac{Q(h)}{P(h)}\right]$$
Coding Interpretation:
$D_{\text{KL}}(Q | P)$ is the extra bits needed to describe a hypothesis from $Q$ using a code optimized for $P$. Small KL means the posterior is 'efficient' relative to the prior.
Compression Interpretation:
PAC-Bayes rewards solutions that can be compressed: a posterior that can be described with few bits relative to the prior has a small KL divergence, and hence a small complexity penalty.
For Gaussian Prior/Posterior (Neural Networks):
In deep learning applications, the common choices are a Gaussian prior $P = \mathcal{N}(0, \sigma_P^2 I)$ and a Gaussian posterior $Q = \mathcal{N}(w, \sigma_Q^2 I)$ centered at the learned weights $w$.
The KL divergence has a closed form: $$D_{\text{KL}}(Q \| P) = \frac{d}{2}\left(\frac{\sigma_Q^2}{\sigma_P^2} - 1 - \ln\frac{\sigma_Q^2}{\sigma_P^2}\right) + \frac{\|w\|^2}{2\sigma_P^2}$$
where $d$ is the number of parameters.
Key Observation: The term $\|w\|^2 / (2\sigma_P^2)$ is exactly L2 regularization! PAC-Bayes provides a principled interpretation: L2 regularization bounds the KL divergence from an isotropic prior, which bounds generalization.
Practical Implication: For a network with $d$ parameters, $\|w\|^2$ bounded, and $\sigma_Q \to 0$ (deterministic): $$D_{\text{KL}}(Q \| P) \approx \frac{\|w\|^2}{2\sigma_P^2} + \frac{d}{2}\ln\frac{\sigma_P^2}{\sigma_Q^2}$$
Note that $d$ enters only through the term $\frac{d}{2}\ln(\sigma_P^2/\sigma_Q^2)$: the logarithmic factor can be kept small by choosing $\sigma_Q$ as close to $\sigma_P$ as accuracy allows, and this, together with a controlled weight norm, is what enables non-vacuous bounds for large networks!
Standard bounds (VC, Rademacher) scale with parameter count $d$, giving vacuous bounds (> 1) for modern networks with millions of parameters. The PAC-Bayes KL divergence scales as $\frac{d}{2}\ln(\sigma_P^2/\sigma_Q^2) + \|w\|^2/(2\sigma_P^2)$. With a large prior variance $\sigma_P$, a posterior noise level $\sigma_Q$ chosen as large as accuracy allows, and controlled $\|w\|^2$ (regularization), the KL divergence can be far smaller than $m$ even for large $d$.
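To see the tradeoff concretely, here is a minimal sketch (all values are assumptions chosen for illustration) that evaluates the closed-form KL above for a large $d$ across several posterior noise levels $\sigma_Q$; the complexity term only becomes small when $\sigma_Q$ approaches $\sigma_P$.

```python
import numpy as np

# Illustrative exploration with assumed values.
d = 1_000_000       # number of parameters
w_norm_sq = 400.0   # ||w||^2 after regularization (assumption)
sigma_p = 1.0       # prior std
m = 50_000          # training set size
delta = 0.05

def gaussian_kl(d, w_norm_sq, sigma_q, sigma_p):
    """KL(N(w, sigma_q^2 I) || N(0, sigma_p^2 I)) via the closed form above."""
    ratio = sigma_q**2 / sigma_p**2
    return 0.5 * d * (ratio - 1 - np.log(ratio)) + w_norm_sq / (2 * sigma_p**2)

for sigma_q in [0.01, 0.1, 0.5, 0.9, 0.99]:
    kl = gaussian_kl(d, w_norm_sq, sigma_q, sigma_p)
    complexity = np.sqrt((kl + np.log(2 * np.sqrt(m) / delta)) / (2 * m))
    print(f"sigma_q = {sigma_q:4.2f}:  KL = {kl:12.1f},  "
          f"McAllester complexity term = {complexity:8.3f}")
```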
PAC-Bayes provides some of the first non-trivial generalization guarantees for deep neural networks. Let's explore how.
The Challenge:
Deep networks have millions of parameters, and standard bounds (VC dimension, Rademacher complexity) scale with the parameter count $d$.
For VGG-16 ($\sim 138$ million parameters), these exceed 1—vacuous!
PAC-Bayes Approach:
Step 1: Choose a Prior $$P = \mathcal{N}(0, \sigma_P^2 I)$$ with $\sigma_P$ chosen to reflect prior belief about weight scale.
Step 2: Define Posterior Around Learned Weights $$Q = \mathcal{N}(w^*, \sigma_Q^2 I)$$ where $w^*$ is the trained network's weights.
Step 3: Bound Stochastic Classifier Error
The stochastic classifier samples $w \sim Q$ and predicts. The expected error can be bounded using PAC-Bayes.
The Compression Perspective (Arora et al., 2018):
A key insight: neural networks can often be compressed without losing accuracy:
If a network can be compressed from $d$ parameters to effectively $k < d$ parameters, the PAC-Bayes bound depends on $k$ rather than $d$.
Theorem (Informal): If a neural network can be compressed to use $k$ effective bits (through quantization, pruning, etc.) while maintaining accuracy $\hat{R}$, then with high probability: $$R \leq \hat{R} + O\left(\sqrt{\frac{k + \log(1/\delta)}{m}}\right)$$
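To connect bits to bounds, here is a minimal sketch of the finite-class (Occam-style) arithmetic behind this informal statement, under the assumption that a compressed model is fully described by $k$ bits, so there are at most $2^k$ candidate hypotheses; the parameter counts and bit widths below are illustrative only.

```python
import numpy as np

def occam_compression_bound(train_error, k_bits, m, delta=0.05):
    """Finite-class style bound for a model describable in k_bits:
    R <= R_hat + sqrt((k_bits * ln(2) + ln(1/delta)) / (2 * m))."""
    return train_error + np.sqrt((k_bits * np.log(2) + np.log(1 / delta)) / (2 * m))

m = 50_000
train_error = 0.02

# Raw float32 parameters vs. an aggressively compressed description (assumed sizes).
for name, k_bits in [("raw float32 parameters (1M x 32 bits)", 1_000_000 * 32),
                     ("aggressively compressed (5,000 bits)", 5_000)]:
    print(f"{name}: bound = {occam_compression_bound(train_error, k_bits, m):.3f}")
```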
Noise Stability (Dziugaite & Roy, 2017):
Alternatively, if a network is stable to noise injection (adding Gaussian noise to weights doesn't hurt accuracy much), then a posterior $Q$ with substantial variance still achieves low empirical error, which keeps the KL divergence to the prior small.
Dziugaite & Roy (2017) achieved a non-vacuous PAC-Bayes bound of ~17% error on MNIST for a 3-layer network (vs. ~2% test error). While not tight, this was the first non-trivial bound for a neural network! Subsequent work has achieved ~5-10% bounds on CIFAR-10 and other datasets.
Let's implement PAC-Bayes bound computation and see how to evaluate bounds for real models.
```python
import numpy as np
from scipy.optimize import brentq


def kl_binary(q: float, p: float) -> float:
    """Binary KL divergence: D_kl(q || p) = q*log(q/p) + (1-q)*log((1-q)/(1-p))."""
    if q == 0:
        return -np.log(1 - p) if p < 1 else np.inf
    if q == 1:
        return -np.log(p) if p > 0 else np.inf
    if p == 0 or p == 1:
        return np.inf
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))


def kl_gaussian(mu_q: np.ndarray, sigma_q: float,
                mu_p: np.ndarray, sigma_p: float) -> float:
    """KL divergence between diagonal Gaussians:
    D_KL(N(mu_q, sigma_q^2 I) || N(mu_p, sigma_p^2 I)).
    """
    d = len(mu_q)
    diff = mu_q - mu_p
    kl = 0.5 * (
        d * (sigma_q**2 / sigma_p**2 - 1 - np.log(sigma_q**2 / sigma_p**2))
        + np.sum(diff**2) / sigma_p**2
    )
    return kl


def mcallester_bound(empirical_risk: float, kl_div: float,
                     m: int, delta: float = 0.05) -> float:
    """McAllester's PAC-Bayes bound (square-root form):
    R(Q) <= R_hat(Q) + sqrt((KL(Q||P) + ln(2*sqrt(m)/delta)) / (2*m)).
    """
    complexity = np.sqrt((kl_div + np.log(2 * np.sqrt(m) / delta)) / (2 * m))
    return min(1.0, empirical_risk + complexity)


def seeger_bound(empirical_risk: float, kl_div: float,
                 m: int, delta: float = 0.05) -> float:
    """Seeger's PAC-Bayes bound via binary KL inversion:
    find the largest R such that D_kl(R_hat || R) <= (KL + ln(2*sqrt(m)/delta)) / m.
    """
    bound_rhs = (kl_div + np.log(2 * np.sqrt(m) / delta)) / m
    if bound_rhs >= 1:  # Vacuous
        return 1.0

    def objective(R):
        if R <= empirical_risk:
            return -np.inf
        return kl_binary(empirical_risk, R) - bound_rhs

    try:
        # Root-finding for the largest R satisfying the binary KL inequality
        return brentq(objective, empirical_risk + 1e-10, 1 - 1e-10)
    except ValueError:
        return 1.0  # Vacuous


def evaluate_pac_bayes(train_error: float, test_error: float,
                       weights: np.ndarray, sigma_q: float, sigma_p: float,
                       n_train: int, delta: float = 0.05) -> dict:
    """Evaluate PAC-Bayes bounds for a neural network.

    Args:
        train_error: Empirical training error (0-1 loss).
        test_error: True test error (for comparison only).
        weights: Flattened weight vector of the network.
        sigma_q: Standard deviation of the posterior perturbation.
        sigma_p: Standard deviation of the prior (centered at 0).
        n_train: Number of training samples.
        delta: Confidence parameter.
    """
    # Compute KL divergence (prior centered at 0)
    mu_p = np.zeros_like(weights)
    kl = kl_gaussian(weights, sigma_q, mu_p, sigma_p)

    # Compute bounds
    mcallester = mcallester_bound(train_error, kl, n_train, delta)
    seeger = seeger_bound(train_error, kl, n_train, delta)

    return {
        'kl_divergence': kl,
        'mcallester_bound': mcallester,
        'seeger_bound': seeger,
        'test_error': test_error,
        'is_non_vacuous': seeger < 0.5,  # Meaningful if < random guessing
    }


# Demonstration
print("PAC-Bayes Bound Evaluation")
print("=" * 60)

# Simulated neural network scenarios
scenarios = [
    {'name': 'Small network, well-regularized',
     'd': 10000, 'w_norm': 10, 'train_err': 0.02, 'test_err': 0.05},
    {'name': 'Large network, overfit',
     'd': 1000000, 'w_norm': 100, 'train_err': 0.0, 'test_err': 0.15},
    {'name': 'Large network, regularized',
     'd': 1000000, 'w_norm': 20, 'train_err': 0.01, 'test_err': 0.08},
]

n_train = 50000
sigma_p = 10.0  # Prior std
sigma_q = 0.1   # Posterior perturbation std

for scenario in scenarios:
    # Simulate a weight vector with the given norm
    d = scenario['d']
    weights = np.random.randn(d)
    weights = weights / np.linalg.norm(weights) * scenario['w_norm']

    result = evaluate_pac_bayes(
        scenario['train_err'], scenario['test_err'],
        weights, sigma_q, sigma_p, n_train
    )

    print(f"\n{scenario['name']}")
    print(f"  Parameters: {d:,}, ||w||: {scenario['w_norm']}")
    print(f"  Train error: {scenario['train_err']:.2%}")
    print(f"  Test error: {scenario['test_err']:.2%}")
    print(f"  KL divergence: {result['kl_divergence']:.1f}")
    print(f"  McAllester bound: {result['mcallester_bound']:.2%}")
    print(f"  Seeger bound: {result['seeger_bound']:.2%}")
    print(f"  Non-vacuous: {result['is_non_vacuous']}")
```

For tighter PAC-Bayes bounds: (1) Use a large prior variance $\sigma_P$ to reduce $\|w\|^2/\sigma_P^2$; (2) Regularize during training to keep $\|w\|$ small; (3) Use noise injection to find the largest $\sigma_Q$ that preserves accuracy; (4) Consider data-dependent priors (using a held-out set); (5) Use binary KL inversion (Seeger) instead of square-root bounds.
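As a companion to item (3) above, here is a minimal sketch of a noise-stability search: find the largest $\sigma_Q$ whose perturbed training error stays within a tolerance. The model (a stand-in linear classifier on synthetic data), the tolerance, and the search range are all illustrative assumptions, and the search is approximate because the error estimate is stochastic.

```python
import numpy as np

rng = np.random.default_rng(4)

# Stand-in "trained" linear model on synthetic data (illustrative only).
X = rng.normal(size=(2000, 30))
w_star = rng.normal(size=30)
y = np.sign(X @ w_star)

def perturbed_error(sigma_q, n_samples=50):
    """Monte Carlo estimate of training error under weight perturbations w* + sigma_q * noise."""
    errs = [np.mean(np.sign(X @ (w_star + sigma_q * rng.normal(size=30))) != y)
            for _ in range(n_samples)]
    return float(np.mean(errs))

def largest_stable_sigma(tolerance=0.05, lo=1e-3, hi=10.0, iters=20):
    """Approximate binary search for the largest sigma_q with perturbed error <= tolerance."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)  # geometric midpoint since scales span decades
        if perturbed_error(mid) <= tolerance:
            lo = mid
        else:
            hi = mid
    return lo

sigma_q = largest_stable_sigma()
print(f"Largest sigma_q with perturbed training error <= 5%: {sigma_q:.3f}")
```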
PAC-Bayes connects to many important concepts in machine learning theory.
Connection 1: Minimum Description Length (MDL)
The PAC-Bayes bound can be rewritten as: $$R(Q) \leq \hat{R}(Q) + \sqrt{\frac{L_P(Q) + O(\log m)}{m}}$$
where $L_P(Q) = D_{\text{KL}}(Q \| P)$ plays the role of the description length of the posterior relative to the prior. The MDL principle says to prefer hypotheses with short descriptions; PAC-Bayes formalizes this.
Connection 2: Bayesian Inference
The Bayesian posterior $Q^* = P(h \mid S) \propto P(S \mid h) P(h)$ minimizes a combination of empirical risk and KL divergence: $$Q^* = \arg\min_Q \left\{\mathbb{E}_{h \sim Q}[\hat{R}(h)] + \frac{1}{m} D_{\text{KL}}(Q \| P)\right\}$$
This is the Gibbs posterior. PAC-Bayes shows it has optimal generalization guarantees!
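A minimal sketch on a finite hypothesis class (1-D threshold classifiers on synthetic data; all values are illustrative): the minimizer of the objective above has the closed form $Q^*(h) \propto P(h)\, e^{-m \hat{R}(h)}$, which we compute directly.

```python
import numpy as np

rng = np.random.default_rng(5)

# Data: 1-D threshold classification (illustrative)
x = rng.uniform(-1, 1, size=200)
y = (x > 0.1).astype(int)
m = len(x)

# Finite hypothesis class: thresholds on a grid, uniform prior P
thresholds = np.linspace(-1, 1, 201)
P = np.full(len(thresholds), 1 / len(thresholds))

# Empirical risk of each hypothesis h_t(x) = 1[x > t]
R_hat = np.array([np.mean((x > t).astype(int) != y) for t in thresholds])

# Gibbs posterior: Q*(h) proportional to P(h) * exp(-m * R_hat(h))
log_unnorm = np.log(P) - m * R_hat
Q = np.exp(log_unnorm - log_unnorm.max())
Q /= Q.sum()

kl = np.sum(Q * (np.log(Q + 1e-300) - np.log(P)))
print(f"Posterior concentrates near t = {thresholds[np.argmax(Q)]:.2f}")
print(f"E_Q[R_hat] = {np.sum(Q * R_hat):.4f},  KL(Q||P) = {kl:.2f}")
```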
Connection 3: Regularization
For Gaussian posteriors, the KL divergence includes an L2 norm term, so weight decay can be read as directly controlling the PAC-Bayes complexity penalty.
Extension 1: PAC-Bayes + Margins
Combining PAC-Bayes with margin analysis gives even tighter bounds: $$R_{0-1}(Q) \leq \hat{R}_\gamma(Q) + \frac{1}{\gamma}\sqrt{\frac{D_{\text{KL}}(Q \| P) + O(\log m)}{m}}$$
Large margins reduce the complexity penalty!
Extension 2: PAC-Bayes for Gradient Descent
Recent work analyzes gradient descent through PAC-Bayes, treating the distribution over parameters induced by SGD's randomness (initialization, mini-batch order) as the posterior $Q$.
Extension 3: PAC-Bayes with Excess Risk
Instead of bounding $R(Q)$ absolutely, bound the excess risk $R(Q) - R^*$ where $R^*$ is the Bayes-optimal error: $$R(Q) - R^* \leq \hat{R}(Q) - \hat{R}^* + O\left(\sqrt{\frac{D_{\text{KL}}(Q \| P)}{m}}\right)$$
This shows how quickly we approach the best possible classifier.
A frontier research direction: SGD with specific learning rates and batch sizes converges to a posterior $Q$ that is 'close' to a data-independent prior $P$, explaining why overparameterized networks trained with SGD generalize. The optimization-dependent implicit regularization creates a 'compressible' posterior.
We have developed PAC-Bayes bounds—the frontier of generalization theory, providing algorithm-specific guarantees that can be non-vacuous for modern deep networks.
Module Complete!
You have now mastered the complete toolkit of generalization bounds in statistical learning theory: concentration inequalities and the union bound, VC dimension and Rademacher complexity, margin-based analysis, and PAC-Bayes bounds.
Together, these tools form the mathematical foundation for understanding why machine learning works—why algorithms that minimize empirical risk on finite samples can generalize to unseen data. This understanding guides algorithm design, architecture choices, regularization strategies, and the development of new learning methods.
You have completed Module 6: Generalization Bounds. You now command the full mathematical machinery of statistical learning theory—from foundational concentration inequalities through sophisticated PAC-Bayes analysis. These tools provide the theoretical guarantees that underpin all of modern machine learning, explaining when and why learning from data is possible.