The generalization bounds we've developed—from the union bound through Rademacher complexity to margin-based analysis—share a common limitation: they provide guarantees for the worst-case hypothesis that could be selected. The union bound says: 'for any hypothesis you might choose, the gap is bounded.' But learning algorithms don't choose hypotheses arbitrarily; they follow specific procedures that favor certain hypotheses over others.
PAC-Bayes bounds represent a paradigm shift: instead of bounding generalization for any single hypothesis, they bound the expected generalization error of a distribution over hypotheses. This subtle change has profound implications, which we develop throughout this page.
The name 'PAC-Bayes' reflects the synthesis: PAC (Probably Approximately Correct) learning framework meets Bayesian reasoning about hypothesis distributions.
By the end of this page, you will: (1) Understand the PAC-Bayes framework and its relationship to Bayesian learning; (2) Derive the fundamental PAC-Bayes inequality; (3) Connect PAC-Bayes to KL divergence and information theory; (4) Apply PAC-Bayes to analyze practical learning algorithms; (5) Appreciate how PAC-Bayes provides non-vacuous bounds for deep networks.
PAC-Bayes bounds consider distributions over hypotheses rather than single hypotheses. We develop the framework carefully.
The Key Objects:
1. Hypothesis Space: $\mathcal{H}$ is a (potentially infinite) set of hypotheses $h: \mathcal{X} \to \mathcal{Y}$
2. Prior Distribution: $P$ is a distribution over $\mathcal{H}$ chosen before seeing data. This encodes prior beliefs about which hypotheses are plausible.
3. Posterior Distribution: $Q$ is a distribution over $\mathcal{H}$ chosen after seeing data. This is the output of learning—not a single hypothesis, but a distribution.
4. Stochastic Classifier: The learned predictor is randomized: to classify $x$, sample $h \sim Q$ and output $h(x)$. The expected error is: $$R(Q) = \mathbb{E}_{h \sim Q}[R(h)] = \mathbb{E}_{h \sim Q}\,\mathbb{E}_{(x,y) \sim \mathcal{D}}[\mathbf{1}[h(x) \neq y]]$$
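To make the stochastic (Gibbs) classifier concrete, here is a minimal sketch that estimates its error by Monte Carlo. The data, the posterior $\mathcal{N}(w^\star, \sigma_Q^2 I)$ over linear classifiers, and all numeric values are illustrative assumptions, not from the source.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary classification data (illustrative only)
X = rng.normal(size=(1000, 5))
w_true = rng.normal(size=5)
y = np.sign(X @ w_true + 0.3 * rng.normal(size=1000))

# Posterior Q: Gaussian around a "learned" weight vector w_star (assumption)
w_star = w_true + 0.1 * rng.normal(size=5)
sigma_q = 0.2

def gibbs_error(X, y, w_star, sigma_q, n_samples=200):
    """Monte Carlo estimate of the Gibbs error: average 0-1 error over h ~ Q."""
    errs = []
    for _ in range(n_samples):
        w = w_star + sigma_q * rng.normal(size=w_star.shape)  # sample h ~ Q
        errs.append(np.mean(np.sign(X @ w) != y))
    return float(np.mean(errs))

print(f"Estimated Gibbs error on this sample: {gibbs_error(X, y, w_star, sigma_q):.3f}")
```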
Why Distributions Over Hypotheses?
Interpretation 1: Bayesian Learning
In Bayesian learning, we start with a prior $P$ and update to a posterior $Q$ given data. PAC-Bayes provides frequentist guarantees for this Bayesian procedure.
Interpretation 2: Randomized Predictors
The distribution $Q$ represents randomized prediction: we average predictions over multiple hypotheses, often improving robustness.
Interpretation 3: Algorithm Analysis
A learning algorithm's output can be viewed as more than a single hypothesis: the learned parameters together with the training randomness define a distribution over hypotheses. For neural networks, $Q$ might be a Gaussian around the learned weights.
Interpretation 4: Compression
A 'tight' distribution $Q$ (concentrated on few hypotheses) indicates that the algorithm has found a simple, compressible solution. PAC-Bayes rewards compression.
The prior $P$ must be chosen before seeing data—this is crucial for validity. A common choice is a Gaussian prior: $P = \mathcal{N}(0, \sigma_P^2 I)$ on weights. The prior encodes 'simplicity bias'—preference for hypotheses with small norms or near a reference point.
| Aspect | Standard PAC | PAC-Bayes |
|---|---|---|
| Output | Single hypothesis $h$ | Distribution $Q$ over $\mathcal{H}$ |
| Guarantee | $R(h) \leq \hat{R}(h) + \text{complexity}$ | $R(Q) \leq \hat{R}(Q) + \sqrt{\text{KL}(Q||P)/m}$ (up to log factors) |
| Complexity Measure | VC-dim, Rademacher | KL divergence from prior |
| Prior Knowledge | Implicit in $\mathcal{H}$ | Explicit prior $P$ |
| Tightness for DL | Often vacuous | Can be non-vacuous |
We now state and prove the fundamental PAC-Bayes inequality, the cornerstone of the theory.
Theorem (McAllester's PAC-Bayes Bound): Let $P$ be any prior distribution over $\mathcal{H}$ chosen before seeing data. For any $\delta > 0$, with probability at least $1 - \delta$ over the draw of $S \sim \mathcal{D}^m$, simultaneously for all distributions $Q$ over $\mathcal{H}$:
$$\mathbb{E}_{h \sim Q}[R(h)] \leq \mathbb{E}_{h \sim Q}[\hat{R}(h)] + \sqrt{\frac{D_{\text{KL}}(Q \| P) + \ln(2\sqrt{m}/\delta)}{2m}}$$
where $D_{\text{KL}}(Q \| P) = \mathbb{E}_{h \sim Q}\left[\ln \frac{Q(h)}{P(h)}\right]$ is the KL divergence between the posterior $Q$ and the prior $P$.
Interpretation:
The bound says the generalization gap of a posterior $Q$ is controlled by the KL divergence $D_{\text{KL}}(Q \| P)$ plus a logarithmic confidence term, both divided by the sample size $m$.
The KL divergence plays the role of 'complexity'—it measures how much information about the data is encoded in $Q$.
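As a quick numerical illustration with assumed values (not from the source): take $m = 10{,}000$, $D_{\text{KL}}(Q \| P) = 100$, and $\delta = 0.05$. The complexity term is then
$$\sqrt{\frac{100 + \ln(2\sqrt{10000}/0.05)}{2 \cdot 10000}} = \sqrt{\frac{100 + \ln 4000}{20000}} \approx \sqrt{\frac{108.3}{20000}} \approx 0.074,$$
so the expected true error of the posterior exceeds its expected training error by at most about 7.4 percentage points, with probability at least 0.95.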
Proof Outline:
Step 1: Change of Measure
For any function $\phi(h)$: $$\mathbb{E}_{h \sim Q}[\phi(h)] = \mathbb{E}_{h \sim P}\left[\frac{Q(h)}{P(h)} \phi(h)\right]$$
This converts expectations under $Q$ to expectations under $P$ (which is data-independent).
Step 2: Donsker-Varadhan Variational Formula
For any random variable $X(h)$ under $P$: $$\ln \mathbb{E}_{h \sim P}[e^{X(h)}] = \sup_Q \left\{\mathbb{E}_{h \sim Q}[X(h)] - D_{\text{KL}}(Q \| P)\right\}$$
Rearranging, for any $Q$: $$\mathbb{E}_{h \sim Q}[X(h)] \leq D_{\text{KL}}(Q \| P) + \ln \mathbb{E}_{h \sim P}[e^{X(h)}]$$
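A minimal numerical check of the variational formula on a finite hypothesis set (all values are illustrative): the supremum is attained by the Gibbs distribution $Q^* \propto P\, e^{X}$, and any other $Q$ gives a smaller value.

```python
import numpy as np

rng = np.random.default_rng(1)

# Finite hypothesis set: uniform prior P over 5 hypotheses,
# X(h) is an arbitrary function of the hypothesis (illustrative values).
P = np.full(5, 0.2)
X = rng.normal(size=5)

def kl(Q, P):
    return np.sum(Q * np.log(Q / P))

lhs = np.log(np.sum(P * np.exp(X)))  # ln E_P[e^X]

# The supremum is attained by the Gibbs distribution Q* proportional to P * e^X
Q_star = P * np.exp(X) / np.sum(P * np.exp(X))
rhs_at_optimum = np.sum(Q_star * X) - kl(Q_star, P)

# Any other Q satisfies E_Q[X] - KL(Q||P) <= ln E_P[e^X]
Q_other = rng.dirichlet(np.ones(5))
rhs_other = np.sum(Q_other * X) - kl(Q_other, P)

print(f"ln E_P[e^X]          = {lhs:.6f}")
print(f"value at optimal Q*  = {rhs_at_optimum:.6f}")  # matches lhs
print(f"value at arbitrary Q = {rhs_other:.6f}")       # <= lhs
```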
Step 3: Apply to Generalization Gap
Let $X(h) = 2m (R(h) - \hat{R}(h))^2$. A standard concentration lemma gives $\mathbb{E}_S\,\mathbb{E}_{h \sim P}[e^{X(h)}] \leq 2\sqrt{m}$, so by Markov's inequality, with probability at least $1 - \delta$, $\ln \mathbb{E}_{h \sim P}[e^{X(h)}] \leq \ln(2\sqrt{m}/\delta)$. Plugging into the variational formula: $$\mathbb{E}_{h \sim Q}[(R(h) - \hat{R}(h))^2] \leq \frac{D_{\text{KL}}(Q \| P) + \ln(2\sqrt{m}/\delta)}{2m}$$
Jensen's inequality, $\big(\mathbb{E}_{h \sim Q}[R(h) - \hat{R}(h)]\big)^2 \leq \mathbb{E}_{h \sim Q}[(R(h) - \hat{R}(h))^2]$, and taking square roots complete the proof. $\square$
The PAC-Bayes bound holds simultaneously for all posteriors $Q$—there's no union bound over hypothesis classes! The cost of flexibility is paid through KL divergence: extreme posteriors (very different from the prior) pay a higher complexity penalty. This enables analyzing infinite-dimensional hypothesis classes like neural networks.
McAllester's bound is foundational but admits several refinements for practical applications.
Catoni's Bound (Tighter, Non-Square-Root Form):
With probability $\geq 1 - \delta$, simultaneously for all distributions $Q$:
$$D_{\text{kl}}\left(\mathbb{E}_{h \sim Q}[\hat{R}(h)] \,\Big\|\, \mathbb{E}_{h \sim Q}[R(h)]\right) \leq \frac{D_{\text{KL}}(Q \| P) + \ln(2\sqrt{m}/\delta)}{m}$$
where $D_{\text{kl}}(q \| p) = q \ln(q/p) + (1-q)\ln((1-q)/(1-p))$ is the binary KL divergence.
Solving for $R(Q)$ gives tighter bounds than McAllester for small training errors.
Seeger's Bound (Direct Binary KL Inversion):
For bounded losses in $[0, 1]$, with probability $\geq 1 - \delta$:
$$D_{\text{kl}}\left(\hat{R}(Q) \,\|\, R(Q)\right) \leq \frac{D_{\text{KL}}(Q \| P) + \ln(2\sqrt{m}/\delta)}{m}$$
Inverting this (numerically) gives the tightest known PAC-Bayes bounds.
Oracle PAC-Bayes (Data-Dependent Priors):
A subtle but powerful extension: split the sample, use the first part to construct the prior (for example, to choose its center), and apply the PAC-Bayes bound using only the second part. Because the prior never depends on the data used in the bound, validity is preserved, and the posterior typically stays much closer to this data-informed prior, shrinking the KL term (a sketch follows below).
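Here is a minimal sketch of the splitting idea, using ordinary least squares as a stand-in training procedure on synthetic data (everything here is an illustrative assumption, not the method of any particular paper): the prior is centered on weights fit to the first chunk, and the KL term is measured against that center rather than against zero.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic regression data; least squares keeps the sketch self-contained.
X = rng.normal(size=(2000, 20))
w_true = rng.normal(size=20)
y = X @ w_true + 0.5 * rng.normal(size=2000)

# Split: the first chunk builds the prior, the second chunk is used for the bound.
X1, y1 = X[:1000], y[:1000]
X2, y2 = X[1000:], y[1000:]

w_prior = np.linalg.lstsq(X1, y1, rcond=None)[0]  # prior center from chunk 1
w_post = np.linalg.lstsq(X, y, rcond=None)[0]     # posterior center from all data

sigma = 0.5  # shared isotropic std for prior and posterior (assumption)

# With equal variances, KL(N(w_post, s^2 I) || N(w_prior, s^2 I)) = ||w_post - w_prior||^2 / (2 s^2)
kl_data_dependent = np.sum((w_post - w_prior) ** 2) / (2 * sigma**2)
kl_zero_centered = np.sum(w_post ** 2) / (2 * sigma**2)

m2 = len(y2)  # only the second chunk counts toward m in the bound
print(f"KL with data-dependent prior: {kl_data_dependent:.2f}")
print(f"KL with zero-centered prior:  {kl_zero_centered:.2f}")
print(f"Effective sample size m:      {m2}")
```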
PAC-Bayes with Unbounded Losses:
Extensions handle regression (squared loss) and other unbounded losses by truncating or using sub-Gaussian assumptions.
Comparison of Bounds:
| Bound | Form | Best For |
|---|---|---|
| McAllester | $R(Q) \leq \hat{R}(Q) + \sqrt{\frac{\text{KL} + O(\log m)}{2m}}$ | Simplicity, interpretability |
| Catoni | Implicit via binary KL | Small training error, tight |
| Seeger | Binary KL inversion | Tightest theoretical bounds |
| Oracle | Data-dependent prior | Practical deep learning |
| Langford-Caruana | Holdout-validated | When validation available |
For the tightest bounds, one inverts the binary KL inequality numerically. Given $\hat{R}$ and $\text{KL}/m$, find the largest $R$ such that $D_{\text{kl}}(\hat{R} \,\|\, R) \leq \text{KL}/m$. This is a simple root-finding problem yielding much tighter bounds than the closed-form square-root version.
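As a rough numerical comparison with assumed values: suppose $\hat{R}(Q) = 0.01$ and $(D_{\text{KL}}(Q \| P) + \ln(2\sqrt{m}/\delta))/m = 0.05$. Numerically, the largest $R$ with $D_{\text{kl}}(0.01 \,\|\, R) \leq 0.05$ is about $0.078$, whereas the square-root form gives roughly
$$0.01 + \sqrt{0.05/2} \approx 0.17,$$
more than twice as large.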
The KL divergence $D_{\text{KL}}(Q | P)$ is the central quantity in PAC-Bayes. Understanding its role provides deep insight.
Information-Theoretic Interpretation:
KL divergence measures the 'information gained' when updating from prior $P$ to posterior $Q$: $$D_{\text{KL}}(Q | P) = \mathbb{E}_{h \sim Q}\left[\ln \frac{Q(h)}{P(h)}\right]$$
Coding Interpretation:
$D_{\text{KL}}(Q | P)$ is the extra bits needed to describe a hypothesis from $Q$ using a code optimized for $P$. Small KL means the posterior is 'efficient' relative to the prior.
Compression Interpretation:
PAC-Bayes rewards solutions that can be compressed: a posterior that can be described with few bits relative to the prior has a small KL divergence, and hence a small complexity penalty.
For Gaussian Prior/Posterior (Neural Networks):
In deep learning applications, the common choices are a Gaussian prior $P = \mathcal{N}(0, \sigma_P^2 I)$ and a Gaussian posterior $Q = \mathcal{N}(w, \sigma_Q^2 I)$ centered at the learned weights $w$.
The KL divergence has a closed form: $$D_{\text{KL}}(Q \| P) = \frac{d}{2}\left(\frac{\sigma_Q^2}{\sigma_P^2} - 1 - \ln\frac{\sigma_Q^2}{\sigma_P^2}\right) + \frac{\|w\|^2}{2\sigma_P^2}$$
where $d$ is the number of parameters.
Key Observation: The term $\|w\|^2 / (2\sigma_P^2)$ is exactly L2 regularization! PAC-Bayes provides a principled interpretation: L2 regularization bounds the KL divergence from an isotropic prior, which bounds generalization.
Practical Implication: For a network with $d$ parameters, $\|w\|^2$ bounded, and $\sigma_Q \to 0$ (deterministic): $$D_{\text{KL}}(Q \| P) \approx \frac{\|w\|^2}{2\sigma_P^2} + \frac{d}{2}\ln\frac{\sigma_P^2}{\sigma_Q^2}$$
Note that $d$ enters only through the term $\frac{d}{2}\ln(\sigma_P^2/\sigma_Q^2)$: the logarithmic factor can be kept small by choosing $\sigma_Q$ as close to $\sigma_P$ as accuracy allows, and this, together with a controlled weight norm, is what enables non-vacuous bounds for large networks!
Standard bounds (VC, Rademacher) scale with parameter count $d$, giving vacuous bounds (> 1) for modern networks with millions of parameters. The PAC-Bayes KL divergence scales as $\frac{d}{2}\ln(\sigma_P^2/\sigma_Q^2) + \|w\|^2/(2\sigma_P^2)$. With a large prior variance $\sigma_P$, a posterior noise level $\sigma_Q$ chosen as large as accuracy allows, and controlled $\|w\|^2$ (regularization), the KL divergence can be far smaller than $m$ even for large $d$.
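To see the tradeoff concretely, here is a minimal sketch (all values are assumptions chosen for illustration) that evaluates the closed-form KL above for a large $d$ across several posterior noise levels $\sigma_Q$; the complexity term only becomes small when $\sigma_Q$ approaches $\sigma_P$.

```python
import numpy as np

# Illustrative exploration with assumed values.
d = 1_000_000       # number of parameters
w_norm_sq = 400.0   # ||w||^2 after regularization (assumption)
sigma_p = 1.0       # prior std
m = 50_000          # training set size
delta = 0.05

def gaussian_kl(d, w_norm_sq, sigma_q, sigma_p):
    """KL(N(w, sigma_q^2 I) || N(0, sigma_p^2 I)) via the closed form above."""
    ratio = sigma_q**2 / sigma_p**2
    return 0.5 * d * (ratio - 1 - np.log(ratio)) + w_norm_sq / (2 * sigma_p**2)

for sigma_q in [0.01, 0.1, 0.5, 0.9, 0.99]:
    kl = gaussian_kl(d, w_norm_sq, sigma_q, sigma_p)
    complexity = np.sqrt((kl + np.log(2 * np.sqrt(m) / delta)) / (2 * m))
    print(f"sigma_q = {sigma_q:4.2f}:  KL = {kl:12.1f},  "
          f"McAllester complexity term = {complexity:8.3f}")
```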
PAC-Bayes provides some of the first non-trivial generalization guarantees for deep neural networks. Let's explore how.
The Challenge:
Deep networks have millions of parameters, and standard bounds (VC dimension, Rademacher complexity) scale with the parameter count $d$.
For VGG-16 ($\sim 138$ million parameters), these exceed 1—vacuous!
PAC-Bayes Approach:
Step 1: Choose a Prior $$P = \mathcal{N}(0, \sigma_P^2 I)$$ with $\sigma_P$ chosen to reflect prior belief about weight scale.
Step 2: Define Posterior Around Learned Weights $$Q = \mathcal{N}(w^*, \sigma_Q^2 I)$$ where $w^*$ is the trained network's weights.
Step 3: Bound Stochastic Classifier Error
The stochastic classifier samples $w \sim Q$ and predicts. The expected error can be bounded using PAC-Bayes.
The Compression Perspective (Arora et al., 2018):
A key insight: neural networks can often be compressed without losing accuracy:
If a network can be compressed from $d$ parameters to effectively $k < d$ parameters, the PAC-Bayes bound depends on $k$ rather than $d$.
Theorem (Informal): If a neural network can be compressed to use $k$ effective bits (through quantization, pruning, etc.) while maintaining accuracy $\hat{R}$, then with high probability: $$R \leq \hat{R} + O\left(\sqrt{\frac{k + \log(1/\delta)}{m}}\right)$$
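To connect bits to bounds, here is a minimal sketch of the finite-class (Occam-style) arithmetic behind this informal statement, under the assumption that a compressed model is fully described by $k$ bits, so there are at most $2^k$ candidate hypotheses; the parameter counts and bit widths below are illustrative only.

```python
import numpy as np

def occam_compression_bound(train_error, k_bits, m, delta=0.05):
    """Finite-class style bound for a model describable in k_bits:
    R <= R_hat + sqrt((k_bits * ln(2) + ln(1/delta)) / (2 * m))."""
    return train_error + np.sqrt((k_bits * np.log(2) + np.log(1 / delta)) / (2 * m))

m = 50_000
train_error = 0.02

# Raw float32 parameters vs. an aggressively compressed description (assumed sizes).
for name, k_bits in [("raw float32 parameters (1M x 32 bits)", 1_000_000 * 32),
                     ("aggressively compressed (5,000 bits)", 5_000)]:
    print(f"{name}: bound = {occam_compression_bound(train_error, k_bits, m):.3f}")
```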
Noise Stability (Dziugaite & Roy, 2017):
Alternatively, if a network is stable to noise injection (adding Gaussian noise to weights doesn't hurt accuracy much), then a posterior $Q$ with substantial variance still achieves low empirical error, which keeps the KL divergence to the prior small.
Dziugaite & Roy (2017) achieved a non-vacuous PAC-Bayes bound of ~17% error on MNIST for a 3-layer network (vs. ~2% test error). While not tight, this was the first non-trivial bound for a neural network! Subsequent work has achieved ~5-10% bounds on CIFAR-10 and other datasets.
Let's implement PAC-Bayes bound computation and see how to evaluate bounds for real models.
```python
import numpy as np
from scipy.optimize import brentq


def kl_binary(q: float, p: float) -> float:
    """Binary KL divergence: D_kl(q || p) = q*log(q/p) + (1-q)*log((1-q)/(1-p))."""
    if q == 0:
        return -np.log(1 - p) if p < 1 else np.inf
    if q == 1:
        return -np.log(p) if p > 0 else np.inf
    if p == 0 or p == 1:
        return np.inf
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))


def kl_gaussian(mu_q: np.ndarray, sigma_q: float,
                mu_p: np.ndarray, sigma_p: float) -> float:
    """KL divergence between diagonal Gaussians:
    D_KL(N(mu_q, sigma_q^2 I) || N(mu_p, sigma_p^2 I)).
    """
    d = len(mu_q)
    diff = mu_q - mu_p
    kl = 0.5 * (
        d * (sigma_q**2 / sigma_p**2 - 1 - np.log(sigma_q**2 / sigma_p**2))
        + np.sum(diff**2) / sigma_p**2
    )
    return kl


def mcallester_bound(empirical_risk: float, kl_div: float,
                     m: int, delta: float = 0.05) -> float:
    """McAllester's PAC-Bayes bound (square-root form):
    R(Q) <= R_hat(Q) + sqrt((KL(Q||P) + ln(2*sqrt(m)/delta)) / (2*m)).
    """
    complexity = np.sqrt((kl_div + np.log(2 * np.sqrt(m) / delta)) / (2 * m))
    return min(1.0, empirical_risk + complexity)


def seeger_bound(empirical_risk: float, kl_div: float,
                 m: int, delta: float = 0.05) -> float:
    """Seeger's PAC-Bayes bound via binary KL inversion:
    find the largest R such that D_kl(R_hat || R) <= (KL + ln(2*sqrt(m)/delta)) / m.
    """
    bound_rhs = (kl_div + np.log(2 * np.sqrt(m) / delta)) / m
    if bound_rhs >= 1:  # Vacuous
        return 1.0

    def objective(R):
        if R <= empirical_risk:
            return -np.inf
        return kl_binary(empirical_risk, R) - bound_rhs

    try:
        # Root-finding for the largest R satisfying the binary KL inequality
        return brentq(objective, empirical_risk + 1e-10, 1 - 1e-10)
    except ValueError:
        return 1.0  # Vacuous


def evaluate_pac_bayes(train_error: float, test_error: float,
                       weights: np.ndarray, sigma_q: float, sigma_p: float,
                       n_train: int, delta: float = 0.05) -> dict:
    """Evaluate PAC-Bayes bounds for a neural network.

    Args:
        train_error: Empirical training error (0-1 loss).
        test_error: True test error (for comparison only).
        weights: Flattened weight vector of the network.
        sigma_q: Standard deviation of the posterior perturbation.
        sigma_p: Standard deviation of the prior (centered at 0).
        n_train: Number of training samples.
        delta: Confidence parameter.
    """
    # Compute KL divergence (prior centered at 0)
    mu_p = np.zeros_like(weights)
    kl = kl_gaussian(weights, sigma_q, mu_p, sigma_p)

    # Compute bounds
    mcallester = mcallester_bound(train_error, kl, n_train, delta)
    seeger = seeger_bound(train_error, kl, n_train, delta)

    return {
        'kl_divergence': kl,
        'mcallester_bound': mcallester,
        'seeger_bound': seeger,
        'test_error': test_error,
        'is_non_vacuous': seeger < 0.5,  # Meaningful if < random guessing
    }


# Demonstration
print("PAC-Bayes Bound Evaluation")
print("=" * 60)

# Simulated neural network scenarios
scenarios = [
    {'name': 'Small network, well-regularized',
     'd': 10000, 'w_norm': 10, 'train_err': 0.02, 'test_err': 0.05},
    {'name': 'Large network, overfit',
     'd': 1000000, 'w_norm': 100, 'train_err': 0.0, 'test_err': 0.15},
    {'name': 'Large network, regularized',
     'd': 1000000, 'w_norm': 20, 'train_err': 0.01, 'test_err': 0.08},
]

n_train = 50000
sigma_p = 10.0  # Prior std
sigma_q = 0.1   # Posterior perturbation std

for scenario in scenarios:
    # Simulate a weight vector with the given norm
    d = scenario['d']
    weights = np.random.randn(d)
    weights = weights / np.linalg.norm(weights) * scenario['w_norm']

    result = evaluate_pac_bayes(
        scenario['train_err'], scenario['test_err'],
        weights, sigma_q, sigma_p, n_train
    )

    print(f"\n{scenario['name']}")
    print(f"  Parameters: {d:,}, ||w||: {scenario['w_norm']}")
    print(f"  Train error: {scenario['train_err']:.2%}")
    print(f"  Test error: {scenario['test_err']:.2%}")
    print(f"  KL divergence: {result['kl_divergence']:.1f}")
    print(f"  McAllester bound: {result['mcallester_bound']:.2%}")
    print(f"  Seeger bound: {result['seeger_bound']:.2%}")
    print(f"  Non-vacuous: {result['is_non_vacuous']}")
```

For tighter PAC-Bayes bounds: (1) Use a large prior variance $\sigma_P$ to reduce $\|w\|^2/\sigma_P^2$; (2) Regularize during training to keep $\|w\|$ small; (3) Use noise injection to find the largest $\sigma_Q$ that preserves accuracy; (4) Consider data-dependent priors (using a held-out set); (5) Use binary KL inversion (Seeger) instead of square-root bounds.
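As a companion to item (3) above, here is a minimal sketch of a noise-stability search: find the largest $\sigma_Q$ whose perturbed training error stays within a tolerance. The model (a stand-in linear classifier on synthetic data), the tolerance, and the search range are all illustrative assumptions, and the search is approximate because the error estimate is stochastic.

```python
import numpy as np

rng = np.random.default_rng(4)

# Stand-in "trained" linear model on synthetic data (illustrative only).
X = rng.normal(size=(2000, 30))
w_star = rng.normal(size=30)
y = np.sign(X @ w_star)

def perturbed_error(sigma_q, n_samples=50):
    """Monte Carlo estimate of training error under weight perturbations w* + sigma_q * noise."""
    errs = [np.mean(np.sign(X @ (w_star + sigma_q * rng.normal(size=30))) != y)
            for _ in range(n_samples)]
    return float(np.mean(errs))

def largest_stable_sigma(tolerance=0.05, lo=1e-3, hi=10.0, iters=20):
    """Approximate binary search for the largest sigma_q with perturbed error <= tolerance."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)  # geometric midpoint since scales span decades
        if perturbed_error(mid) <= tolerance:
            lo = mid
        else:
            hi = mid
    return lo

sigma_q = largest_stable_sigma()
print(f"Largest sigma_q with perturbed training error <= 5%: {sigma_q:.3f}")
```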
PAC-Bayes connects to many important concepts in machine learning theory.
Connection 1: Minimum Description Length (MDL)
The PAC-Bayes bound can be rewritten as: $$R(Q) \leq \hat{R}(Q) + \sqrt{\frac{L_P(Q) + O(\log m)}{m}}$$
where $L_P(Q) = D_{\text{KL}}(Q \| P)$ plays the role of the description length of the posterior relative to the prior. The MDL principle says to prefer hypotheses with short descriptions; PAC-Bayes formalizes this.
Connection 2: Bayesian Inference
The Bayesian posterior $Q^* = P(h \mid S) \propto P(S \mid h) P(h)$ minimizes a combination of empirical risk and KL divergence: $$Q^* = \arg\min_Q \left\{\mathbb{E}_{h \sim Q}[\hat{R}(h)] + \frac{1}{m} D_{\text{KL}}(Q \| P)\right\}$$
This is the Gibbs posterior. PAC-Bayes shows it has optimal generalization guarantees!
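A minimal sketch on a finite hypothesis class (1-D threshold classifiers on synthetic data; all values are illustrative): the minimizer of the objective above has the closed form $Q^*(h) \propto P(h)\, e^{-m \hat{R}(h)}$, which we compute directly.

```python
import numpy as np

rng = np.random.default_rng(5)

# Data: 1-D threshold classification (illustrative)
x = rng.uniform(-1, 1, size=200)
y = (x > 0.1).astype(int)
m = len(x)

# Finite hypothesis class: thresholds on a grid, uniform prior P
thresholds = np.linspace(-1, 1, 201)
P = np.full(len(thresholds), 1 / len(thresholds))

# Empirical risk of each hypothesis h_t(x) = 1[x > t]
R_hat = np.array([np.mean((x > t).astype(int) != y) for t in thresholds])

# Gibbs posterior: Q*(h) proportional to P(h) * exp(-m * R_hat(h))
log_unnorm = np.log(P) - m * R_hat
Q = np.exp(log_unnorm - log_unnorm.max())
Q /= Q.sum()

kl = np.sum(Q * (np.log(Q + 1e-300) - np.log(P)))
print(f"Posterior concentrates near t = {thresholds[np.argmax(Q)]:.2f}")
print(f"E_Q[R_hat] = {np.sum(Q * R_hat):.4f},  KL(Q||P) = {kl:.2f}")
```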
Connection 3: Regularization
For Gaussian posteriors, the KL divergence includes an L2 norm term, so weight decay can be read as directly controlling the PAC-Bayes complexity penalty.
Extension 1: PAC-Bayes + Margins
Combining PAC-Bayes with margin analysis gives even tighter bounds: $$R_{0-1}(Q) \leq \hat{R}_\gamma(Q) + \frac{1}{\gamma}\sqrt{\frac{D_{\text{KL}}(Q \| P) + O(\log m)}{m}}$$
Large margins reduce the complexity penalty!
Extension 2: PAC-Bayes for Gradient Descent
Recent work analyzes gradient descent through PAC-Bayes, treating the distribution over parameters induced by SGD's randomness (initialization, mini-batch order) as the posterior $Q$.
Extension 3: PAC-Bayes with Excess Risk
Instead of bounding $R(Q)$ absolutely, bound the excess risk $R(Q) - R^*$ where $R^*$ is the Bayes-optimal error: $$R(Q) - R^* \leq \hat{R}(Q) - \hat{R}^* + O\left(\sqrt{\frac{D_{\text{KL}}(Q \| P)}{m}}\right)$$
This shows how quickly we approach the best possible classifier.
A frontier research direction: SGD with specific learning rates and batch sizes converges to a posterior $Q$ that is 'close' to a data-independent prior $P$, explaining why overparameterized networks trained with SGD generalize. The optimization-dependent implicit regularization creates a 'compressible' posterior.
We have developed PAC-Bayes bounds—the frontier of generalization theory, providing algorithm-specific guarantees that can be non-vacuous for modern deep networks.
Module Complete!
You have now mastered the complete toolkit of generalization bounds in statistical learning theory: concentration inequalities and the union bound, VC dimension and Rademacher complexity, margin-based analysis, and PAC-Bayes bounds.
Together, these tools form the mathematical foundation for understanding why machine learning works—why algorithms that minimize empirical risk on finite samples can generalize to unseen data. This understanding guides algorithm design, architecture choices, regularization strategies, and the development of new learning methods.
You have completed Module 6: Generalization Bounds. You now command the full mathematical machinery of statistical learning theory—from foundational concentration inequalities through sophisticated PAC-Bayes analysis. These tools provide the theoretical guarantees that underpin all of modern machine learning, explaining when and why learning from data is possible.