The ultimate goal of statistical learning theory is to answer: What guarantees can we make about a model's performance on unseen data, based only on its training performance?
VC dimension provides the key. By measuring hypothesis class complexity, we can derive generalization bounds—mathematical inequalities that bound the gap between training error and true (test) error. These bounds tell us how far training performance can stray from test performance, and how fast that gap shrinks as the sample grows.
This page develops the fundamental VC generalization bound, building from concentration inequalities to the celebrated Vapnik-Chervonenkis theorem.
By the end of this page, you will understand how to derive generalization bounds using VC dimension, interpret what these bounds mean for practical learning, and appreciate both the power and limitations of VC-based analysis.
The Learning Framework:
We work in the standard supervised learning setup: an unknown distribution $\mathcal{D}$ over labeled examples $(x, y)$, an i.i.d. training sample $S = ((x_1, y_1), \ldots, (x_m, y_m)) \sim \mathcal{D}^m$, and a hypothesis class $\mathcal{H}$ of classifiers $h: \mathcal{X} \to \{0, 1\}$.
True Risk (Test Error):
For any hypothesis $h \in \mathcal{H}$, the true risk is: $$L_\mathcal{D}(h) = \mathbb{P}_{(x,y) \sim \mathcal{D}}[h(x) \neq y]$$
This is what we truly care about—performance on the unknown distribution.
Empirical Risk (Training Error):
The empirical risk on sample $S$ is: $$\hat{L}_S(h) = \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}[h(x_i) \neq y_i]$$
This is what we can compute from data.
The central question of learning theory: How close is the empirical risk L̂_S(h) to the true risk L_D(h)? If they're reliably close, we can use training performance to predict test performance. VC dimension provides the answer.
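To make the two risks concrete, here is a minimal sketch on a hypothetical problem (my own setup, not from the text): a noisy threshold task where $x \sim \text{Uniform}[0,1]$, $y = \mathbf{1}[x > 0.5]$ with labels flipped with probability 0.1. The empirical risk is computed on a small sample; a very large sample serves as a proxy for the unknowable true risk.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(m):
    """Draw m examples: y = 1[x > 0.5], with 10% label noise (assumed setup)."""
    x = rng.uniform(0, 1, m)
    y = (x > 0.5).astype(int)
    flip = rng.uniform(0, 1, m) < 0.1
    return x, np.where(flip, 1 - y, y)

def empirical_risk(threshold, x, y):
    """Empirical risk L_S(h) of the threshold classifier h(x) = 1[x > threshold]."""
    preds = (x > threshold).astype(int)
    return float(np.mean(preds != y))

# For h(x) = 1[x > 0.5], the true risk equals the noise rate, 0.1.
x_train, y_train = sample(200)       # small sample: a noisy estimate
x_big, y_big = sample(200_000)       # huge sample: proxy for L_D(h)

print("L_S(h), m=200:      ", empirical_risk(0.5, x_train, y_train))
print("L_D(h), approx:     ", empirical_risk(0.5, x_big, y_big))
```

The small-sample estimate fluctuates around 0.1; quantifying that fluctuation uniformly over $\mathcal{H}$ is exactly what the rest of this page does.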
The Generalization Gap:
The generalization gap for a hypothesis $h$ is: $$|L_\mathcal{D}(h) - \hat{L}_S(h)|$$
We want to bound this quantity uniformly over all $h \in \mathcal{H}$: $$\sup_{h \in \mathcal{H}} |L_\mathcal{D}(h) - \hat{L}_S(h)|$$
If this supremum is small, then for any hypothesis we select (including the ERM solution), its training error is a reliable estimate of its true error.
The Goal:
Derive a bound of the form: $$\mathbb{P}\left[\sup_{h \in \mathcal{H}} |L_\mathcal{D}(h) - \hat{L}_S(h)| > \epsilon\right] \leq \delta(m, d, \epsilon)$$
where $\delta$ decreases with sample size $m$ and depends on VCdim$(\mathcal{H}) = d$.
Before tackling infinite hypothesis classes, let's develop intuition with finite ones.
Setup: Let $|\mathcal{H}| = N$ (a finite class of $N$ hypotheses).
For a Single Hypothesis:
Fix any $h \in \mathcal{H}$. The empirical risk $\hat{L}_S(h) = \frac{1}{m}\sum_i \mathbf{1}[h(x_i) \neq y_i]$ is an average of i.i.d. Bernoulli random variables with mean $L_\mathcal{D}(h)$.
By Hoeffding's inequality: $$\mathbb{P}[|\hat{L}_S(h) - L_\mathcal{D}(h)| > \epsilon] \leq 2e^{-2m\epsilon^2}$$
Hoeffding's inequality states that for n i.i.d. bounded random variables X_i ∈ [a,b] with mean μ, the sample mean X̄ satisfies: P[|X̄ - μ| > ε] ≤ 2exp(-2nε²/(b-a)²). For Bernoulli (0/1) variables, b-a=1, giving the clean bound above.
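A quick Monte Carlo sanity check of this bound, under an assumed setup (a fixed hypothesis with true risk $\mu = 0.3$ and $m = 100$; the numbers are mine, for illustration only):

```python
import numpy as np
from math import exp

rng = np.random.default_rng(1)
mu, m, eps, trials = 0.3, 100, 0.1, 100_000

# Each trial: empirical risk = mean of m Bernoulli(mu) losses.
emp_risks = rng.binomial(m, mu, size=trials) / m

# Fraction of trials where the deviation exceeds eps, vs the Hoeffding bound.
observed = float(np.mean(np.abs(emp_risks - mu) > eps))
hoeffding = 2 * exp(-2 * m * eps**2)   # = 2 e^{-2 m eps^2}

print(f"Observed P[|L_S - L_D| > {eps}]: {observed:.4f}")
print(f"Hoeffding bound:                 {hoeffding:.4f}")
```

The observed deviation probability sits well below the bound, as expected for a worst-case inequality.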
Extending to the Whole Class (Union Bound):
We want to bound the probability that any hypothesis has a large generalization gap: $$\mathbb{P}\left[\exists h \in \mathcal{H}: |\hat{L}_S(h) - L_\mathcal{D}(h)| > \epsilon\right]$$
Using the union bound: $$\mathbb{P}\left[\exists h: |\hat{L}_S(h) - L_\mathcal{D}(h)| > \epsilon\right] \leq \sum_{h \in \mathcal{H}} \mathbb{P}[|\hat{L}_S(h) - L_\mathcal{D}(h)| > \epsilon]$$ $$\leq N \cdot 2e^{-2m\epsilon^2} = 2|\mathcal{H}|e^{-2m\epsilon^2}$$
Finite Class Generalization Bound:
With probability at least $1 - \delta$: $$\forall h \in \mathcal{H}: |L_\mathcal{D}(h) - \hat{L}_S(h)| \leq \sqrt{\frac{\ln|\mathcal{H}| + \ln(2/\delta)}{2m}}$$
| Class Size N | Samples for ε=0.1, δ=0.05 | Log Factor |
|---|---|---|
| 10 | ~300 | ln(10) ≈ 2.3 |
| 1,000 | ~600 | ln(1000) ≈ 6.9 |
| 1,000,000 | ~900 | ln(10⁶) ≈ 13.8 |
| 10¹⁰⁰ | ~12,300 | ln(10¹⁰⁰) ≈ 230 |
Crucially, sample complexity depends on log|H|, not |H|. This means even astronomically large finite classes are learnable with reasonable sample sizes. But what about infinite classes?
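Inverting the finite-class bound for $m$ makes the logarithmic dependence visible. A minimal sketch (the helper name is mine; the table's entries are rounded more generously than this exact inversion):

```python
from math import log, ceil

def finite_class_sample_size(N, eps, delta):
    """Smallest m with sqrt((ln N + ln(2/delta)) / (2m)) <= eps."""
    return ceil((log(N) + log(2 / delta)) / (2 * eps**2))

# Sample sizes for eps = 0.1, delta = 0.05 as |H| explodes.
for N in [10, 1_000, 1_000_000, 10**100]:
    m = finite_class_sample_size(N, eps=0.1, delta=0.05)
    print(f"|H| = 10^{len(str(N)) - 1:>3}: m >= {m:,}")
```

Multiplying $|\mathcal{H}|$ by a factor of 100 adds only $\ln(100) \approx 4.6$ to the numerator, which is why even $|\mathcal{H}| = 10^{100}$ stays tractable.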
For infinite hypothesis classes, the union bound over all hypotheses gives a trivially useless bound ($\infty \cdot \epsilon = \infty$). We need a smarter approach.
The Key Insight: Effective Finiteness
Even though $\mathcal{H}$ may be infinite, what matters for learning is how $\mathcal{H}$ behaves on the data we actually see. On any finite sample $S$ of size $m$, the hypothesis class induces at most $|\mathcal{H}(S)| \leq \Pi_\mathcal{H}(m)$ distinct dichotomies.
From the learner's perspective, hypotheses that agree on all sample points are indistinguishable—they have the same empirical risk. So we only need to count equivalence classes of hypotheses.
The Symmetrization Trick:
The formal argument uses a 'ghost sample' $S'$ of the same size as $S$, drawn independently from $\mathcal{D}$.
Step 1: Replace the population risk with empirical risk on $S'$: $$|L_\mathcal{D}(h) - \hat{L}_S(h)| \approx |\hat{L}_{S'}(h) - \hat{L}_S(h)|$$
For large $m$, $\hat{L}_{S'}(h)$ concentrates around $L_\mathcal{D}(h)$.
Step 2: Condition on the combined sample $S \cup S'$: Once we fix $S \cup S'$, the only randomness is in the partition—which elements go to $S$ vs $S'$.
Step 3: Use symmetry: For any fixed partition of $2m$ points, the expected difference between $S$-risk and $S'$-risk is zero by symmetry. Concentration gives tail bounds.
Symmetrization is one of the most powerful techniques in learning theory. By introducing a 'test set' that is statistically identical to the training set, we can convert statements about the unknown distribution D to statements about observable quantities.
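Steps 2 and 3 can be simulated directly: fix a combined sample of $2m$ loss values for one hypothesis, repeatedly split it at random into $S$ and $S'$, and observe that the risk gap is symmetric around zero and concentrates. The loss values below are hypothetical (0/1 losses with rate 0.3), chosen only to illustrate the argument:

```python
import numpy as np

rng = np.random.default_rng(2)
m = 100
losses = rng.binomial(1, 0.3, size=2 * m)   # fixed combined sample S ∪ S'

gaps = []
for _ in range(20_000):
    perm = rng.permutation(2 * m)           # random partition into S and S'
    L_S = losses[perm[:m]].mean()
    L_Sp = losses[perm[m:]].mean()
    gaps.append(L_S - L_Sp)

gaps = np.array(gaps)
print(f"Mean gap over random partitions: {gaps.mean():+.4f}")   # ~0 by symmetry
print(f"P[|gap| > 0.1]:                  {np.mean(np.abs(gaps) > 0.1):.4f}")
```

Once the combined sample is fixed, the gap depends only on the random partition, which is exactly the observable quantity the proof bounds.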
Effective Hypothesis Count:
After symmetrization, we apply the union bound over dichotomies on the combined sample $S \cup S'$:
$$\text{Number of distinct behaviors} \leq \Pi_\mathcal{H}(2m)$$
By the Sauer-Shelah lemma, if VCdim$(\mathcal{H}) = d$: $$\Pi_\mathcal{H}(2m) \leq \left(\frac{2em}{d}\right)^d$$
This is polynomial in $m$! The infinite hypothesis class acts like a finite class of size $(2em/d)^d$ when restricted to samples of size $m$.
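The gap between the Sauer-Shelah count and the unrestricted $2^{2m}$ dichotomies is easy to check numerically; a short sketch with an assumed $d = 5$:

```python
from math import comb, e

def sauer_bound(n, d):
    """Sauer-Shelah: Pi_H(n) <= sum_{i=0}^{d} C(n, i)."""
    return sum(comb(n, i) for i in range(d + 1))

d = 5
for m in [10, 100, 500]:
    n = 2 * m                               # combined sample size
    sauer = sauer_bound(n, d)               # exact Sauer-Shelah sum
    poly = (2 * e * m / d) ** d             # the (2em/d)^d relaxation
    print(f"m={m:>3}: Sauer sum={sauer:.3e}, (2em/d)^d={poly:.3e}, 2^(2m)={2**n:.3e}")
```

At $m = 500$ the polynomial count is around $10^{13}$ while $2^{2m} \approx 10^{301}$: restricting to samples collapses the infinite class to polynomially many behaviors.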
We now state and prove the fundamental VC generalization bound.
Theorem (Vapnik-Chervonenkis Generalization Bound):
Let $\mathcal{H}$ be a hypothesis class with VCdim$(\mathcal{H}) = d < \infty$. For any distribution $\mathcal{D}$ and any $\delta > 0$, with probability at least $1 - \delta$ over a random sample $S$ of size $m$:
$$\forall h \in \mathcal{H}: L_\mathcal{D}(h) \leq \hat{L}_S(h) + \sqrt{\frac{d \ln(em/d) + \ln(4/\delta)}{m}}$$
Equivalently, the uniform deviation is bounded: $$\sup_{h \in \mathcal{H}} |L_\mathcal{D}(h) - \hat{L}_S(h)| \leq O\left(\sqrt{\frac{d \ln(m/d)}{m}}\right)$$
This theorem is profound: it guarantees that for ANY hypothesis class with finite VC dimension, training error converges uniformly to true error. The generalization gap shrinks as O(√(d/m)), meaning we need m ∝ d samples to learn. VC dimension precisely characterizes sample complexity!
Proof Sketch:
Step 1: Symmetrization
Let $S'$ be an independent 'ghost sample'. For $h$ with $L_\mathcal{D}(h) - \hat{L}_S(h) > \epsilon$: $$\mathbb{P}[\hat{L}_{S'}(h) - \hat{L}_S(h) > \epsilon/2] \geq \frac{1}{2}$$ when $m$ is large enough (since $\hat{L}_{S'}(h) \to L_\mathcal{D}(h)$).
Step 2: Union Bound over Dichotomies
$$\mathbb{P}[\exists h: \hat{L}_{S'}(h) - \hat{L}_S(h) > \epsilon/2] \leq \Pi_\mathcal{H}(2m) \cdot \mathbb{P}[\text{fixed } h \text{ has gap } > \epsilon/2]$$
Step 3: Concentration on Fixed Dichotomy
For a fixed dichotomy, the difference $\hat{L}_{S'}(h) - \hat{L}_S(h)$ is a difference of averages. By Hoeffding: $$\mathbb{P}[\hat{L}_{S'}(h) - \hat{L}_S(h) > \epsilon/2] \leq e^{-m\epsilon^2/8}$$
Step 4: Combine
$$\mathbb{P}[\exists h: |L_\mathcal{D}(h) - \hat{L}_S(h)| > \epsilon] \leq 2\Pi_\mathcal{H}(2m) \cdot e^{-m\epsilon^2/8}$$
Substituting $\Pi_\mathcal{H}(2m) \leq (2em/d)^d$ and solving for $\epsilon$ in terms of $\delta$ gives the bound. $\blacksquare$
The VC bound directly answers the fundamental question: how many samples are sufficient to learn?
Sample Complexity for PAC Learning:
To achieve error at most $\epsilon$ above the best-in-class error with probability at least $1 - \delta$, it suffices to have:
$$m \geq O\left(\frac{d + \ln(1/\delta)}{\epsilon^2}\right)$$
where $d = $ VCdim$(\mathcal{H})$.
More Precisely:
$$m \geq \frac{8}{\epsilon^2}\left(d \ln\frac{16e}{\epsilon} + \ln\frac{4}{\delta}\right)$$
ensures that with probability $\geq 1 - \delta$, the ERM hypothesis $\hat{h}$ satisfies $L_\mathcal{D}(\hat{h}) \leq \min_{h \in \mathcal{H}} L_\mathcal{D}(h) + \epsilon$.
| VC Dimension d | Sample Requirement m | Example Hypothesis Class |
|---|---|---|
| 2 | ~6,500 | Intervals on ℝ |
| 3 | ~9,700 | Lines in ℝ² (halfspaces) |
| 10 | ~32,000 | Hyperplanes in ℝ⁹ |
| 100 | ~320,000 | Hyperplanes in ℝ⁹⁹ |
| 1,000 | ~3,200,000 | Large linear models |
These bounds are often pessimistic. The O(d/ε²) scaling is tight in the worst case, but many real problems have extra structure (noise, margin, low intrinsic dimension) that allows faster learning. VC bounds give guarantees, not predictions—actual performance is often better.
Lower Bounds:
Remarkably, VC dimension also characterizes necessary sample complexity. There exist distributions where learning requires: $$m \geq \Omega\left(\frac{d + \ln(1/\delta)}{\epsilon}\right)$$
The gap between $\epsilon^{-2}$ (upper) and $\epsilon^{-1}$ (lower) is the 'slow rates' vs 'fast rates' phenomenon. Under additional assumptions (low noise, margin conditions), one can achieve the faster $1/\epsilon$ rate.
The VC generalization bound has several important components worth examining carefully.
Decomposing the Bound:
$$L_\mathcal{D}(\hat{h}) \leq \underbrace{\hat{L}_S(\hat{h})}_{\text{Training error}} + \underbrace{\sqrt{\frac{d \ln(em/d) + \ln(4/\delta)}{m}}}_{\text{Complexity penalty}}$$
Training error term: What we can measure. Lower is better.
Complexity penalty term: The price we pay for using a complex hypothesis class.
The Bias-Variance Connection:
This decomposition echoes the bias-variance trade-off: a richer class (larger $d$) can drive training error down but pays a larger complexity penalty, while a simpler class keeps the penalty small at the cost of higher training error.
The bound suggests choosing $d$ to minimize the sum.
How useful are VC bounds in practice? The honest answer: mixed.
Why VC Bounds Often Seem Loose:
Worst-case analysis: Bounds hold for all distributions, but most real distributions have exploitable structure.
Uniform over all hypotheses: We bound all of $\mathcal{H}$, but the algorithm may only explore a small part.
No algorithm information: ERM finds a solution, but SGD finds a specific solution with additional implicit regularization.
Overparameterized models: Modern neural networks have $d \gg m$ yet generalize, violating VC intuition.
Worked example: d = 101 (linear classifier), m = 10,000, δ = 0.05. The VC bound gives:
$\epsilon = \sqrt{\frac{101 \cdot \ln(10000 \cdot e/101) + \ln(80)}{10000}}$
$= \sqrt{\frac{101 \cdot \ln(269.1) + 4.38}{10000}}$
$= \sqrt{\frac{101 \cdot 5.60 + 4.38}{10000}}$
$= \sqrt{\frac{570}{10000}} \approx 0.239$
With 95% confidence: Test error ≤ Training error + 0.24.
For training error 5%, the bound guarantees test error ≤ 29%. This is a valid guarantee but likely very loose—actual test error is probably closer to 6-8% for a well-trained model.
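The arithmetic in the worked example is a one-liner to verify:

```python
from math import log, sqrt, e

# Worked example from the text: d = 101, m = 10,000, delta = 0.05.
d, m, delta = 101, 10_000, 0.05

# epsilon = sqrt((d * ln(em/d) + ln(4/delta)) / m)
eps = sqrt((d * log(e * m / d) + log(4 / delta)) / m)

print(f"em/d          = {e * m / d:.1f}")   # ≈ 269.1
print(f"bound epsilon = {eps:.3f}")         # ≈ 0.239
```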
VC bounds are most valuable as qualitative guidance: they tell you how the generalization gap scales with $d$ and $m$, and when a class is too complex for the available data. For tighter numerical bounds, use validation or more refined techniques (Rademacher complexity, PAC-Bayes).
The basic VC bound can be refined in several directions.
Rademacher Complexity:
A tighter complexity measure that is data-dependent:
$$\mathcal{R}_S(\mathcal{H}) = \mathbb{E}_\sigma\left[\sup_{h \in \mathcal{H}} \frac{1}{m} \sum_{i=1}^m \sigma_i h(x_i)\right]$$
where $\sigma_i$ are i.i.d. Rademacher random variables ($\pm 1$ with equal probability).
Bound: $$L_\mathcal{D}(h) \leq \hat{L}_S(h) + 2\mathcal{R}_S(\mathcal{H}) + \sqrt{\frac{\ln(1/\delta)}{2m}}$$
Advantages: it is computed on the actual sample (data-dependent rather than worst-case over all distributions), it extends naturally to real-valued function classes, and it often yields tighter constants than the VC bound.
Relationship to VC: $$\mathcal{R}_S(\mathcal{H}) = O\left(\sqrt{\frac{d}{m}}\right)$$
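Because $\mathcal{R}_S(\mathcal{H})$ is an expectation over random signs, it can be estimated by Monte Carlo. A sketch for an assumed simple class, thresholds $h_t(x) = \mathbf{1}[x > t]$ on a fixed sample (on sorted points every threshold selects a suffix, so the supremum reduces to a max over $m+1$ suffix sums):

```python
import numpy as np

rng = np.random.default_rng(3)
m = 200
x = np.sort(rng.uniform(0, 1, m))   # fixed sample (values unused beyond sorting)

def sup_correlation(sigma):
    """sup over thresholds of (1/m) * sum_i sigma_i * h_t(x_i)."""
    # Suffix sums of sigma, plus 0 for the empty suffix (t >= max x).
    suffix_sums = np.concatenate(([0.0], np.cumsum(sigma[::-1])))
    return suffix_sums.max() / m

estimates = [sup_correlation(rng.choice([-1, 1], size=m)) for _ in range(5_000)]
rad = float(np.mean(estimates))

print(f"Estimated R_S(H), m={m}:      {rad:.4f}")
print(f"sqrt(d/m) scale with d = 1:   {np.sqrt(1 / m):.4f}")
```

The estimate lands at the $O(\sqrt{d/m})$ scale predicted by the VC relationship, with $d = 1$ for thresholds.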
```python
import numpy as np
import matplotlib.pyplot as plt
from math import log, sqrt, e

def vc_generalization_bound(vc_dim: int, m: int, delta: float) -> float:
    """
    Compute the VC generalization bound.

    Returns epsilon such that with probability >= 1-delta:
    |L_D(h) - L_S(h)| <= epsilon for all h in H
    """
    if m <= vc_dim:
        return 1.0  # Not enough samples
    # Standard VC bound: sqrt((d*ln(em/d) + ln(4/delta)) / m)
    complexity_term = vc_dim * log(e * m / vc_dim)
    confidence_term = log(4 / delta)
    epsilon = sqrt((complexity_term + confidence_term) / m)
    return min(epsilon, 1.0)

def sample_complexity(vc_dim: int, epsilon: float, delta: float) -> int:
    """
    Compute minimum samples needed to achieve epsilon error with prob 1-delta.
    Uses: m >= (8/eps^2) * (d * ln(16e/eps) + ln(4/delta))
    """
    complexity_term = vc_dim * log(16 * e / epsilon)
    confidence_term = log(4 / delta)
    m = int(8 / epsilon**2 * (complexity_term + confidence_term))
    return m

def plot_generalization_bounds():
    """Visualize how the generalization bound varies with samples and VC dimension."""
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    delta = 0.05

    # Plot 1: Bound vs samples for different VC dimensions
    ax1 = axes[0]
    sample_sizes = np.logspace(2, 6, 100).astype(int)
    for d in [5, 20, 100, 500]:
        bounds = [vc_generalization_bound(d, m, delta) for m in sample_sizes]
        ax1.plot(sample_sizes, bounds, label=f'd = {d}', linewidth=2)
    ax1.set_xscale('log')
    ax1.set_xlabel('Sample Size m', fontsize=12)
    ax1.set_ylabel('Generalization Bound ε', fontsize=12)
    ax1.set_title('VC Bound vs Sample Size', fontsize=14)
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    ax1.set_ylim(0, 1)

    # Plot 2: Sample complexity vs VC dimension for different targets
    ax2 = axes[1]
    vc_dims = np.arange(1, 201)
    for eps in [0.1, 0.05, 0.01]:
        samples = [sample_complexity(d, eps, delta) for d in vc_dims]
        ax2.plot(vc_dims, samples, label=f'ε = {eps}', linewidth=2)
    ax2.set_xlabel('VC Dimension d', fontsize=12)
    ax2.set_ylabel('Sample Complexity m', fontsize=12)
    ax2.set_title('Sample Complexity vs VC Dimension', fontsize=14)
    ax2.legend()
    ax2.grid(True, alpha=0.3)
    ax2.set_yscale('log')

    plt.tight_layout()
    plt.savefig('vc_bounds_visualization.png', dpi=150)
    plt.show()

# Demonstrate bound computation
if __name__ == "__main__":
    print("=" * 60)
    print("VC Generalization Bounds Demonstration")
    print("=" * 60)

    # Example scenarios
    scenarios = [
        {"name": "Linear classifier in R^10", "d": 11, "m": 1000},
        {"name": "Linear classifier in R^100", "d": 101, "m": 10000},
        {"name": "Quadratic classifier in R^10", "d": 66, "m": 5000},
        {"name": "Deep features (d~10000)", "d": 10000, "m": 50000},
    ]
    delta = 0.05
    print(f"Confidence level: {100*(1-delta)}%")
    for scenario in scenarios:
        name, d, m = scenario["name"], scenario["d"], scenario["m"]
        eps = vc_generalization_bound(d, m, delta)
        print(f"{name}:")
        print(f"  VC Dimension: {d}")
        print(f"  Samples: {m:,}")
        print(f"  Bound: Test Error ≤ Train Error + {eps:.3f}")
        print()

    # Sample complexity examples
    print("\n" + "=" * 60)
    print("Sample Complexity Analysis")
    print("=" * 60)
    for eps, d in [(0.1, 10), (0.05, 50), (0.01, 100)]:
        m = sample_complexity(d, eps, delta)
        print(f"To achieve ε={eps} with VC-dim={d}: need m ≥ {m:,} samples")

    plot_generalization_bounds()
```

VC generalization bounds provide the theoretical foundation for understanding when machine learning works.
What's Next:
With generalization bounds established, we turn to Structural Risk Minimization—the principled approach to model selection that balances training error against complexity. SRM operationalizes VC theory into a practical algorithm design principle.
You now understand how VC dimension translates into generalization guarantees. You can derive and interpret VC bounds, compute sample complexity, and appreciate both the power and limitations of distribution-free learning theory. Next, we apply these insights to model selection via Structural Risk Minimization.