In the previous page, we established the PAC learning framework—a rigorous way to formalize generalization guarantees. We defined when a concept class is PAC-learnable: when there exists an algorithm that can learn any concept from the class to accuracy ε with probability 1-δ, given sufficiently many samples.
But 'sufficiently many' is vague. The natural follow-up question is:
Exactly how many samples are sufficient? And how many are necessary?
This is the study of sample complexity: determining m(ε, δ, C)—the number of samples needed to PAC-learn a concept class C to accuracy ε with confidence 1-δ.
Sample complexity is practically crucial. In the real world, data is expensive—it requires collection, labeling, storage, and processing. If we could learn effectively with 1,000 samples instead of 100,000, that's a massive reduction in cost and time. Sample complexity theory tells us what's possible.
By the end of this page, you will understand: (1) how to derive sample complexity bounds for finite hypothesis classes using concentration inequalities, (2) the relationship between hypothesis class size and sample requirements, (3) the distinction between proper and improper learning, and (4) why infinite hypothesis classes require more sophisticated analysis (VC dimension).
To understand sample complexity, we must first analyze the generalization gap—the difference between training error and true error.
Definitions:
True Error (Generalization Error): err_D(h) = Pr_{x~D}[h(x) ≠ c*(x)]
Training Error (Empirical Error): err_S(h) = (1/m) Σ_{i=1}^{m} 𝟙[h(xᵢ) ≠ c*(xᵢ)]
The Key Insight:
Training error is an estimate of true error based on a finite sample. Like any estimate from finite data, it might be off. The fundamental question is: by how much can it be off?
If we could guarantee that |err_D(h) - err_S(h)| ≤ ε/2 for all hypotheses h, then finding h with err_S(h) ≤ ε/2 would guarantee err_D(h) ≤ ε. The generalization gap is the bridge between what we can observe (training error) and what we want to minimize (true error).
The Role of Sample Size:
Intuitively, training error becomes a better estimate of true error as we get more samples. With 10 samples, our estimate is noisy. With 10,000 samples, it's quite precise.
Formally, this is captured by concentration inequalities—mathematical tools that bound how far a sample average can deviate from its expected value. The two most important for PAC learning are:
```python
import numpy as np


def demonstrate_generalization_gap():
    """
    Empirically demonstrate how training error converges to true error
    as sample size increases.
    """
    np.random.seed(42)

    # True concept: AND of first 3 variables (x0 ∧ x1 ∧ x2)
    def true_concept(x):
        return int(x[0] == 1 and x[1] == 1 and x[2] == 1)

    # Hypothesis: approximation (AND of first 2 variables)
    def hypothesis(x):
        return int(x[0] == 1 and x[1] == 1)  # Missing x2

    # True distribution: uniform over {0,1}^4
    def sample_from_D():
        return np.random.randint(0, 2, 4)

    # True error of hypothesis
    # h(x)=1 when x0=x1=1, c*(x)=1 when x0=x1=x2=1
    # Error when h(x) != c*(x):
    # - h=1, c*=0: x0=1, x1=1, x2=0 (prob = 1/8)
    # True error = 1/8 = 0.125
    TRUE_ERROR = 0.125

    # Measure training error for various sample sizes
    sample_sizes = [10, 25, 50, 100, 250, 500, 1000, 2500, 5000]
    n_trials = 100

    results = {m: [] for m in sample_sizes}

    for m in sample_sizes:
        for _ in range(n_trials):
            # Generate training set
            training_errors = 0
            for _ in range(m):
                x = sample_from_D()
                if hypothesis(x) != true_concept(x):
                    training_errors += 1

            training_error = training_errors / m
            gap = abs(training_error - TRUE_ERROR)
            results[m].append(gap)

    # Compute statistics
    mean_gaps = [np.mean(results[m]) for m in sample_sizes]
    std_gaps = [np.std(results[m]) for m in sample_sizes]

    # Hoeffding bound for comparison
    hoeffding_bounds = [np.sqrt(np.log(2) / (2 * m)) for m in sample_sizes]

    print("Generalization Gap Analysis:")
    print("-" * 60)
    for m, mean, std, bound in zip(sample_sizes, mean_gaps, std_gaps,
                                   hoeffding_bounds):
        print(f"m={m:5d}: Mean gap={mean:.4f} ± {std:.4f}, "
              f"Hoeffding bound={bound:.4f}")

    # Key observations:
    # 1. Gap decreases as m increases (as expected)
    # 2. Empirical gap is typically smaller than the Hoeffding bound
    # 3. Gap scales roughly as 1/sqrt(m)

    return results


# Example output:
# m=   10: Mean gap=0.0875 ± 0.0632, Hoeffding bound=0.1863
# m=  100: Mean gap=0.0275 ± 0.0188, Hoeffding bound=0.0589
# m= 1000: Mean gap=0.0092 ± 0.0058, Hoeffding bound=0.0186
# m= 5000: Mean gap=0.0037 ± 0.0024, Hoeffding bound=0.0083
```

Hoeffding's inequality is the workhorse of PAC learning theory. It tells us how tightly a sample average concentrates around its expected value.
Theorem (Hoeffding's Inequality):
Let X₁, X₂, ..., Xₘ be independent random variables with Xᵢ ∈ [0, 1]. Let X̄ = (1/m) Σᵢ Xᵢ be their sample mean and μ = E[X̄] be the true mean. Then for any t > 0:
Pr[|X̄ - μ| > t] ≤ 2·exp(-2mt²)
This bound is remarkable for several reasons: it makes no assumptions about the distribution of the Xᵢ beyond boundedness, the deviation probability decays exponentially in both m and t², and it requires no knowledge of the variance or any other property of the distribution.
For PAC learning, we set Xᵢ = 𝟙[h(xᵢ) ≠ c*(xᵢ)] (1 if misclassified, 0 otherwise). Then X̄ = err_S(h) is the training error, and μ = err_D(h) is the true error. Hoeffding says: Pr[|err_S(h) - err_D(h)| > t] ≤ 2exp(-2mt²). Training error concentrates around true error!
Deriving a Sample Complexity Bound for One Hypothesis:
Suppose we want: Pr[|err_S(h) - err_D(h)| > ε] ≤ δ
By Hoeffding: 2·exp(-2mε²) ≤ δ
Solving for m:
m ≥ (1/(2ε²))·ln(2/δ)
For a single fixed hypothesis h, this many samples suffice to guarantee that training error is within ε of true error with probability ≥ 1-δ.
Example: For ε = 0.05 and δ = 0.01: m ≥ (1/(2·0.0025))·ln(200) ≈ 200·5.3 ≈ 1060 samples
| ε (accuracy) | δ (failure prob) | m = (1/2ε²)ln(2/δ) | Interpretation |
|---|---|---|---|
| 0.10 | 0.10 | ~150 | Rough estimate |
| 0.10 | 0.01 | ~265 | Moderate confidence |
| 0.05 | 0.01 | ~1,060 | Good accuracy, high confidence |
| 0.01 | 0.01 | ~26,500 | Very high accuracy |
| 0.01 | 0.001 | ~38,000 | Very high accuracy & confidence |
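As a quick check, here is a minimal sketch (it assumes nothing beyond the formula above) that evaluates m = (1/(2ε²))·ln(2/δ) for the settings in the table:

```python
from math import ceil, log

def single_hypothesis_bound(epsilon: float, delta: float) -> int:
    """Samples sufficient so that |err_S(h) - err_D(h)| <= epsilon holds
    with probability >= 1 - delta, for ONE fixed hypothesis h (Hoeffding only)."""
    return ceil(log(2 / delta) / (2 * epsilon ** 2))

for eps, delta in [(0.10, 0.10), (0.10, 0.01), (0.05, 0.01),
                   (0.01, 0.01), (0.01, 0.001)]:
    print(f"ε={eps}, δ={delta}: m ≥ {single_hypothesis_bound(eps, delta):,}")
# ε=0.1, δ=0.1: m ≥ 150
# ε=0.1, δ=0.01: m ≥ 265
# ε=0.05, δ=0.01: m ≥ 1,060
# ε=0.01, δ=0.01: m ≥ 26,492
# ε=0.01, δ=0.001: m ≥ 38,005
```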
The Catch: Multiple Hypotheses
The bound above works for a single, fixed hypothesis h. But PAC learning requires finding a good hypothesis from a class H—we don't know which one in advance!
The problem: we want training error to approximate true error simultaneously for all hypotheses in H. If H contains many hypotheses, some might have misleadingly low training error just by chance.
This is where the union bound comes in.
The Union Bound (Boole's Inequality):
For any collection of events A₁, A₂, ..., Aₙ:
Pr[A₁ ∪ A₂ ∪ ... ∪ Aₙ] ≤ Σᵢ Pr[Aᵢ]
The probability that at least one event occurs is at most the sum of the individual probabilities. This worst-case bound holds no matter how the events depend on one another (independent, correlated, or overlapping), and it is tight when the events are disjoint.
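The sketch below makes the "regardless of dependence" point concrete; the three events are invented purely for this demo:

```python
import numpy as np

# Empirical check of the union bound on deliberately correlated events.
rng = np.random.default_rng(0)
n_trials = 100_000
z = rng.normal(size=n_trials)

# Three correlated events defined from the SAME random draw
A1 = z > 1.5          # P ≈ 0.067
A2 = z > 2.0          # P ≈ 0.023 (a subset of A1: maximally correlated)
A3 = np.abs(z) > 2.5  # P ≈ 0.012

union = A1 | A2 | A3
print(f"P[union]     ≈ {union.mean():.4f}")
print(f"sum of P[Ai] ≈ {(A1.mean() + A2.mean() + A3.mean()):.4f}")
# The union's probability never exceeds the sum, whatever the correlation.
```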
Applying to Learning:
Define event Aₕ = 'hypothesis h has training error far from true error'
We want: Pr[∃h ∈ H: |err_S(h) - err_D(h)| > ε] ≤ δ
This is Pr[∪ₕ Aₕ] ≤ Σₕ Pr[Aₕ] ≤ |H| · 2exp(-2mε²)
For this to be ≤ δ, we need: |H| · 2exp(-2mε²) ≤ δ
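Spelling out the algebra (just taking logarithms and rearranging):

```latex
2|H|\,e^{-2m\varepsilon^2} \le \delta
\;\Longleftrightarrow\;
-2m\varepsilon^2 \le \ln\frac{\delta}{2|H|}
\;\Longleftrightarrow\;
m \ge \frac{1}{2\varepsilon^2}\,\ln\frac{2|H|}{\delta}
```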
For a finite hypothesis class H with |H| hypotheses:
m ≥ (1/(2ε²)) · ln(2|H|/δ)
samples suffice to guarantee that with probability ≥ 1-δ, FOR ALL h ∈ H: |err_S(h) - err_D(h)| ≤ ε. This is a fundamental result: sample complexity grows logarithmically with hypothesis class size.
Why Logarithmic Dependence on |H|?
The ln(|H|) factor is crucial—it means hypothesis class size affects sample complexity only logarithmically. This is tremendously good news:
| \|H\| | ln(\|H\|) | Increase |
|-----|---------|----------|
| 10 | 2.3 | baseline |
| 1,000 | 6.9 | 3x |
| 1,000,000 | 13.8 | 6x |
| 10^9 | 20.7 | 9x |
A billion-hypothesis class needs only about 9× more samples than a 10-hypothesis class! This makes learning over large but finite classes tractable.
The Intuition:
Why logarithmic? Because we're using a union bound over 'bad events.' Each hypothesis has exponentially small probability of being misleading (by Hoeffding). Even summing over many hypotheses, the total probability stays bounded—as long as the exponential decay in m beats the linear growth in |H|.
```python
from math import log, ceil


def sample_complexity_finite_class(epsilon: float, delta: float,
                                   hypothesis_class_size: int) -> int:
    """
    Compute sample complexity for PAC learning a finite hypothesis class.

    Args:
        epsilon: Desired accuracy (max generalization error)
        delta: Failure probability
        hypothesis_class_size: |H|, number of hypotheses

    Returns:
        m: Number of samples sufficient for PAC learning

    Derivation:
        From union bound + Hoeffding:
        Pr[∃h: |err_S(h) - err_D(h)| > ε] ≤ 2|H|exp(-2mε²)
        Setting this ≤ δ and solving: m ≥ (1/2ε²) ln(2|H|/δ)
    """
    return ceil((1 / (2 * epsilon**2)) * log(2 * hypothesis_class_size / delta))


def analyze_sample_complexity():
    """Analyze how parameters affect sample complexity."""
    print("Sample Complexity Analysis for Finite Hypothesis Classes")
    print("=" * 70)

    # Fix δ, vary ε
    print("1. Effect of ε (accuracy) [|H|=1000, δ=0.05]:")
    for epsilon in [0.20, 0.10, 0.05, 0.02, 0.01]:
        m = sample_complexity_finite_class(epsilon, 0.05, 1000)
        print(f"   ε={epsilon}: m={m:,}")

    # Fix ε, vary |H|
    print("2. Effect of |H| (hypothesis class size) [ε=0.05, δ=0.05]:")
    for H_size in [10, 100, 1000, 10000, 100000, 1000000]:
        m = sample_complexity_finite_class(0.05, 0.05, H_size)
        print(f"   |H|={H_size:>10,}: m={m:,}")

    # Fix ε, vary δ
    print("3. Effect of δ (failure probability) [|H|=1000, ε=0.05]:")
    for delta in [0.20, 0.10, 0.05, 0.01, 0.001]:
        m = sample_complexity_finite_class(0.05, delta, 1000)
        print(f"   δ={delta}: m={m:,}")

    # Concrete examples
    print("4. Concrete Learning Scenarios:")

    # Boolean conjunctions: |H| = 3^n
    n = 10
    H_conj = 3 ** n  # Each of n variables: included positive, negative, or absent
    m = sample_complexity_finite_class(0.05, 0.05, H_conj)
    print(f"   Conjunctions (n={n}, |H|={H_conj:,}): m={m:,}")

    # Decision stumps: |H| = 2n
    n = 100
    H_stump = 2 * n  # Each of n features, threshold above or below
    m = sample_complexity_finite_class(0.05, 0.05, H_stump)
    print(f"   Decision stumps (n={n}, |H|={H_stump}): m={m:,}")

    # Decision trees of depth d: |H| ≈ (2n)^(2^d)
    n, d = 10, 3
    H_tree = (2 * n) ** (2 ** d)
    m = sample_complexity_finite_class(0.05, 0.05, H_tree)
    print(f"   Decision trees (n={n}, d={d}, |H|≈{H_tree:,}): m={m:,}")


analyze_sample_complexity()

# Sample output (values produced by the formula above):
# 1. Effect of ε (accuracy):
#    ε=0.2: m=133
#    ε=0.1: m=530
#    ε=0.05: m=2,120
#    ε=0.02: m=13,246
#    ε=0.01: m=52,984
#
# 2. Effect of |H| (logarithmic growth!):
#    |H|=        10: m=1,199
#    |H|=       100: m=1,659
#    |H|=     1,000: m=2,120
#    |H|=    10,000: m=2,580
#    |H|=   100,000: m=3,041
#    |H|= 1,000,000: m=3,501
```

The bound we derived applies uniformly to all hypotheses. But in the realizable case (where c* ∈ H), we can do better by exploiting a crucial property: the target concept c* has zero training error!
The Realizable Setting: we assume the target concept c* belongs to H, so on any training set there is at least one hypothesis (c* itself) with zero training error.
Key Insight:
If h is consistent with training data and has high true error (err_D(h) > ε), then h got 'lucky'—it disagreed with c* on some region, but no training examples fell in that region.
The probability of this 'lucky streak' decreases exponentially with m.
A hypothesis h is 'consistent' if err_S(h) = 0—it correctly labels all training examples. In the realizable case, the target c* is always consistent. We analyze: what's the probability that a BAD hypothesis (err_D(h) > ε) is also consistent?
Deriving the Bound:
Fix a 'bad' hypothesis h with err_D(h) > ε. For h to be consistent with S, every one of the m i.i.d. training examples must fall outside the region where h disagrees with c*, a region of probability mass greater than ε. Hence: Pr[h consistent | err_D(h) > ε] < (1 - ε)^m
Using the inequality (1 - ε) ≤ e^{-ε}: Pr[h consistent | err_D(h) > ε] < e^{-εm}
By union bound over all 'bad' hypotheses (at most |H| of them): Pr[∃ bad h that is consistent] < |H| · e^{-εm}
Setting this ≤ δ and solving: m ≥ (1/ε) · ln(|H|/δ)
For a finite hypothesis class H in the realizable setting:
m ≥ (1/ε) · ln(|H|/δ)
samples suffice. Note: this is (1/ε), not (1/ε²)! The realizable case has BETTER sample complexity than the general case. This is a fundamental result in PAC learning theory.
Why the Improvement?
In the realizable case, we don't need to estimate error precisely—we just need to eliminate bad hypotheses. A hypothesis with 10% true error has probability (0.9)^m of surviving m training examples. For m = 100, that's about 0.00003.
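The sketch below (ε, the sample sizes, and the final |H|, δ are illustrative choices) compares the exact survival probability (1-ε)^m with the e^{-εm} bound used above, then evaluates the resulting realizable-case sample complexity:

```python
from math import exp, log, ceil

# How quickly does a 'bad' hypothesis get eliminated?
eps = 0.10
for m in [10, 50, 100, 200]:
    survive = (1 - eps) ** m   # exact probability of staying consistent
    bound = exp(-eps * m)      # the e^{-εm} upper bound
    print(f"m={m:3d}: (1-ε)^m = {survive:.2e}  ≤  e^(-εm) = {bound:.2e}")

# Realizable-case sample complexity: m ≥ (1/ε) ln(|H|/δ)
def realizable_bound(eps: float, delta: float, H_size: int) -> int:
    return ceil((1 / eps) * log(H_size / delta))

print(realizable_bound(0.05, 0.01, 1000))  # 231 samples for |H|=1000, ε=0.05, δ=0.01
```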
The (1/ε) dependence is significant in practice:
| ε | General: O(1/ε²) | Realizable: O(1/ε) | Ratio |
|---|---|---|---|
| 0.10 | 100 | 10 | 10× |
| 0.05 | 400 | 20 | 20× |
| 0.01 | 10,000 | 100 | 100× |
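With concrete constants rather than just the leading 1/ε and 1/ε² factors, the gap looks like this (|H| = 1000 and δ = 0.05 are illustrative choices, not values fixed earlier):

```python
from math import log, ceil

H, delta = 1000, 0.05  # illustrative values

for eps in [0.10, 0.05, 0.01]:
    general = ceil((1 / (2 * eps**2)) * log(2 * H / delta))  # general/agnostic case
    realizable = ceil((1 / eps) * log(H / delta))            # realizable case
    print(f"ε={eps}: general m={general:,}  realizable m={realizable:,}  "
          f"ratio ≈ {general / realizable:.0f}×")
# ε=0.1:  general m=530     realizable m=100  ratio ≈ 5×
# ε=0.05: general m=2,120   realizable m=199  ratio ≈ 11×
# ε=0.01: general m=52,984  realizable m=991  ratio ≈ 53×
```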
Our sample complexity bounds assume we're working with a learner that minimizes training error. This principle is formalized as Empirical Risk Minimization (ERM).
The ERM Principle:
Given training set S and hypothesis class H, the ERM learner outputs:
h_ERM = argmin_{h ∈ H} err_S(h)
ERM is the most natural learning strategy: find the hypothesis that makes the fewest mistakes on observed data.
Why ERM Works (Informal):
If m is large enough that |err_S(h) - err_D(h)| ≤ ε holds simultaneously for every h ∈ H, then the hypothesis with the lowest training error also has nearly the lowest true error: err_D(h_ERM) ≤ err_S(h_ERM) + ε ≤ err_S(h*) + ε ≤ err_D(h*) + 2ε, where h* is the best hypothesis in H.
This argument can be made rigorous using the bounds we've derived.
While ERM is simple to state, it may be computationally hard to implement. Finding arg min over H might require searching over an exponentially large space. The COMPUTATIONAL complexity of ERM—distinct from sample complexity—is a crucial consideration we'll address later.
```python
import numpy as np
from typing import List, Tuple, Set


class ERMLearner:
    """
    Empirical Risk Minimization for monotone conjunctions.

    For monotone conjunctions, ERM is tractable because we can efficiently
    find a consistent hypothesis (if one exists).
    """

    def __init__(self, n: int):
        """
        Initialize ERM learner for n Boolean variables.

        Hypothesis class: All subsets of {0, 1, ..., n-1}
        Each subset S represents the conjunction ∧_{i ∈ S} xᵢ
        """
        self.n = n
        self.hypothesis: Set[int] = set()

    def learn(self, samples: List[Tuple[np.ndarray, int]]) -> Set[int]:
        """
        Find the ERM hypothesis: a consistent hypothesis with minimum training error.

        For conjunctions, the 'tightest' consistent hypothesis is found by:
        - Start with all variables
        - Remove any variable that's 0 in a positive example
        """
        # Start with all variables
        self.hypothesis = set(range(self.n))

        # Remove variables inconsistent with positive examples
        for x, label in samples:
            if label == 1:  # Positive example
                for i in range(self.n):
                    if x[i] == 0 and i in self.hypothesis:
                        self.hypothesis.remove(i)

        # This is the most specific consistent hypothesis:
        # it has err_S(h) = 0 if the data is consistent with some conjunction
        training_error = self.compute_training_error(samples)
        return self.hypothesis

    def compute_training_error(self, samples: List[Tuple[np.ndarray, int]]) -> float:
        """Compute training error of the current hypothesis."""
        errors = sum(1 for x, label in samples if self.predict(x) != label)
        return errors / len(samples) if samples else 0.0

    def predict(self, x: np.ndarray) -> int:
        """Predict using the current hypothesis."""
        return int(all(x[i] == 1 for i in self.hypothesis))


def demonstrate_erm_learning():
    """Demonstrate ERM learning for conjunctions."""
    np.random.seed(42)

    n = 6  # 6 Boolean variables

    # True target: x0 ∧ x2 ∧ x4 (conjunction of even-indexed vars)
    target = {0, 2, 4}

    def true_concept(x):
        return int(all(x[i] == 1 for i in target))

    # Generate training data from the uniform distribution
    def generate_samples(m: int):
        samples = []
        for _ in range(m):
            x = np.random.randint(0, 2, n)
            label = true_concept(x)
            samples.append((x, label))
        return samples

    # Learn with increasing sample sizes
    print("ERM Learning for Conjunctions")
    print("Target: x0 ∧ x2 ∧ x4")
    print("-" * 50)

    for m in [5, 10, 20, 50, 100]:
        samples = generate_samples(m)
        learner = ERMLearner(n)
        hypothesis = learner.learn(samples)

        # Count positive examples (these determine what we can eliminate)
        n_positive = sum(1 for _, label in samples if label == 1)

        # Training error (should be 0 for ERM on realizable data)
        train_err = learner.compute_training_error(samples)

        # Is the hypothesis correct?
        correct = hypothesis == target

        print(f"m={m:3d}: "
              f"Positive examples: {n_positive:2d}, "
              f"Learned: {sorted(hypothesis)}, "
              f"Correct: {correct}")

    # With enough positive examples, ERM converges to the target:
    # variables NOT in the target get eliminated by positive examples,
    # variables IN the target never get eliminated.


demonstrate_erm_learning()

# Sample output:
# m=  5: Positive examples:  0, Learned: [0, 1, 2, 3, 4, 5], Correct: False
# m= 10: Positive examples:  1, Learned: [0, 2, 4], Correct: True
# m= 20: Positive examples:  2, Learned: [0, 2, 4], Correct: True
# m= 50: Positive examples:  4, Learned: [0, 2, 4], Correct: True
# m=100: Positive examples: 11, Learned: [0, 2, 4], Correct: True
```

ERM and PAC Guarantee:
Combining ERM with our sample complexity bounds:
1. In the realizable setting, draw m ≥ (1/ε) · ln(|H|/δ) samples.
2. With probability ≥ 1-δ, every hypothesis with err_D(h) > ε is inconsistent with the sample.
3. ERM returns a hypothesis with zero training error (one always exists, since c* ∈ H is consistent).
4. Therefore err_D(h_ERM) ≤ ε with probability ≥ 1-δ.
This proves that ERM is a valid PAC learning algorithm for finite hypothesis classes in the realizable setting.
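For instance, plugging the monotone-conjunction class from the sketch above into the realizable bound (the ε and δ targets here are illustrative, not taken from the demo):

```python
from math import ceil, log

# Monotone conjunctions over n variables: |H| = 2^n (each variable in or out).
n = 6
H_size = 2 ** n
epsilon, delta = 0.10, 0.05  # illustrative accuracy/confidence targets

m_needed = ceil((1 / epsilon) * log(H_size / delta))
print(f"|H| = {H_size}, m ≥ {m_needed}")  # |H| = 64, m ≥ 72
```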
We've established upper bounds: sufficient conditions for PAC learning. But are these bounds tight? Could a clever algorithm learn with fewer samples?
The answer is: essentially no. The bounds are tight up to constant factors.
Lower Bound Theorem:
There exist hypothesis classes H (of any size |H| ≥ 2) for which any PAC learning algorithm requires:
m ≥ (1/(2ε)) · ln(|H|/2) - O(1)
samples to achieve accuracy ε with constant probability (e.g., success probability ≥ 7/8).
Proof Intuition:
The lower bound uses an adversarial argument:
If there are |H| hypotheses and we need to identify the correct one, we need at least log₂(|H|) bits of information. Each training example provides at most 1 bit (the label). Thus, Ω(log|H|) samples are necessary. The 1/ε factor comes from the need to distinguish hypotheses that disagree on only ε fraction of the domain.
The Tightness Result:
| Bound Type | Sample Complexity |
|---|---|
| Upper bound | O((1/ε) · log(\|H\|/δ)) |
| Lower bound | Ω((1/ε) · log\|H\|) |
| Ratio | O(log(1/δ)) (tight up to this factor) |
In the worst case over hypothesis classes, the (1/ε) and log(|H|) factors are tight: no algorithm can do better on such classes. The only gap is in the dependence on δ, which varies between algorithms.
Implications: no learner can beat the Ω((1/ε) log|H|) barrier on worst-case classes, so reducing data requirements means restricting the hypothesis class (or exploiting structure beyond the worst case), not inventing a cleverer algorithm.
```python
import numpy as np
from math import log2


def information_theoretic_lower_bound():
    """
    Illustrate the information-theoretic lower bound for PAC learning.

    Key insight: To identify the correct hypothesis from |H| options,
    we need log₂(|H|) bits of information. Each labeled example
    provides at most 1 bit.

    Factor of 1/ε comes from: hypotheses might differ on only an ε fraction
    of the domain. To detect differences with probability > 1/2, we need
    ~1/ε samples in that ε-measure region.
    """
    print("Information-Theoretic Lower Bound Analysis")
    print("=" * 60)

    # Scenario: Hidden hypothesis among |H| candidates
    for H_size in [16, 256, 4096, 65536]:
        bits_needed = log2(H_size)  # Each example: ~1 bit of information

        # To achieve ε error, need to distinguish hypotheses
        # differing on an ε fraction of the domain
        for epsilon in [0.10, 0.05, 0.01]:
            # Rough lower bound: (1/ε) * log(|H|)
            lower_bound = (1 / epsilon) * np.log(H_size)
            print(f"|H|={H_size:>6}, ε={epsilon:.2f}: "
                  f"≥{lower_bound:,.0f} samples needed")
        print()

    # Adversarial construction intuition
    print("Adversarial Argument:")
    print("-" * 60)
    print("""
    1. Adversary picks c* uniformly from H after seeing the algorithm
    2. Algorithm must output h ≈ c* with high probability
    3. But the algorithm's choice depends only on m labeled examples
    4. Each example rules out at most half the hypotheses
    5. After m samples, at least |H|/2^m hypotheses remain consistent
    6. For |H|/2^m < 2, need m ≥ log₂(|H|) - 1

    The 1/ε factor: if hypotheses differ by only ε on distribution D,
    the probability of a random sample distinguishing them is only ε.
    Need 1/ε samples to see the distinguishing region.
    """)


information_theoretic_lower_bound()
```

Our results so far characterize sample complexity for finite hypothesis classes: m = O((1/ε) log(|H|/δ)). But many important hypothesis classes are infinite: linear classifiers, axis-aligned rectangles, halfspaces, and neural networks with real-valued parameters, to name the ones in the table below.
For infinite classes, |H| = ∞, and our bound becomes vacuous (infinite samples needed).
The Resolution: VC Dimension
The key insight is that not all hypotheses in an infinite class are distinguishable by finite data. Many hypotheses behave identically on any finite sample, so they don't increase sample complexity.
VC dimension captures this—it measures the 'effective' size of a hypothesis class, even when the class is infinite.
VC dimension d = VC(H) captures the 'complexity' of H: the size of the largest set of points that H can 'shatter' (label in all possible ways). For linear classifiers in ℝⁿ, VC(H) = n + 1. Sample complexity becomes O(d/ε · log(d/ε)), which is finite even for infinite classes!
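As a concrete illustration of shattering, here is a sketch for halfplanes in ℝ². The points are hand-picked and the search is brute force, so finding a separator proves separability, while failing to find one only suggests (does not prove) non-separability:

```python
import itertools
import numpy as np

def separable(points, labels, n_random=20000, seed=0):
    """Search for (w, b) with sign(w·x + b) matching the ±1 labels.
    Deterministic grid first, then random candidates (demo, not a proof)."""
    rng = np.random.default_rng(seed)
    X = np.array(points, dtype=float)
    y = np.array(labels, dtype=float)
    grid = [(w1, w2, b) for w1 in (-1, 0, 1) for w2 in (-1, 0, 1)
            for b in (-0.5, 0.5)]
    candidates = grid + [tuple(v) for v in rng.normal(size=(n_random, 3))]
    for w1, w2, b in candidates:
        margins = y * (X @ np.array([w1, w2]) + b)
        if np.all(margins > 0):
            return True
    return False

# Three non-collinear points: every one of the 2^3 = 8 labelings is realized
pts3 = [(0, 0), (1, 0), (0, 1)]
print(all(separable(pts3, labels)
          for labels in itertools.product([-1, 1], repeat=3)))  # True

# Four points in 'XOR' position: the alternating labeling resists the search
pts4 = [(0, 0), (1, 1), (1, 0), (0, 1)]
print(separable(pts4, [1, 1, -1, -1]))  # False
```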
Why VC Dimension Works:
The key observation: although H may contain infinitely many hypotheses, when restricted to any m points it induces at most |H_m| = O(m^d) distinct labelings, where d = VC(H). This is the Sauer-Shelah lemma.
With this bound, we can apply a union-bound-like argument over the O(m^d) effective hypotheses rather than the infinite H. This yields sample complexity:
m = O((d/ε) log(1/ε) + (1/ε) log(1/δ))
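To see how much the Sauer-Shelah lemma buys us, here is a small sketch (d = 3 is just an illustrative VC dimension) comparing the growth bound Σ_{i≤d} C(m, i) with the 2^m labelings an unrestricted class could produce:

```python
from math import comb

def growth_bound(m: int, d: int) -> int:
    """Sauer-Shelah: a class of VC dimension d induces at most
    sum_{i=0}^{d} C(m, i) = O(m^d) distinct labelings of m points."""
    return sum(comb(m, i) for i in range(d + 1))

d = 3  # illustrative VC dimension
for m in [5, 10, 20, 50, 100]:
    print(f"m={m:3d}: Sauer-Shelah bound = {growth_bound(m, d):,}  vs  2^m = {2**m:,}")

# Output (abridged):
# m=  5: Sauer-Shelah bound = 26       vs  2^m = 32
# m= 10: Sauer-Shelah bound = 176      vs  2^m = 1,024
# m= 20: Sauer-Shelah bound = 1,351    vs  2^m = 1,048,576
# m= 50: Sauer-Shelah bound = 20,876   vs  2^m = 1,125,899,906,842,624
# m=100: Sauer-Shelah bound = 166,751  vs  2^m ≈ 1.27 × 10^30
```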
We'll develop this fully in the VC dimension module.
| Hypothesis Class | \|H\| | VC Dimension | Sample Complexity |
|---|---|---|---|
| Conjunctions (n vars) | 3ⁿ | n | O(n/ε log(n)) |
| Linear classifiers (ℝⁿ) | ∞ | n + 1 | O(n/ε log(1/ε)) |
| Axis-aligned rectangles (ℝ²) | ∞ | 4 | O(1/ε log(1/ε)) |
| Halfspaces (ℝⁿ) | ∞ | n + 1 | O(n/ε log(1/ε)) |
| Neural nets (d params) | ∞ | O(d log d) | O(d/ε log(d)) |
We've established the foundational theory of sample complexity for PAC learning. Let's consolidate the key results:
For PAC learning a finite hypothesis class H to accuracy ε with confidence 1-δ:
Realizable: m ≥ (1/ε) · ln(|H|/δ)
Agnostic (general case): m ≥ (1/(2ε²)) · ln(2|H|/δ)
These bounds are tight up to constant factors.
What's Next:
With sample complexity established, we'll explore PAC-learnability more deeply: what characterizes the concept classes that can be learned efficiently? We'll see that PAC-learnability depends on both sample complexity (how much data) and computational complexity (how much time).
The distinction is profound: some problems are learnable with polynomial samples but require exponential time. Understanding this separation is crucial for theoretical and practical machine learning.
You now understand sample complexity—the quantitative relationship between data requirements and learning guarantees. You can derive sample bounds for finite hypothesis classes and understand why these bounds are tight. Next, we'll explore what makes a concept class PAC-learnable in the first place.