In the previous page, we established the PAC learning framework—a rigorous way to formalize generalization guarantees. We defined when a concept class is PAC-learnable: when there exists an algorithm that can learn any concept from the class to accuracy ε with probability 1-δ, given sufficiently many samples.
But 'sufficiently many' is vague. The natural follow-up question is:
Exactly how many samples are sufficient? And how many are necessary?
This is the study of sample complexity: determining m(ε, δ, C)—the number of samples needed to PAC-learn a concept class C to accuracy ε with confidence 1-δ.
Sample complexity is practically crucial. In the real world, data is expensive—it requires collection, labeling, storage, and processing. If we could learn effectively with 1,000 samples instead of 100,000, that's a massive reduction in cost and time. Sample complexity theory tells us what's possible.
By the end of this page, you will understand: (1) how to derive sample complexity bounds for finite hypothesis classes using concentration inequalities, (2) the relationship between hypothesis class size and sample requirements, (3) the distinction between proper and improper learning, and (4) why infinite hypothesis classes require more sophisticated analysis (VC dimension).
To understand sample complexity, we must first analyze the generalization gap—the difference between training error and true error.
Definitions:
True Error (Generalization Error): err_D(h) = Pr_{x~D}[h(x) ≠ c*(x)]
Training Error (Empirical Error): err_S(h) = (1/m) Σ_{i=1}^{m} 𝟙[h(xᵢ) ≠ c*(xᵢ)]
The Key Insight:
Training error is an estimate of true error based on a finite sample. Like any estimate from finite data, it might be off. The fundamental question is: by how much can it be off?
If we could guarantee that |err_D(h) - err_S(h)| ≤ ε/2 for all hypotheses h, then finding h with err_S(h) ≤ ε/2 would guarantee err_D(h) ≤ ε. The generalization gap is the bridge between what we can observe (training error) and what we want to minimize (true error).
The Role of Sample Size:
Intuitively, training error becomes a better estimate of true error as we get more samples. With 10 samples, our estimate is noisy. With 10,000 samples, it's quite precise.
Formally, this is captured by concentration inequalities—mathematical tools that bound how far a sample average can deviate from its expected value. The two most important for PAC learning are:
```python
import numpy as np


def demonstrate_generalization_gap():
    """
    Empirically demonstrate how training error converges to true error
    as sample size increases.
    """
    np.random.seed(42)

    # True concept: AND of first 3 variables (x0 ∧ x1 ∧ x2)
    def true_concept(x):
        return int(x[0] == 1 and x[1] == 1 and x[2] == 1)

    # Hypothesis: approximation (AND of first 2 variables)
    def hypothesis(x):
        return int(x[0] == 1 and x[1] == 1)  # Missing x2

    # True distribution: uniform over {0,1}^4
    def sample_from_D():
        return np.random.randint(0, 2, 4)

    # True error of hypothesis
    # h(x)=1 when x0=x1=1, c*(x)=1 when x0=x1=x2=1
    # Error when h(x) != c*(x):
    # - h=1, c*=0: x0=1, x1=1, x2=0 (prob = 1/8)
    # True error = 1/8 = 0.125
    TRUE_ERROR = 0.125

    # Measure training error for various sample sizes
    sample_sizes = [10, 25, 50, 100, 250, 500, 1000, 2500, 5000]
    n_trials = 100

    results = {m: [] for m in sample_sizes}

    for m in sample_sizes:
        for _ in range(n_trials):
            # Generate training set
            training_errors = 0
            for _ in range(m):
                x = sample_from_D()
                if hypothesis(x) != true_concept(x):
                    training_errors += 1

            training_error = training_errors / m
            gap = abs(training_error - TRUE_ERROR)
            results[m].append(gap)

    # Compute statistics
    mean_gaps = [np.mean(results[m]) for m in sample_sizes]
    std_gaps = [np.std(results[m]) for m in sample_sizes]

    # Hoeffding bound for comparison
    hoeffding_bounds = [np.sqrt(np.log(2) / (2 * m)) for m in sample_sizes]

    print("Generalization Gap Analysis:")
    print("-" * 60)
    for m, mean, std, bound in zip(sample_sizes, mean_gaps, std_gaps,
                                   hoeffding_bounds):
        print(f"m={m:5d}: Mean gap={mean:.4f} ± {std:.4f}, "
              f"Hoeffding bound={bound:.4f}")

    # Key observations:
    # 1. Gap decreases as m increases (as expected)
    # 2. Empirical gap is typically smaller than the Hoeffding bound
    # 3. Gap scales roughly as 1/sqrt(m)

    return results


# Example output:
# m=   10: Mean gap=0.0875 ± 0.0632, Hoeffding bound=0.1863
# m=  100: Mean gap=0.0275 ± 0.0188, Hoeffding bound=0.0589
# m= 1000: Mean gap=0.0092 ± 0.0058, Hoeffding bound=0.0186
# m= 5000: Mean gap=0.0037 ± 0.0024, Hoeffding bound=0.0083
```

Hoeffding's inequality is the workhorse of PAC learning theory. It tells us how tightly a sample average concentrates around its expected value.
Theorem (Hoeffding's Inequality):
Let X₁, X₂, ..., Xₘ be independent random variables with Xᵢ ∈ [0, 1]. Let X̄ = (1/m) Σᵢ Xᵢ be their sample mean and μ = E[X̄] be the true mean. Then for any t > 0:
Pr[|X̄ - μ| > t] ≤ 2·exp(-2mt²)
This bound is remarkable for several reasons: it makes no assumptions about the distribution of the Xᵢ beyond boundedness, the deviation probability decays exponentially in both m and t², and it requires no knowledge of the variance or any other property of the distribution.
For PAC learning, we set Xᵢ = 𝟙[h(xᵢ) ≠ c*(xᵢ)] (1 if misclassified, 0 otherwise). Then X̄ = err_S(h) is the training error, and μ = err_D(h) is the true error. Hoeffding says: Pr[|err_S(h) - err_D(h)| > t] ≤ 2exp(-2mt²). Training error concentrates around true error!
Deriving a Sample Complexity Bound for One Hypothesis:
Suppose we want: Pr[|err_S(h) - err_D(h)| > ε] ≤ δ
By Hoeffding: 2·exp(-2mε²) ≤ δ
Solving for m:
m ≥ (1/(2ε²))·ln(2/δ)
For a single fixed hypothesis h, this many samples suffice to guarantee that training error is within ε of true error with probability ≥ 1-δ.
Example: For ε = 0.05 and δ = 0.01: m ≥ (1/(2·0.0025))·ln(200) ≈ 200·5.3 ≈ 1060 samples
| ε (accuracy) | δ (failure prob) | m = (1/2ε²)ln(2/δ) | Interpretation |
|---|---|---|---|
| 0.10 | 0.10 | ~150 | Rough estimate |
| 0.10 | 0.01 | ~265 | Moderate confidence |
| 0.05 | 0.01 | ~1,060 | Good accuracy, high confidence |
| 0.01 | 0.01 | ~26,500 | Very high accuracy |
| 0.01 | 0.001 | ~38,000 | Very high accuracy & confidence |
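As a quick check, here is a minimal sketch (it assumes nothing beyond the formula above) that evaluates m = (1/(2ε²))·ln(2/δ) for the settings in the table:

```python
from math import ceil, log

def single_hypothesis_bound(epsilon: float, delta: float) -> int:
    """Samples sufficient so that |err_S(h) - err_D(h)| <= epsilon holds
    with probability >= 1 - delta, for ONE fixed hypothesis h (Hoeffding only)."""
    return ceil(log(2 / delta) / (2 * epsilon ** 2))

for eps, delta in [(0.10, 0.10), (0.10, 0.01), (0.05, 0.01),
                   (0.01, 0.01), (0.01, 0.001)]:
    print(f"ε={eps}, δ={delta}: m ≥ {single_hypothesis_bound(eps, delta):,}")
# ε=0.1, δ=0.1: m ≥ 150
# ε=0.1, δ=0.01: m ≥ 265
# ε=0.05, δ=0.01: m ≥ 1,060
# ε=0.01, δ=0.01: m ≥ 26,492
# ε=0.01, δ=0.001: m ≥ 38,005
```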
The Catch: Multiple Hypotheses
The bound above works for a single, fixed hypothesis h. But PAC learning requires finding a good hypothesis from a class H—we don't know which one in advance!
The problem: we want training error to approximate true error simultaneously for all hypotheses in H. If H contains many hypotheses, some might have misleadingly low training error just by chance.
This is where the union bound comes in.
The Union Bound (Boole's Inequality):
For any collection of events A₁, A₂, ..., Aₙ:
Pr[A₁ ∪ A₂ ∪ ... ∪ Aₙ] ≤ Σᵢ Pr[Aᵢ]
The probability that at least one event occurs is at most the sum of the individual probabilities. This worst-case bound holds no matter how the events depend on one another (independent, correlated, or overlapping), and it is tight when the events are disjoint.
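The sketch below makes the "regardless of dependence" point concrete; the three events are invented purely for this demo:

```python
import numpy as np

# Empirical check of the union bound on deliberately correlated events.
rng = np.random.default_rng(0)
n_trials = 100_000
z = rng.normal(size=n_trials)

# Three correlated events defined from the SAME random draw
A1 = z > 1.5          # P ≈ 0.067
A2 = z > 2.0          # P ≈ 0.023 (a subset of A1: maximally correlated)
A3 = np.abs(z) > 2.5  # P ≈ 0.012

union = A1 | A2 | A3
print(f"P[union]     ≈ {union.mean():.4f}")
print(f"sum of P[Ai] ≈ {(A1.mean() + A2.mean() + A3.mean()):.4f}")
# The union's probability never exceeds the sum, whatever the correlation.
```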
Applying to Learning:
Define event Aₕ = 'hypothesis h has training error far from true error'
We want: Pr[∃h ∈ H: |err_S(h) - err_D(h)| > ε] ≤ δ
This is Pr[∪ₕ Aₕ] ≤ Σₕ Pr[Aₕ] ≤ |H| · 2exp(-2mε²)
For this to be ≤ δ, we need: |H| · 2exp(-2mε²) ≤ δ
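Spelling out the algebra (just taking logarithms and rearranging):

```latex
2|H|\,e^{-2m\varepsilon^2} \le \delta
\;\Longleftrightarrow\;
-2m\varepsilon^2 \le \ln\frac{\delta}{2|H|}
\;\Longleftrightarrow\;
m \ge \frac{1}{2\varepsilon^2}\,\ln\frac{2|H|}{\delta}
```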
For a finite hypothesis class H with |H| hypotheses:
m ≥ (1/(2ε²)) · ln(2|H|/δ)
samples suffice to guarantee that with probability ≥ 1-δ, FOR ALL h ∈ H: |err_S(h) - err_D(h)| ≤ ε. This is a fundamental result: sample complexity grows logarithmically with hypothesis class size.
Why Logarithmic Dependence on |H|?
The ln(|H|) factor is crucial—it means hypothesis class size affects sample complexity only logarithmically. This is tremendously good news:
| \|H\| | ln(\|H\|) | Increase |
|-----|---------|----------|
| 10 | 2.3 | baseline |
| 1,000 | 6.9 | 3x |
| 1,000,000 | 13.8 | 6x |
| 10^9 | 20.7 | 9x |
A billion-hypothesis class needs only about 9× more samples than a 10-hypothesis class! This makes learning over large but finite classes tractable.
The Intuition:
Why logarithmic? Because we're using a union bound over 'bad events.' Each hypothesis has exponentially small probability of being misleading (by Hoeffding). Even summing over many hypotheses, the total probability stays bounded—as long as the exponential decay in m beats the linear growth in |H|.
```python
from math import log, ceil


def sample_complexity_finite_class(epsilon: float, delta: float,
                                   hypothesis_class_size: int) -> int:
    """
    Compute sample complexity for PAC learning a finite hypothesis class.

    Args:
        epsilon: Desired accuracy (max generalization error)
        delta: Failure probability
        hypothesis_class_size: |H|, number of hypotheses

    Returns:
        m: Number of samples sufficient for PAC learning

    Derivation:
        From union bound + Hoeffding:
        Pr[∃h: |err_S(h) - err_D(h)| > ε] ≤ 2|H|exp(-2mε²)
        Setting this ≤ δ and solving: m ≥ (1/2ε²) ln(2|H|/δ)
    """
    return ceil((1 / (2 * epsilon**2)) * log(2 * hypothesis_class_size / delta))


def analyze_sample_complexity():
    """Analyze how parameters affect sample complexity."""
    print("Sample Complexity Analysis for Finite Hypothesis Classes")
    print("=" * 70)

    # Fix δ, vary ε
    print("1. Effect of ε (accuracy) [|H|=1000, δ=0.05]:")
    for epsilon in [0.20, 0.10, 0.05, 0.02, 0.01]:
        m = sample_complexity_finite_class(epsilon, 0.05, 1000)
        print(f"   ε={epsilon}: m={m:,}")

    # Fix ε, vary |H|
    print("2. Effect of |H| (hypothesis class size) [ε=0.05, δ=0.05]:")
    for H_size in [10, 100, 1000, 10000, 100000, 1000000]:
        m = sample_complexity_finite_class(0.05, 0.05, H_size)
        print(f"   |H|={H_size:>10,}: m={m:,}")

    # Fix ε, vary δ
    print("3. Effect of δ (failure probability) [|H|=1000, ε=0.05]:")
    for delta in [0.20, 0.10, 0.05, 0.01, 0.001]:
        m = sample_complexity_finite_class(0.05, delta, 1000)
        print(f"   δ={delta}: m={m:,}")

    # Concrete examples
    print("4. Concrete Learning Scenarios:")

    # Boolean conjunctions: |H| = 3^n
    n = 10
    H_conj = 3 ** n  # Each of n variables: included positive, negative, or absent
    m = sample_complexity_finite_class(0.05, 0.05, H_conj)
    print(f"   Conjunctions (n={n}, |H|={H_conj:,}): m={m:,}")

    # Decision stumps: |H| = 2n
    n = 100
    H_stump = 2 * n  # Each of n features, threshold above or below
    m = sample_complexity_finite_class(0.05, 0.05, H_stump)
    print(f"   Decision stumps (n={n}, |H|={H_stump}): m={m:,}")

    # Decision trees of depth d: |H| ≈ (2n)^(2^d)
    n, d = 10, 3
    H_tree = (2 * n) ** (2 ** d)
    m = sample_complexity_finite_class(0.05, 0.05, H_tree)
    print(f"   Decision trees (n={n}, d={d}, |H|≈{H_tree:,}): m={m:,}")


analyze_sample_complexity()

# Sample output (values produced by the formula above):
# 1. Effect of ε (accuracy):
#    ε=0.2: m=133
#    ε=0.1: m=530
#    ε=0.05: m=2,120
#    ε=0.02: m=13,246
#    ε=0.01: m=52,984
#
# 2. Effect of |H| (logarithmic growth!):
#    |H|=        10: m=1,199
#    |H|=       100: m=1,659
#    |H|=     1,000: m=2,120
#    |H|=    10,000: m=2,580
#    |H|=   100,000: m=3,041
#    |H|= 1,000,000: m=3,501
```

The bound we derived applies uniformly to all hypotheses. But in the realizable case (where c* ∈ H), we can do better by exploiting a crucial property: the target concept c* has zero training error!
The Realizable Setting: we assume the target concept c* belongs to H, so on any training set there is at least one hypothesis (c* itself) with zero training error.
Key Insight:
If h is consistent with training data and has high true error (err_D(h) > ε), then h got 'lucky'—it disagreed with c* on some region, but no training examples fell in that region.
The probability of this 'lucky streak' decreases exponentially with m.
A hypothesis h is 'consistent' if err_S(h) = 0—it correctly labels all training examples. In the realizable case, the target c* is always consistent. We analyze: what's the probability that a BAD hypothesis (err_D(h) > ε) is also consistent?
Deriving the Bound:
Fix a 'bad' hypothesis h with err_D(h) > ε. For h to be consistent with S, every one of the m i.i.d. training examples must fall outside the region where h disagrees with c*, a region of probability mass greater than ε. Hence: Pr[h consistent | err_D(h) > ε] < (1 - ε)^m
Using the inequality (1 - ε) ≤ e^{-ε}: Pr[h consistent | err_D(h) > ε] < e^{-εm}
By union bound over all 'bad' hypotheses (at most |H| of them): Pr[∃ bad h that is consistent] < |H| · e^{-εm}
Setting this ≤ δ and solving: m ≥ (1/ε) · ln(|H|/δ)
For a finite hypothesis class H in the realizable setting:
m ≥ (1/ε) · ln(|H|/δ)
samples suffice. Note: this is (1/ε), not (1/ε²)! The realizable case has BETTER sample complexity than the general case. This is a fundamental result in PAC learning theory.
Why the Improvement?
In the realizable case, we don't need to estimate error precisely—we just need to eliminate bad hypotheses. A hypothesis with 10% true error has probability (0.9)^m of surviving m training examples. For m = 100, that's about 0.00003.
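The sketch below (ε, the sample sizes, and the final |H|, δ are illustrative choices) compares the exact survival probability (1-ε)^m with the e^{-εm} bound used above, then evaluates the resulting realizable-case sample complexity:

```python
from math import exp, log, ceil

# How quickly does a 'bad' hypothesis get eliminated?
eps = 0.10
for m in [10, 50, 100, 200]:
    survive = (1 - eps) ** m   # exact probability of staying consistent
    bound = exp(-eps * m)      # the e^{-εm} upper bound
    print(f"m={m:3d}: (1-ε)^m = {survive:.2e}  ≤  e^(-εm) = {bound:.2e}")

# Realizable-case sample complexity: m ≥ (1/ε) ln(|H|/δ)
def realizable_bound(eps: float, delta: float, H_size: int) -> int:
    return ceil((1 / eps) * log(H_size / delta))

print(realizable_bound(0.05, 0.01, 1000))  # 231 samples for |H|=1000, ε=0.05, δ=0.01
```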
The (1/ε) dependence is significant in practice:
| ε | General: O(1/ε²) | Realizable: O(1/ε) | Ratio |
|---|---|---|---|
| 0.10 | 100 | 10 | 10× |
| 0.05 | 400 | 20 | 20× |
| 0.01 | 10,000 | 100 | 100× |
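With concrete constants rather than just the leading 1/ε and 1/ε² factors, the gap looks like this (|H| = 1000 and δ = 0.05 are illustrative choices, not values fixed earlier):

```python
from math import log, ceil

H, delta = 1000, 0.05  # illustrative values

for eps in [0.10, 0.05, 0.01]:
    general = ceil((1 / (2 * eps**2)) * log(2 * H / delta))  # general/agnostic case
    realizable = ceil((1 / eps) * log(H / delta))            # realizable case
    print(f"ε={eps}: general m={general:,}  realizable m={realizable:,}  "
          f"ratio ≈ {general / realizable:.0f}×")
# ε=0.1:  general m=530     realizable m=100  ratio ≈ 5×
# ε=0.05: general m=2,120   realizable m=199  ratio ≈ 11×
# ε=0.01: general m=52,984  realizable m=991  ratio ≈ 53×
```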
Our sample complexity bounds assume we're working with a learner that minimizes training error. This principle is formalized as Empirical Risk Minimization (ERM).
The ERM Principle:
Given training set S and hypothesis class H, the ERM learner outputs:
h_ERM = argmin_{h ∈ H} err_S(h)
ERM is the most natural learning strategy: find the hypothesis that makes the fewest mistakes on observed data.
Why ERM Works (Informal):
If m is large enough that |err_S(h) - err_D(h)| ≤ ε holds simultaneously for every h ∈ H, then the hypothesis with the lowest training error also has nearly the lowest true error: err_D(h_ERM) ≤ err_S(h_ERM) + ε ≤ err_S(h*) + ε ≤ err_D(h*) + 2ε, where h* is the best hypothesis in H.
This argument can be made rigorous using the bounds we've derived.
While ERM is simple to state, it may be computationally hard to implement. Finding arg min over H might require searching over an exponentially large space. The COMPUTATIONAL complexity of ERM—distinct from sample complexity—is a crucial consideration we'll address later.
```python
import numpy as np
from typing import List, Tuple, Set


class ERMLearner:
    """
    Empirical Risk Minimization for monotone conjunctions.

    For monotone conjunctions, ERM is tractable because we can efficiently
    find a consistent hypothesis (if one exists).
    """

    def __init__(self, n: int):
        """
        Initialize ERM learner for n Boolean variables.

        Hypothesis class: All subsets of {0, 1, ..., n-1}
        Each subset S represents the conjunction ∧_{i ∈ S} xᵢ
        """
        self.n = n
        self.hypothesis: Set[int] = set()

    def learn(self, samples: List[Tuple[np.ndarray, int]]) -> Set[int]:
        """
        Find the ERM hypothesis: a consistent hypothesis with minimum training error.

        For conjunctions, the 'tightest' consistent hypothesis is found by:
        - Start with all variables
        - Remove any variable that's 0 in a positive example
        """
        # Start with all variables
        self.hypothesis = set(range(self.n))

        # Remove variables inconsistent with positive examples
        for x, label in samples:
            if label == 1:  # Positive example
                for i in range(self.n):
                    if x[i] == 0 and i in self.hypothesis:
                        self.hypothesis.remove(i)

        # This is the most specific consistent hypothesis:
        # it has err_S(h) = 0 if the data is consistent with some conjunction
        training_error = self.compute_training_error(samples)
        return self.hypothesis

    def compute_training_error(self, samples: List[Tuple[np.ndarray, int]]) -> float:
        """Compute training error of the current hypothesis."""
        errors = sum(1 for x, label in samples if self.predict(x) != label)
        return errors / len(samples) if samples else 0.0

    def predict(self, x: np.ndarray) -> int:
        """Predict using the current hypothesis."""
        return int(all(x[i] == 1 for i in self.hypothesis))


def demonstrate_erm_learning():
    """Demonstrate ERM learning for conjunctions."""
    np.random.seed(42)

    n = 6  # 6 Boolean variables

    # True target: x0 ∧ x2 ∧ x4 (conjunction of even-indexed vars)
    target = {0, 2, 4}

    def true_concept(x):
        return int(all(x[i] == 1 for i in target))

    # Generate training data from the uniform distribution
    def generate_samples(m: int):
        samples = []
        for _ in range(m):
            x = np.random.randint(0, 2, n)
            label = true_concept(x)
            samples.append((x, label))
        return samples

    # Learn with increasing sample sizes
    print("ERM Learning for Conjunctions")
    print("Target: x0 ∧ x2 ∧ x4")
    print("-" * 50)

    for m in [5, 10, 20, 50, 100]:
        samples = generate_samples(m)
        learner = ERMLearner(n)
        hypothesis = learner.learn(samples)

        # Count positive examples (these determine what we can eliminate)
        n_positive = sum(1 for _, label in samples if label == 1)

        # Training error (should be 0 for ERM on realizable data)
        train_err = learner.compute_training_error(samples)

        # Is the hypothesis correct?
        correct = hypothesis == target

        print(f"m={m:3d}: "
              f"Positive examples: {n_positive:2d}, "
              f"Learned: {sorted(hypothesis)}, "
              f"Correct: {correct}")

    # With enough positive examples, ERM converges to the target:
    # variables NOT in the target get eliminated by positive examples,
    # variables IN the target never get eliminated.


demonstrate_erm_learning()

# Sample output:
# m=  5: Positive examples:  0, Learned: [0, 1, 2, 3, 4, 5], Correct: False
# m= 10: Positive examples:  1, Learned: [0, 2, 4], Correct: True
# m= 20: Positive examples:  2, Learned: [0, 2, 4], Correct: True
# m= 50: Positive examples:  4, Learned: [0, 2, 4], Correct: True
# m=100: Positive examples: 11, Learned: [0, 2, 4], Correct: True
```

ERM and PAC Guarantee:
Combining ERM with our sample complexity bounds:
1. In the realizable setting, draw m ≥ (1/ε) · ln(|H|/δ) samples.
2. With probability ≥ 1-δ, every hypothesis with err_D(h) > ε is inconsistent with the sample.
3. ERM returns a hypothesis with zero training error (one always exists, since c* ∈ H is consistent).
4. Therefore err_D(h_ERM) ≤ ε with probability ≥ 1-δ.
This proves that ERM is a valid PAC learning algorithm for finite hypothesis classes in the realizable setting.
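For instance, plugging the monotone-conjunction class from the sketch above into the realizable bound (the ε and δ targets here are illustrative, not taken from the demo):

```python
from math import ceil, log

# Monotone conjunctions over n variables: |H| = 2^n (each variable in or out).
n = 6
H_size = 2 ** n
epsilon, delta = 0.10, 0.05  # illustrative accuracy/confidence targets

m_needed = ceil((1 / epsilon) * log(H_size / delta))
print(f"|H| = {H_size}, m ≥ {m_needed}")  # |H| = 64, m ≥ 72
```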
We've established upper bounds: sufficient conditions for PAC learning. But are these bounds tight? Could a clever algorithm learn with fewer samples?
The answer is: essentially no. The bounds are tight up to constant factors.
Lower Bound Theorem:
There exist hypothesis classes H (of any size |H| ≥ 2) for which any PAC learning algorithm requires:
m ≥ (1/(2ε)) · ln(|H|/2) - O(1)
samples to achieve accuracy ε with constant probability (e.g., success probability ≥ 7/8).
Proof Intuition:
The lower bound uses an adversarial argument:
If there are |H| hypotheses and we need to identify the correct one, we need at least log₂(|H|) bits of information. Each training example provides at most 1 bit (the label). Thus, Ω(log|H|) samples are necessary. The 1/ε factor comes from the need to distinguish hypotheses that disagree on only ε fraction of the domain.
The Tightness Result:
| Bound Type | Sample Complexity |
|---|---|
| Upper bound | O((1/ε) · log(\|H\|/δ)) |
| Lower bound | Ω((1/ε) · log\|H\|) |
| Ratio | O(log(1/δ)) (tight up to this factor) |
In the worst case over hypothesis classes, the (1/ε) and log(|H|) factors are tight: no algorithm can do better on such classes. The only gap is in the dependence on δ, which varies between algorithms.
Implications: no learner can beat the Ω((1/ε) log|H|) barrier on worst-case classes, so reducing data requirements means restricting the hypothesis class (or exploiting structure beyond the worst case), not inventing a cleverer algorithm.
```python
import numpy as np
from math import log2


def information_theoretic_lower_bound():
    """
    Illustrate the information-theoretic lower bound for PAC learning.

    Key insight: To identify the correct hypothesis from |H| options,
    we need log₂(|H|) bits of information. Each labeled example
    provides at most 1 bit.

    Factor of 1/ε comes from: hypotheses might differ on only an ε fraction
    of the domain. To detect differences with probability > 1/2, we need
    ~1/ε samples in that ε-measure region.
    """
    print("Information-Theoretic Lower Bound Analysis")
    print("=" * 60)

    # Scenario: Hidden hypothesis among |H| candidates
    for H_size in [16, 256, 4096, 65536]:
        bits_needed = log2(H_size)  # Each example: ~1 bit of information

        # To achieve ε error, need to distinguish hypotheses
        # differing on an ε fraction of the domain
        for epsilon in [0.10, 0.05, 0.01]:
            # Rough lower bound: (1/ε) * log(|H|)
            lower_bound = (1 / epsilon) * np.log(H_size)
            print(f"|H|={H_size:>6}, ε={epsilon:.2f}: "
                  f"≥{lower_bound:,.0f} samples needed")
        print()

    # Adversarial construction intuition
    print("Adversarial Argument:")
    print("-" * 60)
    print("""
    1. Adversary picks c* uniformly from H after seeing the algorithm
    2. Algorithm must output h ≈ c* with high probability
    3. But the algorithm's choice depends only on m labeled examples
    4. Each example rules out at most half the hypotheses
    5. After m samples, at least |H|/2^m hypotheses remain consistent
    6. For |H|/2^m < 2, need m ≥ log₂(|H|) - 1

    The 1/ε factor: if hypotheses differ by only ε on distribution D,
    the probability of a random sample distinguishing them is only ε.
    Need 1/ε samples to see the distinguishing region.
    """)


information_theoretic_lower_bound()
```

Our results so far characterize sample complexity for finite hypothesis classes: m = O((1/ε) log(|H|/δ)). But many important hypothesis classes are infinite: linear classifiers, axis-aligned rectangles, halfspaces, and neural networks with real-valued parameters, to name the ones in the table below.
For infinite classes, |H| = ∞, and our bound becomes vacuous (infinite samples needed).
The Resolution: VC Dimension
The key insight is that not all hypotheses in an infinite class are distinguishable by finite data. Many hypotheses behave identically on any finite sample, so they don't increase sample complexity.
VC dimension captures this—it measures the 'effective' size of a hypothesis class, even when the class is infinite.
VC dimension d = VC(H) captures the 'complexity' of H: the size of the largest set of points that H can 'shatter' (label in all possible ways). For linear classifiers in ℝⁿ, VC(H) = n + 1. Sample complexity becomes O(d/ε · log(d/ε)), which is finite even for infinite classes!
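As a concrete illustration of shattering, here is a sketch for halfplanes in ℝ². The points are hand-picked and the search is brute force, so finding a separator proves separability, while failing to find one only suggests (does not prove) non-separability:

```python
import itertools
import numpy as np

def separable(points, labels, n_random=20000, seed=0):
    """Search for (w, b) with sign(w·x + b) matching the ±1 labels.
    Deterministic grid first, then random candidates (demo, not a proof)."""
    rng = np.random.default_rng(seed)
    X = np.array(points, dtype=float)
    y = np.array(labels, dtype=float)
    grid = [(w1, w2, b) for w1 in (-1, 0, 1) for w2 in (-1, 0, 1)
            for b in (-0.5, 0.5)]
    candidates = grid + [tuple(v) for v in rng.normal(size=(n_random, 3))]
    for w1, w2, b in candidates:
        margins = y * (X @ np.array([w1, w2]) + b)
        if np.all(margins > 0):
            return True
    return False

# Three non-collinear points: every one of the 2^3 = 8 labelings is realized
pts3 = [(0, 0), (1, 0), (0, 1)]
print(all(separable(pts3, labels)
          for labels in itertools.product([-1, 1], repeat=3)))  # True

# Four points in 'XOR' position: the alternating labeling resists the search
pts4 = [(0, 0), (1, 1), (1, 0), (0, 1)]
print(separable(pts4, [1, 1, -1, -1]))  # False
```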
Why VC Dimension Works:
The key observation: although H may contain infinitely many hypotheses, when restricted to any m points it induces at most |H_m| = O(m^d) distinct labelings, where d = VC(H). This is the Sauer-Shelah lemma.
With this bound, we can apply a union-bound-like argument over the O(m^d) effective hypotheses rather than the infinite H. This yields sample complexity:
m = O((d/ε) log(1/ε) + (1/ε) log(1/δ))
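To see how much the Sauer-Shelah lemma buys us, here is a small sketch (d = 3 is just an illustrative VC dimension) comparing the growth bound Σ_{i≤d} C(m, i) with the 2^m labelings an unrestricted class could produce:

```python
from math import comb

def growth_bound(m: int, d: int) -> int:
    """Sauer-Shelah: a class of VC dimension d induces at most
    sum_{i=0}^{d} C(m, i) = O(m^d) distinct labelings of m points."""
    return sum(comb(m, i) for i in range(d + 1))

d = 3  # illustrative VC dimension
for m in [5, 10, 20, 50, 100]:
    print(f"m={m:3d}: Sauer-Shelah bound = {growth_bound(m, d):,}  vs  2^m = {2**m:,}")

# Output (abridged):
# m=  5: Sauer-Shelah bound = 26       vs  2^m = 32
# m= 10: Sauer-Shelah bound = 176      vs  2^m = 1,024
# m= 20: Sauer-Shelah bound = 1,351    vs  2^m = 1,048,576
# m= 50: Sauer-Shelah bound = 20,876   vs  2^m = 1,125,899,906,842,624
# m=100: Sauer-Shelah bound = 166,751  vs  2^m ≈ 1.27 × 10^30
```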
We'll develop this fully in the VC dimension module.
| Hypothesis Class | \|H\| | VC Dimension | Sample Complexity |
|---|---|---|---|
| Conjunctions (n vars) | 3ⁿ | n | O(n/ε log(n)) |
| Linear classifiers (ℝⁿ) | ∞ | n + 1 | O(n/ε log(1/ε)) |
| Axis-aligned rectangles (ℝ²) | ∞ | 4 | O(1/ε log(1/ε)) |
| Halfspaces (ℝⁿ) | ∞ | n + 1 | O(n/ε log(1/ε)) |
| Neural nets (d params) | ∞ | O(d log d) | O(d/ε log(d)) |
We've established the foundational theory of sample complexity for PAC learning. Let's consolidate the key results:
For PAC learning a finite hypothesis class H to accuracy ε with confidence 1-δ:
Realizable: m ≥ (1/ε) · ln(|H|/δ)
Agnostic (general case): m ≥ (1/(2ε²)) · ln(2|H|/δ)
These bounds are tight up to constant factors.
What's Next:
With sample complexity established, we'll explore PAC-learnability more deeply: what characterizes the concept classes that can be learned efficiently? We'll see that PAC-learnability depends on both sample complexity (how much data) and computational complexity (how much time).
The distinction is profound: some problems are learnable with polynomial samples but require exponential time. Understanding this separation is crucial for theoretical and practical machine learning.
You now understand sample complexity—the quantitative relationship between data requirements and learning guarantees. You can derive sample bounds for finite hypothesis classes and understand why these bounds are tight. Next, we'll explore what makes a concept class PAC-learnable in the first place.