We've established the PAC framework and derived sample complexity bounds. But a deeper question remains: which concept classes are PAC-learnable in the first place?
This isn't a trivial question. Not every learning problem admits a solution—some concept classes are provably not learnable, no matter how much data we collect or how clever our algorithm.
Understanding PAC-learnability helps us distinguish problems that admit efficient learning algorithms from problems that are fundamentally intractable, and tells us where the boundary lies.
By the end of this page, you will understand: (1) the formal definition of PAC-learnability, (2) the difference between proper and improper learning, (3) examples of learnable concept classes (conjunctions, linear separators), (4) the crucial connection between sample complexity and learnability, and (5) why some concept classes are not efficiently learnable.
Let's revisit and refine the formal definition of PAC-learnability, paying careful attention to all components.
Setup: an instance space X, a concept class C of functions c: X → {0,1}, a hypothesis class H, an unknown distribution D over X, and an unknown target concept c* ∈ C.
A concept class C is PAC-learnable by hypothesis class H if there exists an algorithm A and a polynomial sample-size function m₀(·, ·, ·) such that for every ε, δ ∈ (0, 1), every distribution D over X, and every target c* ∈ C:
When A is given m ≥ m₀(1/ε, 1/δ, n) i.i.d. samples from D labeled by c*, it outputs h ∈ H satisfying:
Pr[err_D(h) ≤ ε] ≥ 1 - δ
If additionally A runs in time polynomial in m, 1/ε, 1/δ, and n, we say C is efficiently PAC-learnable.
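As a concrete reading of the definition's quantifiers, here is a minimal sketch (using the finite-class bound m ≥ (1/ε)(ln|C| + ln(1/δ)) from the previously derived sample complexity bounds; the function name is illustrative) that returns a sufficient m₀ for a finite concept class.

```python
from math import ceil, log

def m0_finite_class(class_size: int, epsilon: float, delta: float) -> int:
    """A sufficient m0 for a finite concept class in the realizable setting:
    m >= (1/eps) * (ln|C| + ln(1/delta))."""
    return ceil((log(class_size) + log(1 / delta)) / epsilon)

# Example: conjunctions over n = 20 Boolean variables, |C| = 3**20
print(m0_finite_class(3 ** 20, epsilon=0.1, delta=0.05))
```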
Key Nuances:
1. The Role of n (Representation Size):
For most concept classes, complexity grows with a parameter n: the number of Boolean variables for conjunctions, or the dimension of the input space for halfspaces in ℝⁿ.
PAC-learnability requires m₀ to be polynomial in n. Exponential dependence on n would make learning intractable for high-dimensional problems.
2. Proper vs. Improper Learning:
In proper learning, the output hypothesis must itself belong to C (H = C); in improper learning, the learner may output a hypothesis from a strictly larger class H ⊇ C. Improper learning can sometimes be easier! The learner has more flexibility in representing the learned concept.
3. Distribution-Free Requirement:
The definition requires learning to work for every distribution D—we cannot assume D is uniform, Gaussian, or any particular form. This distribution-free property makes PAC guarantees robust but also makes some problems harder.
4. Realizability Assumption:
The standard PAC definition assumes c* ∈ C (the target is in the concept class). This is known as the realizable setting. The agnostic setting drops this assumption, allowing the target to be any function—we then compete with the best hypothesis in H.
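In symbols, the two success criteria differ only in the benchmark the learner must meet (a restatement of the definitions above in display form):

```latex
\text{Realizable: } \Pr\big[\operatorname{err}_D(h) \le \epsilon\big] \ge 1-\delta
\qquad\qquad
\text{Agnostic: } \Pr\Big[\operatorname{err}_D(h) \le \min_{h' \in H}\operatorname{err}_D(h') + \epsilon\Big] \ge 1-\delta
```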
A fundamental result in learning theory characterizes exactly when PAC learning is possible:
Fundamental Theorem of PAC Learning:
A concept class C is PAC-learnable if and only if C has finite VC dimension.
Furthermore, for a class with VC dimension d, roughly O((d log(1/ε) + log(1/δ))/ε) samples suffice in the realizable case and O((d + log(1/δ))/ε²) in the agnostic case, with matching lower bounds up to logarithmic factors.
We'll explore VC dimension in detail in the next module. For now, the key insight is that learnability is tied to a combinatorial measure of complexity.
VC dimension measures the 'richness' of a hypothesis class—specifically, the size of the largest set of points that can be labeled in all possible ways by hypotheses in the class. A class with VC dimension d requires Ω(d/ε) samples to learn, and O((d log(1/ε) + log(1/δ))/ε) samples suffice in the realizable case. Infinite VC dimension implies the class is not PAC-learnable.
Why Finite VC Dimension Implies Learnability:
The argument proceeds in two steps:
Uniform Convergence: If VC(C) = d is finite, then with m = O((d + log(1/δ))/ε²) samples, with probability at least 1 − δ the training error of every hypothesis in C is within ε of its true error. This uses the Sauer-Shelah lemma to bound the effective number of behaviors of C on a finite sample.
ERM Works: With uniform convergence, ERM finds a hypothesis with low true error. Any consistent hypothesis (in the realizable case) has low true error.
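As a small illustration of step 1 (an illustrative Monte-Carlo setup, not from the source): for threshold functions h_t(x) = 1[x ≥ t] on [0, 1], a class of VC dimension 1, the largest gap between training and true error shrinks as the sample grows.

```python
import numpy as np

def uniform_convergence_demo(m_values=(20, 100, 500, 2500), trials=200, seed=0):
    """Estimate E[ sup_t |empirical error(h_t) - true error(h_t)| ] for the
    threshold class on [0, 1] with target t* = 0.5 and D uniform."""
    rng = np.random.default_rng(seed)
    thresholds = np.linspace(0.0, 1.0, 201)
    true_err = np.abs(thresholds - 0.5)          # err_D(h_t) = |t - 0.5| under uniform D
    for m in m_values:
        gaps = []
        for _ in range(trials):
            x = rng.uniform(0.0, 1.0, m)
            labels = (x >= 0.5).astype(int)      # labels from the target h_{0.5}
            preds = (x[None, :] >= thresholds[:, None]).astype(int)
            emp_err = (preds != labels[None, :]).mean(axis=1)
            gaps.append(np.max(np.abs(emp_err - true_err)))
        print(f"m = {m:5d}: average sup-gap ≈ {np.mean(gaps):.3f}")

uniform_convergence_demo()
```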
Why Infinite VC Dimension Implies Non-Learnability:
If VC(C) = ∞, then for every sample size m there exist a set of 2m points shattered by C and a distribution D concentrated on those points.
No finite sample can distinguish between all concepts—an adversary can always pick a c consistent with training data but different on test data.
```python
import numpy as np
from typing import Set, List, Tuple


def illustrate_shattering():
    """
    Illustrate the concept of shattering, which underlies VC dimension.

    A set S is 'shattered' by class H if every labeling of S is
    achievable by some hypothesis in H.

    VC dimension = size of largest shattered set
    """
    print("Shattering Illustration")
    print("=" * 60)

    # Example 1: Linear separators in R^2
    # Can shatter any 3 non-collinear points
    # Cannot shatter any 4 points (one configuration fails)
    # VC dimension = 3
    print("1. Linear Separators in R^2:")
    print("   - 3 non-collinear points: ALL 8 labelings achievable ✓")
    print("   - 4 points (any config): Some labeling impossible ✗")
    print("   - VC dimension = 3")

    # Example 2: Axis-aligned rectangles in R^2
    # Can shatter 4 points (corners of a rectangle)
    # Cannot shatter 5 points
    # VC dimension = 4
    print("2. Axis-Aligned Rectangles in R^2:")
    print("   - 4 points (rectangle corners): ALL 16 labelings ✓")
    print("   - 5 points: Configuration with center fails ✗")
    print("   - VC dimension = 4")

    # Example 3: Intervals on R
    # Can shatter 2 points (any pair)
    # Cannot shatter 3 points (endpoint labeling fails)
    # VC dimension = 2
    print("3. Intervals on R:")
    print("   - 2 points: ALL 4 labelings (empty, left, right, both) ✓")
    print("   - 3 points: Labeling (1, 0, 1) impossible ✗")
    print("   - VC dimension = 2")

    # Example 4: All functions X → {0,1}
    # Can shatter ANY finite set
    # VC dimension = infinity
    # NOT PAC-learnable!
    print("4. All Functions X → {0,1}:")
    print("   - Can shatter ANY finite set")
    print("   - VC dimension = ∞")
    print("   - NOT PAC-learnable!")

    print("\n" + "=" * 60)
    print("Key Insight: Finite VC dimension ⟺ PAC-learnable")


def sample_complexity_from_vc(d: int, epsilon: float, delta: float) -> Tuple[int, int]:
    """
    Compute sample complexity bounds from VC dimension.

    Fundamental theorem: m = O((d + log(1/δ)) / ε)

    More precise bound (realizable): m = O((d log(1/ε) + log(1/δ)) / ε)

    Agnostic bound: m = O((d + log(1/δ)) / ε²)
    """
    from math import log, ceil

    # Realizable case bound
    realizable = ceil((d * log(1/epsilon) + log(1/delta)) / epsilon)

    # Agnostic case bound
    agnostic = ceil((d + log(1/delta)) / (epsilon ** 2))

    return realizable, agnostic


def compare_complexity_bounds():
    """Compare sample complexity for different hypothesis classes."""
    print("Sample Complexity Comparison (ε=0.05, δ=0.05)")
    print("-" * 60)

    cases = [
        ("Intervals on R", 2),
        ("Rectangles in R²", 4),
        ("Linear separators R²", 3),
        ("Linear separators R¹⁰", 11),
        ("Linear separators R¹⁰⁰", 101),
        ("Decision stumps (100 features)", 2),
    ]

    epsilon, delta = 0.05, 0.05
    for name, vc_dim in cases:
        real, agn = sample_complexity_from_vc(vc_dim, epsilon, delta)
        print(f"{name:30} (d={vc_dim:3}): "
              f"Realizable={real:>6,}, Agnostic={agn:>7,}")


illustrate_shattering()
compare_complexity_bounds()
```

Let's examine several concept classes and prove their PAC-learnability.
Example 1: Conjunctions over n Boolean Variables
Concept class: C = {conjunctions of literals over x₁, ..., xₙ}
Examples: x₁ ∧ x₃, x₁ ∧ ¬x₂ ∧ x₅, etc.
Class size: |C| = 3ⁿ (each variable: positive, negative, or absent)
Theorem: Conjunctions are PAC-learnable with sample complexity O((n/ε) log(n/δ)).
Initialize h = x₁ ∧ ¬x₁ ∧ x₂ ∧ ¬x₂ ∧ ... ∧ xₙ ∧ ¬xₙ (all literals). For each positive example x: remove any literal ℓ where ℓ(x) = 0. Output h = remaining literals. This runs in O(mn) time and finds the unique most-specific consistent conjunction.
Proof of Learnability:
Statistical Efficiency: The output is always a conjunction consistent with the sample (the most specific one), and |C| = 3ⁿ, so the finite-class bound m ≥ (1/ε)(ln|C| + ln(1/δ)) = O((n + log(1/δ))/ε) samples guarantee error at most ε with probability at least 1 − δ.
Computational Efficiency: The algorithm runs in O(mn) time—polynomial in sample size and dimension. Thus, conjunctions are efficiently PAC-learnable.
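A minimal sketch of the elimination rule described above (variable names and the toy target are illustrative); a fuller proper-versus-improper comparison appears later on this page.

```python
import numpy as np

def learn_conjunction(samples):
    """Start from the conjunction of all 2n literals and drop every literal
    falsified by a positive example. A literal is a pair (index, polarity):
    polarity True means x_i, False means NOT x_i."""
    n = len(samples[0][0])
    hypothesis = {(i, True) for i in range(n)} | {(i, False) for i in range(n)}
    for x, label in samples:
        if label == 1:
            # literal (i, True) is false when x[i] == 0; (i, False) when x[i] == 1
            hypothesis -= {(i, x[i] == 0) for i in range(n)}
    return hypothesis

def predict(hypothesis, x):
    return int(all((x[i] == 1) == polarity for i, polarity in hypothesis))

# Target: x_1 AND NOT x_3 over 4 variables (0-based indices 0 and 2)
target = lambda x: int(x[0] == 1 and x[2] == 0)
rng = np.random.default_rng(1)
data = [(x, target(x)) for x in rng.integers(0, 2, (50, 4))]
h = learn_conjunction(data)
print(sorted(h))                                    # surviving literals
print(all(predict(h, x) == y for x, y in data))     # consistent with the sample
```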
Example 2: Axis-Aligned Rectangles in ℝ²
Concept class: C = {rectangles [a,b] × [c,d] with axes-parallel sides}
A concept c_{a,b,c,d}(x,y) = 1 iff a ≤ x ≤ b and c ≤ y ≤ d.
Class size: |C| = ∞ (continuous parameters)
VC dimension: VC(C) = 4
Theorem: Axis-aligned rectangles are PAC-learnable with sample complexity O((1/ε) log(1/δ)).
```python
import numpy as np
from typing import Tuple, List


class RectangleLearner:
    """
    PAC learner for axis-aligned rectangles in R^2.

    Algorithm: Find the tightest rectangle containing all positive
    examples. This is the 'most specific' consistent hypothesis.

    Key insight: The learned rectangle is contained in the target
    rectangle. Errors occur only in the 'gap' between learned and
    target rectangles.
    """

    def __init__(self):
        self.rectangle = None  # (a, b, c, d)

    def learn(self, samples: List[Tuple[float, float, int]]) -> None:
        """Learn from labeled samples."""
        positives = [(x, y) for x, y, label in samples if label == 1]

        if not positives:
            # No positive examples: output empty rectangle
            self.rectangle = (float('inf'), float('-inf'),
                              float('inf'), float('-inf'))
            return

        # Tightest rectangle containing all positives
        xs = [p[0] for p in positives]
        ys = [p[1] for p in positives]
        self.rectangle = (min(xs), max(xs), min(ys), max(ys))

    def predict(self, x: float, y: float) -> int:
        """Predict label for point (x, y)."""
        a, b, c, d = self.rectangle
        return int(a <= x <= b and c <= y <= d)


def analyze_rectangle_learning(m: int, epsilon: float, delta: float):
    """
    Analyze PAC learning for rectangles.

    Theorem: With m = O((4/ε) ln(4/δ)) samples, the learned rectangle
    has generalization error ≤ ε with probability ≥ 1-δ.

    Proof sketch:
    1. All positive examples lie inside the true rectangle R*
    2. Learned rectangle R ⊆ R* (contained in true rectangle)
    3. Error region = R* minus R (points in true but not learned rectangle)
    4. The error region is covered by 4 'strips' (one per side), each
       defined to carry probability mass ε/4
    5. A strip is missed by all m samples with probability at most
       (1 - ε/4)^m ≤ exp(-mε/4)
    6. By the union bound over 4 strips, m ≥ (4/ε) ln(4/δ) samples hit
       every strip with probability ≥ 1-δ, so err(R) ≤ ε
    """
    np.random.seed(42)

    # True rectangle
    true_rect = (0.2, 0.8, 0.3, 0.7)  # [0.2, 0.8] × [0.3, 0.7]
    a_t, b_t, c_t, d_t = true_rect

    # Distribution: uniform over [0, 1]²
    def sample():
        x, y = np.random.uniform(0, 1, 2)
        label = int(a_t <= x <= b_t and c_t <= y <= d_t)
        return x, y, label

    # Generate training samples
    samples = [sample() for _ in range(m)]

    # Learn rectangle
    learner = RectangleLearner()
    learner.learn(samples)

    # Estimate generalization error
    n_test = 100000
    errors = 0
    for _ in range(n_test):
        x, y = np.random.uniform(0, 1, 2)
        true_label = int(a_t <= x <= b_t and c_t <= y <= d_t)
        pred_label = learner.predict(x, y)
        errors += (true_label != pred_label)

    gen_error = errors / n_test

    # Compare to theoretical bound
    # Required m for PAC: m ≥ (4/ε) ln(4/δ)
    theoretical_m = int(np.ceil((4/epsilon) * np.log(4/delta)))

    return {
        'samples': m,
        'gen_error': gen_error,
        'epsilon_target': epsilon,
        'theoretical_m': theoretical_m,
        'learned_rect': learner.rectangle,
        'true_rect': true_rect
    }


# Run experiment
result = analyze_rectangle_learning(m=200, epsilon=0.05, delta=0.05)
print(f"Samples: {result['samples']}")
print(f"Generalization error: {result['gen_error']:.4f}")
print(f"Target ε: {result['epsilon_target']}")
print(f"Theoretical m required: {result['theoretical_m']}")
```

Example 3: Linear Separators (Halfspaces) in ℝⁿ
Concept class: C = {h_{w,b}(x) = sign(w·x + b) : w ∈ ℝⁿ, b ∈ ℝ}
A halfspace partitions ℝⁿ into two regions separated by a hyperplane.
VC dimension: VC(C) = n + 1
Theorem: Linear separators in ℝⁿ are PAC-learnable with sample complexity O((n/ε) log(n/ε)).
Note on Computation: While the sample complexity is polynomial, a consistent linear separator (when one exists) can be found in polynomial time by solving a linear program. However, finding the best separator in the agnostic case is NP-hard.
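As a sketch of that computational claim (assuming SciPy is available; the helper name and toy data are illustrative), a consistent halfspace on strictly separable data can be found with a feasibility linear program:

```python
import numpy as np
from scipy.optimize import linprog

def consistent_halfspace(X: np.ndarray, y: np.ndarray):
    """Find (w, b) with y_i * (w . x_i + b) >= 1 for all i, assuming labels are
    in {-1, +1} and the sample is strictly linearly separable; returns None if
    the LP is infeasible (no separating hyperplane with margin)."""
    m, n = X.shape
    c = np.zeros(n + 1)                              # feasibility only: minimize 0
    # y_i (w . x_i + b) >= 1   <=>   -y_i * [x_i, 1] . [w, b] <= -1
    A_ub = -y[:, None] * np.hstack([X, np.ones((m, 1))])
    b_ub = -np.ones(m)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (n + 1), method="highs")
    if not res.success:
        return None
    return res.x[:n], res.x[n]

# Tiny separable example in R^2
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 3.0], [4.0, 2.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
w, b = consistent_halfspace(X, y)
print("w =", w, "b =", b, "signs =", np.sign(X @ w + b))
```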
| Concept Class | VC Dimension | Sample Complexity | Computational Complexity |
|---|---|---|---|
| Conjunctions (n vars) | n | O(n/ε log(n/δ)) | O(mn) - polynomial |
| Disjunctions (n vars) | n | O(n/ε log(n/δ)) | O(mn) - polynomial |
| k-CNF (k fixed) | O(nᵏ) | O(nᵏ/ε log(n/δ)) | O(mnᵏ) - polynomial in n |
| Intervals on ℝ | 2 | O(1/ε log(1/δ)) | O(m log m) - polynomial |
| Rectangles in ℝ² | 4 | O(1/ε log(1/δ)) | O(m) - polynomial |
| Linear separators ℝⁿ | n + 1 | O(n/ε log(n/ε)) | Polynomial (LP) |
PAC-learnability involves two distinct types of efficiency:
1. Statistical Efficiency (Sample Complexity): the number of samples m₀(1/ε, 1/δ, n) must be polynomial in 1/ε, 1/δ, and n.
2. Computational Efficiency (Time Complexity): the learning algorithm must run in time polynomial in the sample size and those same parameters.
A concept class can be statistically learnable but computationally intractable, efficiently learnable in both senses, or not learnable at all.
The crucial distinction: Finite VC dimension guarantees existence of a learning algorithm, but not that it's efficient!
Some classes have low VC dimension (few samples needed) but ERM is NP-hard. For example: 3-term DNF formulas. The VC dimension is O(n), requiring only O(n/ε) samples. But finding a consistent 3-term DNF is NP-hard! Statistical tractability doesn't imply computational tractability.
Example: The Hardness of Proper Learning
Consider 3-term DNF formulas: (conjunction) ∨ (conjunction) ∨ (conjunction)
Example: (x₁ ∧ x₂) ∨ (x₃ ∧ ¬x₄) ∨ (x₅)
Statistical Efficiency: The class of 3-term DNF formulas has VC dimension O(n), so O(n/ε) samples suffice; statistically, the problem is easy.
Computational Problem: Finding a 3-term DNF consistent with a labeled sample is NP-hard, so proper learning of 3-term DNF is intractable unless RP = NP.
The Improper Learning Escape: Every 3-term DNF can be rewritten as a 3-CNF by distributing the disjunction over the conjunctions, and 3-CNF formulas are efficiently learnable. Learning with 3-CNF as the hypothesis class therefore learns 3-term DNF improperly in polynomial time, as the sketch below illustrates.
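Here is a small sketch of the rewriting step (the string encoding of literals is an illustrative convention): distributing the disjunction over the three conjunctions yields one 3-literal clause for every choice of one literal from each term.

```python
from itertools import product

def three_term_dnf_to_cnf(t1, t2, t3):
    """Expand (t1 OR t2 OR t3), where each t_i is a set of literals read as a
    conjunction, into an equivalent CNF: one clause (l1 OR l2 OR l3) for every
    choice of l1 in t1, l2 in t2, l3 in t3."""
    return [frozenset({l1, l2, l3}) for l1, l2, l3 in product(t1, t2, t3)]

# (x1 AND x2) OR (x3 AND NOT x4) OR (x5), with '~' marking negation
cnf = three_term_dnf_to_cnf({"x1", "x2"}, {"x3", "~x4"}, {"x5"})
print(f"{len(cnf)} clauses:")
for clause in cnf:
    print("  (" + " OR ".join(sorted(clause)) + ")")
```

Since a 3-CNF over n variables has only O(n³) possible clauses, treating each possible clause as a feature and running the efficient conjunction learner over those features handles 3-term DNF improperly in polynomial time.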
"""Illustrating the difference between proper and improper learning. Proper learning: H = C (same hypothesis class as concept class)Improper learning: H ⊃ C (broader hypothesis class) Key insight: Improper learning can be easier!""" from typing import List, Tupleimport numpy as np # Example: Learning conjunctions class ProperConjunctionLearner: """ Proper learner: outputs a conjunction from the class of conjunctions. This learner finds the most specific conjunction consistent with data. Efficient for conjunctions: O(mn) time. """ def __init__(self, n: int): self.n = n self.literals = None # Set of (variable, polarity) pairs def learn(self, samples: List[Tuple[np.ndarray, int]]): # Initialize with all possible literals self.literals = set() for i in range(self.n): self.literals.add((i, True)) # positive literal self.literals.add((i, False)) # negative literal # Remove literals inconsistent with positive examples for x, label in samples: if label == 1: for i in range(self.n): if x[i] == 0: self.literals.discard((i, True)) else: self.literals.discard((i, False)) return self.literals class ImproperDNFLearner: """ Improper learner: learns conjunctions but outputs DNF. This might seem wasteful, but for some classes (like 3-term DNF), improper learning with a larger class (3-CNF) is tractable while proper learning is NP-hard! """ def __init__(self, n: int): self.n = n self.dnf_terms = [] # List of conjunctions def learn(self, samples: List[Tuple[np.ndarray, int]]): # Simple strategy: memorize positive examples as terms # Each positive example becomes a term with full specification for x, label in samples: if label == 1: # Create a term that matches exactly this example term = [] for i in range(self.n): term.append((i, x[i] == 1)) self.dnf_terms.append(term) return self.dnf_terms def predict(self, x: np.ndarray) -> int: # OR of all terms: positive if any term is satisfied for term in self.dnf_terms: satisfied = True for i, polarity in term: if polarity and x[i] != 1: satisfied = False break if not polarity and x[i] != 0: satisfied = False break if satisfied: return 1 return 0 def compare_learning_strategies(): """ Compare proper and improper learning. For conjunctions: both work fine. For 3-DNF: proper is NP-hard, improper (via 3-CNF) is polynomial! """ print("Learning Strategy Comparison") print("=" * 60) print("1. Conjunctions:") print(" - Proper learning: O(mn) time, output is a conjunction") print(" - Improper learning: Also works, but unnecessary") print(" - In this case, proper is preferred (simpler hypothesis)") print("2. 3-term DNF:") print(" - Proper learning: NP-hard (must find consistent 3-DNF)") print(" - Improper with 3-CNF: Polynomial time!") print(" - Key: DNF can be rewritten as CNF (De Morgan's laws)") print(" - Learning 3-CNF improperly learns 3-DNF efficiently") print("3. Intersections of halfspaces:") print(" - Proper learning: Open problem for >2 halfspaces") print(" - Improper learning: Use neural nets (not well understood)") print("" + "=" * 60) print("Takeaway: Improper learning broadens what's tractable!") compare_learning_strategies()For practical machine learning, we need both statistical and computational efficiency. Let's formalize what 'efficient' means.
Definition: Efficient PAC-Learnability
A concept class C is efficiently PAC-learnable if there exists an algorithm A such that: (1) the sample complexity m₀(1/ε, 1/δ, n) is polynomial; (2) A runs in time polynomial in m, 1/ε, 1/δ, and n; and (3) the hypothesis A outputs can be evaluated on any new instance in polynomial time.
All three conditions must hold for the problem to be tractable in practice.
Condition 3 is sometimes overlooked but important. Even if we can learn a hypothesis efficiently, we need to be able to USE it efficiently for prediction. A hypothesis represented as an exponentially large table is useless even if found in polynomial time.
The Hierarchy of Learnability:
Efficiently PAC-learnable (poly time, poly samples)
PAC-learnable but not efficiently (poly samples, exp time)
Not PAC-learnable (infinite VC dimension)
Not everything is learnable. Understanding what cannot be learned is as important as understanding what can.
Example 1: The Class of All Functions
Let C = {all functions c: X → {0,1}}.
For X = {0,1}ⁿ, this class has |C| = 2^{2^n} functions.
VC dimension: VC(C) = 2ⁿ (the entire domain, of size 2ⁿ, is shattered)
Why it's not PAC-learnable: the VC dimension grows exponentially with n, so no sample size polynomial in n can suffice; after seeing any m examples, the labels of the remaining points are completely unconstrained.
No Free Lunch: For the class of all functions, every learner has the same expected error over all possible targets. You can't do better than random guessing without restricting the hypothesis class! This is why assumptions (like limited VC dimension) are necessary for learning.
Example 2: Parity Functions with Noise
Concept class: Parity functions over n Boolean variables: each concept is c_S(x) = ⊕_{i∈S} xᵢ for some subset S ⊆ {1, ..., n}
Statistical perspective: VC dimension is n, so O(n/ε) samples suffice
Computational perspective: With random classification noise, no efficient algorithm is known—this is the Learning Parity with Noise (LPN) problem, widely believed to be computationally hard
Under this cryptographic-style hardness assumption, the class is not efficiently learnable in the noisy setting, even though it is statistically learnable!
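To see where the gap comes from (an illustrative sketch, not part of the source): without noise a parity is recovered exactly by Gaussian elimination over GF(2), but this algebraic approach collapses once a fraction of the labels is flipped, which is exactly the regime conjectured to be hard.

```python
import numpy as np

def learn_parity_noiseless(X, y):
    """Recover the secret subset S of a parity c_S(x) = XOR of x_i for i in S
    from noiseless examples by Gaussian elimination over GF(2). Assumes the
    example matrix has full column rank (likely with ~2n random examples)."""
    A = np.concatenate([X, y.reshape(-1, 1)], axis=1) % 2   # augmented system
    m, ncols = A.shape
    n = ncols - 1
    row, pivots = 0, []
    for col in range(n):
        hits = np.nonzero(A[row:, col])[0]
        if hits.size == 0:
            continue                                        # no pivot in this column
        A[[row, row + hits[0]]] = A[[row + hits[0], row]]   # swap pivot row up
        for r in range(m):                                  # clear the column elsewhere
            if r != row and A[r, col] == 1:
                A[r] ^= A[row]
        pivots.append(col)
        row += 1
        if row == m:
            break
    s = np.zeros(n, dtype=int)
    for r, col in enumerate(pivots):
        s[col] = A[r, n]
    return s

rng = np.random.default_rng(0)
n = 8
secret = rng.integers(0, 2, n)
X = rng.integers(0, 2, (2 * n, n))
y = (X @ secret) % 2
print("secret:   ", secret)
print("recovered:", learn_parity_noiseless(X, y))
```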
Example 3: Unbounded Depth Decision Trees
Concept class: All decision trees (no depth limit)
For n Boolean variables, this class can represent any Boolean function: a complete tree of depth n has 2ⁿ leaves, one per input assignment, so all 2^(2ⁿ) labelings are realizable.
Not PAC-learnable with sample complexity polynomial in n, for the same reason as the class of all functions: the VC dimension is 2ⁿ.
Contrast with Bounded Trees: Trees of depth at most d have VC dimension roughly 2ᵈ log n, which is polynomial in n for fixed d, so they are PAC-learnable; the code below compares the two regimes.
```python
import numpy as np
from typing import List, Tuple, Callable


def demonstrate_no_free_lunch():
    """
    Demonstrate the No Free Lunch theorem: without assumptions,
    every learner is as good (or bad) as any other.
    """
    np.random.seed(42)

    n = 4  # Small dimension for illustration

    # All possible inputs
    X = [tuple(int(b) for b in format(i, f'0{n}b')) for i in range(2**n)]

    # Generate random training sample
    m = 8  # Half the domain
    train_indices = np.random.choice(len(X), m, replace=False)
    train_X = [X[i] for i in train_indices]

    # Count how many consistent functions exist
    # (functions that agree on training data but vary on test data)
    test_indices = [i for i in range(len(X)) if i not in train_indices]
    n_test = len(test_indices)

    # Any labeling of training points is consistent with 2^(n_test) functions
    # (one for each possible labeling of test points)
    consistent_functions = 2 ** n_test

    print("No Free Lunch Demonstration")
    print("=" * 60)
    print(f"Input dimension: n = {n}")
    print(f"Total instances: |X| = {2**n}")
    print(f"Training samples: m = {m}")
    print(f"Unseen instances: 2^n - m = {2**n - m}")
    print(f"Functions consistent with training: 2^{n_test} = {consistent_functions}")
    print()
    print("For EACH possible labeling of training data,")
    print(f"there are {consistent_functions} functions that produce it.")
    print("Without assumptions, we can't distinguish between them!")
    print()
    print("Implication: For the class of ALL functions,")
    print("no learner can generalize better than random guessing.")


def demonstrate_shattering_infinite_vc():
    """
    Show that the class of all functions shatters any finite set.
    """
    print("Infinite VC Dimension: Class of All Functions")
    print("=" * 60)

    # For any m points, we can realize all 2^m labelings
    for m in [1, 2, 3, 4, 5]:
        labelings = 2 ** m
        print(f"m = {m} points: {labelings:>4} possible labelings, ALL achievable")

    print()
    print("VC dimension = ∞ (can shatter any finite set)")
    print("⟹ NOT PAC-learnable (need VC dimension < ∞)")


def contrast_bounded_unbounded_trees():
    """
    Show the difference between bounded and unbounded decision trees.
    """
    print("Bounded vs Unbounded Decision Trees")
    print("=" * 60)

    n = 10  # 10 Boolean variables

    # Unbounded trees: can represent all Boolean functions on n variables
    unbounded_vc = 2 ** n

    # Bounded trees: VC dimension depends on depth
    for depth in [1, 2, 3, 4, 5]:
        # VC dimension ≈ 2^d * log(n)
        bounded_vc = (2 ** depth) * np.log2(n)
        print(f"Depth = {depth}: VC ≈ {bounded_vc:.0f}")

    print(f"Unbounded: VC = {unbounded_vc} = 2^n (exponential in n)")
    print()
    print("Bounded depth trees: PAC-learnable")
    print("Unbounded depth trees: NOT PAC-learnable with poly(n) samples")


demonstrate_no_free_lunch()
demonstrate_shattering_infinite_vc()
contrast_bounded_unbounded_trees()
```

We've explored the theory of PAC-learnability—what makes a concept class learnable and what makes it hard. Let's consolidate the key insights:
| Property | Requirement | Implication |
|---|---|---|
| Statistical learnability | Finite VC dimension | Poly samples suffice |
| Computational learnability | ERM is polynomial | Poly time learning |
| Efficient learnability | Both above | Practical ML |
| Proper learning possible | Class-specific | Sometimes NP-hard |
| Improper learning helps | Class-specific | May bypass hardness |
You now understand PAC-learnability: what it means for a concept class to be learnable, the separation between statistical and computational efficiency, and why some classes are fundamentally not learnable. Next, we'll examine finite hypothesis classes in greater detail, setting the stage for VC dimension analysis of infinite classes.