We've established the PAC framework and derived sample complexity bounds. But a deeper question remains: which concept classes are PAC-learnable in the first place?
This isn't a trivial question. Not every learning problem admits a solution—some concept classes are provably not learnable, no matter how much data we collect or how clever our algorithm.
Understanding PAC-learnability helps us distinguish problems that admit efficient learning algorithms from problems that are fundamentally intractable, and tells us where the boundary lies.
By the end of this page, you will understand: (1) the formal definition of PAC-learnability, (2) the difference between proper and improper learning, (3) examples of learnable concept classes (conjunctions, linear separators), (4) the crucial connection between sample complexity and learnability, and (5) why some concept classes are not efficiently learnable.
Let's revisit and refine the formal definition of PAC-learnability, paying careful attention to all components.
Setup: an instance space X, a concept class C of functions c: X → {0,1}, a hypothesis class H, an unknown distribution D over X, and an unknown target concept c* ∈ C.
A concept class C is PAC-learnable by hypothesis class H if there exists an algorithm A and a polynomial sample-size function m₀(·, ·, ·) such that for every ε, δ ∈ (0, 1), every distribution D over X, and every target c* ∈ C:
When A is given m ≥ m₀(1/ε, 1/δ, n) i.i.d. samples from D labeled by c*, it outputs h ∈ H satisfying:
Pr[err_D(h) ≤ ε] ≥ 1 - δ
If additionally A runs in time polynomial in m, 1/ε, 1/δ, and n, we say C is efficiently PAC-learnable.
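As a concrete reading of the definition's quantifiers, here is a minimal sketch (using the finite-class bound m ≥ (1/ε)(ln|C| + ln(1/δ)) from the previously derived sample complexity bounds; the function name is illustrative) that returns a sufficient m₀ for a finite concept class.

```python
from math import ceil, log

def m0_finite_class(class_size: int, epsilon: float, delta: float) -> int:
    """A sufficient m0 for a finite concept class in the realizable setting:
    m >= (1/eps) * (ln|C| + ln(1/delta))."""
    return ceil((log(class_size) + log(1 / delta)) / epsilon)

# Example: conjunctions over n = 20 Boolean variables, |C| = 3**20
print(m0_finite_class(3 ** 20, epsilon=0.1, delta=0.05))
```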
Key Nuances:
1. The Role of n (Representation Size):
For most concept classes, complexity grows with a parameter n: the number of Boolean variables for conjunctions, or the dimension of the input space for halfspaces in ℝⁿ.
PAC-learnability requires m₀ to be polynomial in n. Exponential dependence on n would make learning intractable for high-dimensional problems.
2. Proper vs. Improper Learning:
In proper learning, the output hypothesis must itself belong to C (H = C); in improper learning, the learner may output a hypothesis from a strictly larger class H ⊇ C. Improper learning can sometimes be easier! The learner has more flexibility in representing the learned concept.
3. Distribution-Free Requirement:
The definition requires learning to work for every distribution D—we cannot assume D is uniform, Gaussian, or any particular form. This distribution-free property makes PAC guarantees robust but also makes some problems harder.
4. Realizability Assumption:
The standard PAC definition assumes c* ∈ C (the target is in the concept class). This is known as the realizable setting. The agnostic setting drops this assumption, allowing the target to be any function—we then compete with the best hypothesis in H.
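In symbols, the two success criteria differ only in the benchmark the learner must meet (a restatement of the definitions above in display form):

```latex
\text{Realizable: } \Pr\big[\operatorname{err}_D(h) \le \epsilon\big] \ge 1-\delta
\qquad\qquad
\text{Agnostic: } \Pr\Big[\operatorname{err}_D(h) \le \min_{h' \in H}\operatorname{err}_D(h') + \epsilon\Big] \ge 1-\delta
```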
A fundamental result in learning theory characterizes exactly when PAC learning is possible:
Fundamental Theorem of PAC Learning:
A concept class C is PAC-learnable if and only if C has finite VC dimension.
Furthermore, for a class with VC dimension d, roughly O((d log(1/ε) + log(1/δ))/ε) samples suffice in the realizable case and O((d + log(1/δ))/ε²) in the agnostic case, with matching lower bounds up to logarithmic factors.
We'll explore VC dimension in detail in the next module. For now, the key insight is that learnability is tied to a combinatorial measure of complexity.
VC dimension measures the 'richness' of a hypothesis class—specifically, the size of the largest set of points that can be labeled in all possible ways by hypotheses in the class. A class with VC dimension d requires Ω(d/ε) samples to learn, and O((d log(1/ε) + log(1/δ))/ε) samples suffice in the realizable case. Infinite VC dimension implies the class is not PAC-learnable.
Why Finite VC Dimension Implies Learnability:
The argument proceeds in two steps:
Uniform Convergence: If VC(C) = d is finite, then with m = O((d + log(1/δ))/ε²) samples, with probability at least 1 − δ the training error of every hypothesis in C is within ε of its true error. This uses the Sauer-Shelah lemma to bound the effective number of behaviors of C on a finite sample.
ERM Works: With uniform convergence, ERM finds a hypothesis with low true error. Any consistent hypothesis (in the realizable case) has low true error.
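As a small illustration of step 1 (an illustrative Monte-Carlo setup, not from the source): for threshold functions h_t(x) = 1[x ≥ t] on [0, 1], a class of VC dimension 1, the largest gap between training and true error shrinks as the sample grows.

```python
import numpy as np

def uniform_convergence_demo(m_values=(20, 100, 500, 2500), trials=200, seed=0):
    """Estimate E[ sup_t |empirical error(h_t) - true error(h_t)| ] for the
    threshold class on [0, 1] with target t* = 0.5 and D uniform."""
    rng = np.random.default_rng(seed)
    thresholds = np.linspace(0.0, 1.0, 201)
    true_err = np.abs(thresholds - 0.5)          # err_D(h_t) = |t - 0.5| under uniform D
    for m in m_values:
        gaps = []
        for _ in range(trials):
            x = rng.uniform(0.0, 1.0, m)
            labels = (x >= 0.5).astype(int)      # labels from the target h_{0.5}
            preds = (x[None, :] >= thresholds[:, None]).astype(int)
            emp_err = (preds != labels[None, :]).mean(axis=1)
            gaps.append(np.max(np.abs(emp_err - true_err)))
        print(f"m = {m:5d}: average sup-gap ≈ {np.mean(gaps):.3f}")

uniform_convergence_demo()
```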
Why Infinite VC Dimension Implies Non-Learnability:
If VC(C) = ∞, then for every sample size m there exist a set of 2m points shattered by C and a distribution D concentrated on those points.
No finite sample can distinguish between all concepts—an adversary can always pick a c consistent with training data but different on test data.
```python
import numpy as np
from typing import Set, List, Tuple


def illustrate_shattering():
    """
    Illustrate the concept of shattering, which underlies VC dimension.

    A set S is 'shattered' by class H if every labeling of S is
    achievable by some hypothesis in H.

    VC dimension = size of largest shattered set
    """
    print("Shattering Illustration")
    print("=" * 60)

    # Example 1: Linear separators in R^2
    # Can shatter any 3 non-collinear points
    # Cannot shatter any 4 points (one configuration fails)
    # VC dimension = 3
    print("1. Linear Separators in R^2:")
    print("   - 3 non-collinear points: ALL 8 labelings achievable ✓")
    print("   - 4 points (any config): Some labeling impossible ✗")
    print("   - VC dimension = 3")

    # Example 2: Axis-aligned rectangles in R^2
    # Can shatter 4 points (corners of a rectangle)
    # Cannot shatter 5 points
    # VC dimension = 4
    print("2. Axis-Aligned Rectangles in R^2:")
    print("   - 4 points (rectangle corners): ALL 16 labelings ✓")
    print("   - 5 points: Configuration with center fails ✗")
    print("   - VC dimension = 4")

    # Example 3: Intervals on R
    # Can shatter 2 points (any pair)
    # Cannot shatter 3 points (endpoint labeling fails)
    # VC dimension = 2
    print("3. Intervals on R:")
    print("   - 2 points: ALL 4 labelings (empty, left, right, both) ✓")
    print("   - 3 points: Labeling (1, 0, 1) impossible ✗")
    print("   - VC dimension = 2")

    # Example 4: All functions X → {0,1}
    # Can shatter ANY finite set
    # VC dimension = infinity
    # NOT PAC-learnable!
    print("4. All Functions X → {0,1}:")
    print("   - Can shatter ANY finite set")
    print("   - VC dimension = ∞")
    print("   - NOT PAC-learnable!")

    print("\n" + "=" * 60)
    print("Key Insight: Finite VC dimension ⟺ PAC-learnable")


def sample_complexity_from_vc(d: int, epsilon: float, delta: float) -> Tuple[int, int]:
    """
    Compute sample complexity bounds from VC dimension.

    Fundamental theorem: m = O((d + log(1/δ)) / ε)

    More precise bound (realizable): m = O((d log(1/ε) + log(1/δ)) / ε)

    Agnostic bound: m = O((d + log(1/δ)) / ε²)
    """
    from math import log, ceil

    # Realizable case bound
    realizable = ceil((d * log(1/epsilon) + log(1/delta)) / epsilon)

    # Agnostic case bound
    agnostic = ceil((d + log(1/delta)) / (epsilon ** 2))

    return realizable, agnostic


def compare_complexity_bounds():
    """Compare sample complexity for different hypothesis classes."""
    print("Sample Complexity Comparison (ε=0.05, δ=0.05)")
    print("-" * 60)

    cases = [
        ("Intervals on R", 2),
        ("Rectangles in R²", 4),
        ("Linear separators R²", 3),
        ("Linear separators R¹⁰", 11),
        ("Linear separators R¹⁰⁰", 101),
        ("Decision stumps (100 features)", 2),
    ]

    epsilon, delta = 0.05, 0.05
    for name, vc_dim in cases:
        real, agn = sample_complexity_from_vc(vc_dim, epsilon, delta)
        print(f"{name:30} (d={vc_dim:3}): "
              f"Realizable={real:>6,}, Agnostic={agn:>7,}")


illustrate_shattering()
compare_complexity_bounds()
```

Let's examine several concept classes and prove their PAC-learnability.
Example 1: Conjunctions over n Boolean Variables
Concept class: C = {conjunctions of literals over x₁, ..., xₙ}
Examples: x₁ ∧ x₃, x₁ ∧ ¬x₂ ∧ x₅, etc.
Class size: |C| = 3ⁿ (each variable: positive, negative, or absent)
Theorem: Conjunctions are PAC-learnable with sample complexity O((n/ε) log(n/δ)).
Initialize h = x₁ ∧ ¬x₁ ∧ x₂ ∧ ¬x₂ ∧ ... ∧ xₙ ∧ ¬xₙ (all literals). For each positive example x: remove any literal ℓ where ℓ(x) = 0. Output h = remaining literals. This runs in O(mn) time and finds the unique most-specific consistent conjunction.
Proof of Learnability:
Statistical Efficiency: The output is always a conjunction consistent with the sample (the most specific one), and |C| = 3ⁿ, so the finite-class bound m ≥ (1/ε)(ln|C| + ln(1/δ)) = O((n + log(1/δ))/ε) samples guarantee error at most ε with probability at least 1 − δ.
Computational Efficiency: The algorithm runs in O(mn) time—polynomial in sample size and dimension. Thus, conjunctions are efficiently PAC-learnable.
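A minimal sketch of the elimination rule described above (variable names and the toy target are illustrative); a fuller proper-versus-improper comparison appears later on this page.

```python
import numpy as np

def learn_conjunction(samples):
    """Start from the conjunction of all 2n literals and drop every literal
    falsified by a positive example. A literal is a pair (index, polarity):
    polarity True means x_i, False means NOT x_i."""
    n = len(samples[0][0])
    hypothesis = {(i, True) for i in range(n)} | {(i, False) for i in range(n)}
    for x, label in samples:
        if label == 1:
            # literal (i, True) is false when x[i] == 0; (i, False) when x[i] == 1
            hypothesis -= {(i, x[i] == 0) for i in range(n)}
    return hypothesis

def predict(hypothesis, x):
    return int(all((x[i] == 1) == polarity for i, polarity in hypothesis))

# Target: x_1 AND NOT x_3 over 4 variables (0-based indices 0 and 2)
target = lambda x: int(x[0] == 1 and x[2] == 0)
rng = np.random.default_rng(1)
data = [(x, target(x)) for x in rng.integers(0, 2, (50, 4))]
h = learn_conjunction(data)
print(sorted(h))                                    # surviving literals
print(all(predict(h, x) == y for x, y in data))     # consistent with the sample
```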
Example 2: Axis-Aligned Rectangles in ℝ²
Concept class: C = {rectangles [a,b] × [c,d] with axes-parallel sides}
A concept c_{a,b,c,d}(x,y) = 1 iff a ≤ x ≤ b and c ≤ y ≤ d.
Class size: |C| = ∞ (continuous parameters)
VC dimension: VC(C) = 4
Theorem: Axis-aligned rectangles are PAC-learnable with sample complexity O((1/ε) log(1/δ)).
```python
import numpy as np
from typing import Tuple, List


class RectangleLearner:
    """
    PAC learner for axis-aligned rectangles in R^2.

    Algorithm: Find the tightest rectangle containing all positive
    examples. This is the 'most specific' consistent hypothesis.

    Key insight: The learned rectangle is contained in the target
    rectangle. Errors occur only in the 'gap' between learned and
    target rectangles.
    """

    def __init__(self):
        self.rectangle = None  # (a, b, c, d)

    def learn(self, samples: List[Tuple[float, float, int]]) -> None:
        """Learn from labeled samples."""
        positives = [(x, y) for x, y, label in samples if label == 1]

        if not positives:
            # No positive examples: output empty rectangle
            self.rectangle = (float('inf'), float('-inf'),
                              float('inf'), float('-inf'))
            return

        # Tightest rectangle containing all positives
        xs = [p[0] for p in positives]
        ys = [p[1] for p in positives]
        self.rectangle = (min(xs), max(xs), min(ys), max(ys))

    def predict(self, x: float, y: float) -> int:
        """Predict label for point (x, y)."""
        a, b, c, d = self.rectangle
        return int(a <= x <= b and c <= y <= d)


def analyze_rectangle_learning(m: int, epsilon: float, delta: float):
    """
    Analyze PAC learning for rectangles.

    Theorem: With m = O((4/ε) ln(4/δ)) samples, the learned rectangle
    has generalization error ≤ ε with probability ≥ 1-δ.

    Proof sketch:
    1. All positive examples lie inside the true rectangle R*
    2. Learned rectangle R ⊆ R* (contained in true rectangle)
    3. Error region = R* minus R (points in true but not learned rectangle)
    4. The error region is covered by 4 'strips' (one per side), each
       defined to carry probability mass ε/4
    5. A strip is missed by all m samples with probability at most
       (1 - ε/4)^m ≤ exp(-mε/4)
    6. By the union bound over 4 strips, m ≥ (4/ε) ln(4/δ) samples hit
       every strip with probability ≥ 1-δ, so err(R) ≤ ε
    """
    np.random.seed(42)

    # True rectangle
    true_rect = (0.2, 0.8, 0.3, 0.7)  # [0.2, 0.8] × [0.3, 0.7]
    a_t, b_t, c_t, d_t = true_rect

    # Distribution: uniform over [0, 1]²
    def sample():
        x, y = np.random.uniform(0, 1, 2)
        label = int(a_t <= x <= b_t and c_t <= y <= d_t)
        return x, y, label

    # Generate training samples
    samples = [sample() for _ in range(m)]

    # Learn rectangle
    learner = RectangleLearner()
    learner.learn(samples)

    # Estimate generalization error
    n_test = 100000
    errors = 0
    for _ in range(n_test):
        x, y = np.random.uniform(0, 1, 2)
        true_label = int(a_t <= x <= b_t and c_t <= y <= d_t)
        pred_label = learner.predict(x, y)
        errors += (true_label != pred_label)

    gen_error = errors / n_test

    # Compare to theoretical bound
    # Required m for PAC: m ≥ (4/ε) ln(4/δ)
    theoretical_m = int(np.ceil((4/epsilon) * np.log(4/delta)))

    return {
        'samples': m,
        'gen_error': gen_error,
        'epsilon_target': epsilon,
        'theoretical_m': theoretical_m,
        'learned_rect': learner.rectangle,
        'true_rect': true_rect
    }


# Run experiment
result = analyze_rectangle_learning(m=200, epsilon=0.05, delta=0.05)
print(f"Samples: {result['samples']}")
print(f"Generalization error: {result['gen_error']:.4f}")
print(f"Target ε: {result['epsilon_target']}")
print(f"Theoretical m required: {result['theoretical_m']}")
```

Example 3: Linear Separators (Halfspaces) in ℝⁿ
Concept class: C = {h_{w,b}(x) = sign(w·x + b) : w ∈ ℝⁿ, b ∈ ℝ}
A halfspace partitions ℝⁿ into two regions separated by a hyperplane.
VC dimension: VC(C) = n + 1
Theorem: Linear separators in ℝⁿ are PAC-learnable with sample complexity O((n/ε) log(n/ε)).
Note on Computation: While the sample complexity is polynomial, a consistent linear separator (when one exists) can be found in polynomial time by solving a linear program. However, finding the best separator in the agnostic case is NP-hard.
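As a sketch of that computational claim (assuming SciPy is available; the helper name and toy data are illustrative), a consistent halfspace on strictly separable data can be found with a feasibility linear program:

```python
import numpy as np
from scipy.optimize import linprog

def consistent_halfspace(X: np.ndarray, y: np.ndarray):
    """Find (w, b) with y_i * (w . x_i + b) >= 1 for all i, assuming labels are
    in {-1, +1} and the sample is strictly linearly separable; returns None if
    the LP is infeasible (no separating hyperplane with margin)."""
    m, n = X.shape
    c = np.zeros(n + 1)                              # feasibility only: minimize 0
    # y_i (w . x_i + b) >= 1   <=>   -y_i * [x_i, 1] . [w, b] <= -1
    A_ub = -y[:, None] * np.hstack([X, np.ones((m, 1))])
    b_ub = -np.ones(m)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (n + 1), method="highs")
    if not res.success:
        return None
    return res.x[:n], res.x[n]

# Tiny separable example in R^2
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 3.0], [4.0, 2.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
w, b = consistent_halfspace(X, y)
print("w =", w, "b =", b, "signs =", np.sign(X @ w + b))
```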
| Concept Class | VC Dimension | Sample Complexity | Computational Complexity |
|---|---|---|---|
| Conjunctions (n vars) | n | O(n/ε log(n/δ)) | O(mn) - polynomial |
| Disjunctions (n vars) | n | O(n/ε log(n/δ)) | O(mn) - polynomial |
| k-CNF (k fixed) | O(nᵏ) | O(nᵏ/ε log(n/δ)) | O(mnᵏ) - polynomial in n |
| Intervals on ℝ | 2 | O(1/ε log(1/δ)) | O(m log m) - polynomial |
| Rectangles in ℝ² | 4 | O(1/ε log(1/δ)) | O(m) - polynomial |
| Linear separators ℝⁿ | n + 1 | O(n/ε log(n/ε)) | Polynomial (LP) |
PAC-learnability involves two distinct types of efficiency:
1. Statistical Efficiency (Sample Complexity): the number of samples m₀(1/ε, 1/δ, n) must be polynomial in 1/ε, 1/δ, and n.
2. Computational Efficiency (Time Complexity): the learning algorithm must run in time polynomial in the sample size and those same parameters.
A concept class can be statistically learnable but computationally intractable, efficiently learnable in both senses, or not learnable at all.
The crucial distinction: Finite VC dimension guarantees existence of a learning algorithm, but not that it's efficient!
Some classes have low VC dimension (few samples needed) but ERM is NP-hard. For example: 3-term DNF formulas. The VC dimension is O(n), requiring only O(n/ε) samples. But finding a consistent 3-term DNF is NP-hard! Statistical tractability doesn't imply computational tractability.
Example: The Hardness of Proper Learning
Consider 3-term DNF formulas: (conjunction) ∨ (conjunction) ∨ (conjunction)
Example: (x₁ ∧ x₂) ∨ (x₃ ∧ ¬x₄) ∨ (x₅)
Statistical Efficiency: The class of 3-term DNF formulas has VC dimension O(n), so O(n/ε) samples suffice; statistically, the problem is easy.
Computational Problem: Finding a 3-term DNF consistent with a labeled sample is NP-hard, so proper learning of 3-term DNF is intractable unless RP = NP.
The Improper Learning Escape: Every 3-term DNF can be rewritten as a 3-CNF by distributing the disjunction over the conjunctions, and 3-CNF formulas are efficiently learnable. Learning with 3-CNF as the hypothesis class therefore learns 3-term DNF improperly in polynomial time, as the sketch below illustrates.
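Here is a small sketch of the rewriting step (the string encoding of literals is an illustrative convention): distributing the disjunction over the three conjunctions yields one 3-literal clause for every choice of one literal from each term.

```python
from itertools import product

def three_term_dnf_to_cnf(t1, t2, t3):
    """Expand (t1 OR t2 OR t3), where each t_i is a set of literals read as a
    conjunction, into an equivalent CNF: one clause (l1 OR l2 OR l3) for every
    choice of l1 in t1, l2 in t2, l3 in t3."""
    return [frozenset({l1, l2, l3}) for l1, l2, l3 in product(t1, t2, t3)]

# (x1 AND x2) OR (x3 AND NOT x4) OR (x5), with '~' marking negation
cnf = three_term_dnf_to_cnf({"x1", "x2"}, {"x3", "~x4"}, {"x5"})
print(f"{len(cnf)} clauses:")
for clause in cnf:
    print("  (" + " OR ".join(sorted(clause)) + ")")
```

Since a 3-CNF over n variables has only O(n³) possible clauses, treating each possible clause as a feature and running the efficient conjunction learner over those features handles 3-term DNF improperly in polynomial time.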
"""Illustrating the difference between proper and improper learning. Proper learning: H = C (same hypothesis class as concept class)Improper learning: H ⊃ C (broader hypothesis class) Key insight: Improper learning can be easier!""" from typing import List, Tupleimport numpy as np # Example: Learning conjunctions class ProperConjunctionLearner: """ Proper learner: outputs a conjunction from the class of conjunctions. This learner finds the most specific conjunction consistent with data. Efficient for conjunctions: O(mn) time. """ def __init__(self, n: int): self.n = n self.literals = None # Set of (variable, polarity) pairs def learn(self, samples: List[Tuple[np.ndarray, int]]): # Initialize with all possible literals self.literals = set() for i in range(self.n): self.literals.add((i, True)) # positive literal self.literals.add((i, False)) # negative literal # Remove literals inconsistent with positive examples for x, label in samples: if label == 1: for i in range(self.n): if x[i] == 0: self.literals.discard((i, True)) else: self.literals.discard((i, False)) return self.literals class ImproperDNFLearner: """ Improper learner: learns conjunctions but outputs DNF. This might seem wasteful, but for some classes (like 3-term DNF), improper learning with a larger class (3-CNF) is tractable while proper learning is NP-hard! """ def __init__(self, n: int): self.n = n self.dnf_terms = [] # List of conjunctions def learn(self, samples: List[Tuple[np.ndarray, int]]): # Simple strategy: memorize positive examples as terms # Each positive example becomes a term with full specification for x, label in samples: if label == 1: # Create a term that matches exactly this example term = [] for i in range(self.n): term.append((i, x[i] == 1)) self.dnf_terms.append(term) return self.dnf_terms def predict(self, x: np.ndarray) -> int: # OR of all terms: positive if any term is satisfied for term in self.dnf_terms: satisfied = True for i, polarity in term: if polarity and x[i] != 1: satisfied = False break if not polarity and x[i] != 0: satisfied = False break if satisfied: return 1 return 0 def compare_learning_strategies(): """ Compare proper and improper learning. For conjunctions: both work fine. For 3-DNF: proper is NP-hard, improper (via 3-CNF) is polynomial! """ print("Learning Strategy Comparison") print("=" * 60) print("1. Conjunctions:") print(" - Proper learning: O(mn) time, output is a conjunction") print(" - Improper learning: Also works, but unnecessary") print(" - In this case, proper is preferred (simpler hypothesis)") print("2. 3-term DNF:") print(" - Proper learning: NP-hard (must find consistent 3-DNF)") print(" - Improper with 3-CNF: Polynomial time!") print(" - Key: DNF can be rewritten as CNF (De Morgan's laws)") print(" - Learning 3-CNF improperly learns 3-DNF efficiently") print("3. Intersections of halfspaces:") print(" - Proper learning: Open problem for >2 halfspaces") print(" - Improper learning: Use neural nets (not well understood)") print("" + "=" * 60) print("Takeaway: Improper learning broadens what's tractable!") compare_learning_strategies()For practical machine learning, we need both statistical and computational efficiency. Let's formalize what 'efficient' means.
Definition: Efficient PAC-Learnability
A concept class C is efficiently PAC-learnable if there exists an algorithm A such that: (1) the sample complexity m₀(1/ε, 1/δ, n) is polynomial; (2) A runs in time polynomial in m, 1/ε, 1/δ, and n; and (3) the hypothesis A outputs can be evaluated on any new instance in polynomial time.
All three conditions must hold for the problem to be tractable in practice.
Condition 3 is sometimes overlooked but important. Even if we can learn a hypothesis efficiently, we need to be able to USE it efficiently for prediction. A hypothesis represented as an exponentially large table is useless even if found in polynomial time.
The Hierarchy of Learnability:
Efficiently PAC-learnable (poly time, poly samples)
PAC-learnable but not efficiently (poly samples, exp time)
Not PAC-learnable (infinite VC dimension)
Not everything is learnable. Understanding what cannot be learned is as important as understanding what can.
Example 1: The Class of All Functions
Let C = {all functions c: X → {0,1}}.
For X = {0,1}ⁿ, this class has |C| = 2^{2^n} functions.
VC dimension: VC(C) = 2ⁿ (the entire domain, of size 2ⁿ, is shattered)
Why it's not PAC-learnable: the VC dimension grows exponentially with n, so no sample size polynomial in n can suffice; after seeing any m examples, the labels of the remaining points are completely unconstrained.
No Free Lunch: For the class of all functions, every learner has the same expected error over all possible targets. You can't do better than random guessing without restricting the hypothesis class! This is why assumptions (like limited VC dimension) are necessary for learning.
Example 2: Parity Functions with Noise
Concept class: Parity functions over n Boolean variables: each concept is c_S(x) = ⊕_{i∈S} xᵢ for some subset S ⊆ {1, ..., n}
Statistical perspective: VC dimension is n, so O(n/ε) samples suffice
Computational perspective: With random classification noise, no efficient algorithm is known—this is the Learning Parity with Noise (LPN) problem, widely believed to be computationally hard
Under this cryptographic-style hardness assumption, the class is not efficiently learnable in the noisy setting, even though it is statistically learnable!
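To see where the gap comes from (an illustrative sketch, not part of the source): without noise a parity is recovered exactly by Gaussian elimination over GF(2), but this algebraic approach collapses once a fraction of the labels is flipped, which is exactly the regime conjectured to be hard.

```python
import numpy as np

def learn_parity_noiseless(X, y):
    """Recover the secret subset S of a parity c_S(x) = XOR of x_i for i in S
    from noiseless examples by Gaussian elimination over GF(2). Assumes the
    example matrix has full column rank (likely with ~2n random examples)."""
    A = np.concatenate([X, y.reshape(-1, 1)], axis=1) % 2   # augmented system
    m, ncols = A.shape
    n = ncols - 1
    row, pivots = 0, []
    for col in range(n):
        hits = np.nonzero(A[row:, col])[0]
        if hits.size == 0:
            continue                                        # no pivot in this column
        A[[row, row + hits[0]]] = A[[row + hits[0], row]]   # swap pivot row up
        for r in range(m):                                  # clear the column elsewhere
            if r != row and A[r, col] == 1:
                A[r] ^= A[row]
        pivots.append(col)
        row += 1
        if row == m:
            break
    s = np.zeros(n, dtype=int)
    for r, col in enumerate(pivots):
        s[col] = A[r, n]
    return s

rng = np.random.default_rng(0)
n = 8
secret = rng.integers(0, 2, n)
X = rng.integers(0, 2, (2 * n, n))
y = (X @ secret) % 2
print("secret:   ", secret)
print("recovered:", learn_parity_noiseless(X, y))
```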
Example 3: Unbounded Depth Decision Trees
Concept class: All decision trees (no depth limit)
For n Boolean variables, this class can represent any Boolean function: a complete tree of depth n has 2ⁿ leaves, one per input assignment, so all 2^(2ⁿ) labelings are realizable.
Not PAC-learnable with sample complexity polynomial in n, for the same reason as the class of all functions: the VC dimension is 2ⁿ.
Contrast with Bounded Trees: Trees of depth at most d have VC dimension roughly 2ᵈ log n, which is polynomial in n for fixed d, so they are PAC-learnable; the code below compares the two regimes.
```python
import numpy as np
from typing import List, Tuple, Callable


def demonstrate_no_free_lunch():
    """
    Demonstrate the No Free Lunch theorem: without assumptions,
    every learner is as good (or bad) as any other.
    """
    np.random.seed(42)

    n = 4  # Small dimension for illustration

    # All possible inputs
    X = [tuple(int(b) for b in format(i, f'0{n}b')) for i in range(2**n)]

    # Generate random training sample
    m = 8  # Half the domain
    train_indices = np.random.choice(len(X), m, replace=False)
    train_X = [X[i] for i in train_indices]

    # Count how many consistent functions exist
    # (functions that agree on training data but vary on test data)
    test_indices = [i for i in range(len(X)) if i not in train_indices]
    n_test = len(test_indices)

    # Any labeling of training points is consistent with 2^(n_test) functions
    # (one for each possible labeling of test points)
    consistent_functions = 2 ** n_test

    print("No Free Lunch Demonstration")
    print("=" * 60)
    print(f"Input dimension: n = {n}")
    print(f"Total instances: |X| = {2**n}")
    print(f"Training samples: m = {m}")
    print(f"Unseen instances: 2^n - m = {2**n - m}")
    print(f"Functions consistent with training: 2^{n_test} = {consistent_functions}")
    print()
    print("For EACH possible labeling of training data,")
    print(f"there are {consistent_functions} functions that produce it.")
    print("Without assumptions, we can't distinguish between them!")
    print()
    print("Implication: For the class of ALL functions,")
    print("no learner can generalize better than random guessing.")


def demonstrate_shattering_infinite_vc():
    """
    Show that the class of all functions shatters any finite set.
    """
    print("Infinite VC Dimension: Class of All Functions")
    print("=" * 60)

    # For any m points, we can realize all 2^m labelings
    for m in [1, 2, 3, 4, 5]:
        labelings = 2 ** m
        print(f"m = {m} points: {labelings:>4} possible labelings, ALL achievable")

    print()
    print("VC dimension = ∞ (can shatter any finite set)")
    print("⟹ NOT PAC-learnable (need VC dimension < ∞)")


def contrast_bounded_unbounded_trees():
    """
    Show the difference between bounded and unbounded decision trees.
    """
    print("Bounded vs Unbounded Decision Trees")
    print("=" * 60)

    n = 10  # 10 Boolean variables

    # Unbounded trees: can represent all Boolean functions on n variables
    unbounded_vc = 2 ** n

    # Bounded trees: VC dimension depends on depth
    for depth in [1, 2, 3, 4, 5]:
        # VC dimension ≈ 2^d * log(n)
        bounded_vc = (2 ** depth) * np.log2(n)
        print(f"Depth = {depth}: VC ≈ {bounded_vc:.0f}")

    print(f"Unbounded: VC = {unbounded_vc} = 2^n (exponential in n)")
    print()
    print("Bounded depth trees: PAC-learnable")
    print("Unbounded depth trees: NOT PAC-learnable with poly(n) samples")


demonstrate_no_free_lunch()
demonstrate_shattering_infinite_vc()
contrast_bounded_unbounded_trees()
```

We've explored the theory of PAC-learnability—what makes a concept class learnable and what makes it hard. Let's consolidate the key insights:
| Property | Requirement | Implication |
|---|---|---|
| Statistical learnability | Finite VC dimension | Poly samples suffice |
| Computational learnability | ERM is polynomial | Poly time learning |
| Efficient learnability | Both above | Practical ML |
| Proper learning possible | Class-specific | Sometimes NP-hard |
| Improper learning helps | Class-specific | May bypass hardness |
You now understand PAC-learnability: what it means for a concept class to be learnable, the separation between statistical and computational efficiency, and why some classes are fundamentally not learnable. Next, we'll examine finite hypothesis classes in greater detail, setting the stage for VC dimension analysis of infinite classes.