We've established that PAC learning is possible when the hypothesis class has finite VC dimension, and we've derived sample complexity bounds. Now we turn our attention to the special case of finite hypothesis classes—classes with finitely many distinct hypotheses.
Finite hypothesis classes are the 'easy case' of PAC learning theory: the sample complexity bounds are tight, and the proofs require nothing more than a union bound and a simple exponential inequality.
Understanding finite classes deeply will prepare us for the more sophisticated analysis of infinite classes via VC dimension.
By the end of this page, you will understand: (1) precise sample complexity bounds for finite classes, (2) the relationship between hypothesis class size and generalization, (3) concrete analysis of important finite classes like conjunctions and finite automata, (4) the Occam's Razor principle in learning theory, and (5) how finite class analysis motivates VC dimension.
Let's state the tight bounds for learning finite hypothesis classes.
Theorem (Finite Hypothesis Class PAC Learning):
Let H be a finite hypothesis class with |H| hypotheses. Then:
Upper Bound (Sufficiency): For the realizable case (∃ h* ∈ H with err_D(h*) = 0):
m ≥ (1/ε)(ln|H| + ln(1/δ))
samples suffice to learn any target in H to error ε with probability 1-δ.
Lower Bound (Necessity): Any learning algorithm requires:
m ≥ (1 - ε)(ln|H| - 1) / ln(1/(1-ε))
samples in the worst case. For small ε, this is approximately (1/ε) ln|H|.
The gap between upper and lower bounds is O(ln(1/δ))—just the confidence term. The dependence on |H| and ε is TIGHT: Θ((1/ε) ln|H|). This means we fully understand the sample complexity of finite hypothesis classes.
The Agnostic Case:
When we don't assume the target is in H (agnostic learning), the bound becomes:
m ≥ (1/(2ε²))(ln|H| + ln(2/δ))
Note the (1/ε²) instead of (1/ε). This quadratic dependence reflects the harder task: we're estimating error rates precisely, not just eliminating inconsistent hypotheses.
Comparison:
| Setting | Sample Complexity | Key Feature |
|---|---|---|
| Realizable | O((1/ε) ln\|H\|) | Eliminate hypotheses inconsistent with the data |
| Agnostic | O((1/ε²) ln\|H\|) | Estimate every hypothesis's error rate precisely |
```python
import numpy as np
from math import log, ceil

def sample_complexity_realizable(epsilon: float, delta: float, H_size: int) -> int:
    """
    Sample complexity for realizable PAC learning of finite class.
    m ≥ (1/ε)(ln|H| + ln(1/δ))
    """
    return ceil((1/epsilon) * (log(H_size) + log(1/delta)))

def sample_complexity_agnostic(epsilon: float, delta: float, H_size: int) -> int:
    """
    Sample complexity for agnostic PAC learning of finite class.
    m ≥ (1/(2ε²))(ln|H| + ln(2/δ))
    """
    return ceil((1/(2 * epsilon**2)) * (log(H_size) + log(2/delta)))

def lower_bound_realizable(epsilon: float, H_size: int) -> int:
    """
    Lower bound on sample complexity (any algorithm).
    m ≥ (1-ε)(ln|H| - 1) / ln(1/(1-ε))
    For small ε, approximately: m ≥ (1/ε) ln|H| * (1 - O(ε))
    """
    if epsilon >= 1:
        return 0
    return ceil((1 - epsilon) * (log(H_size) - 1) / log(1/(1-epsilon)))

def analyze_finite_class_bounds():
    """Comprehensive analysis of finite class bounds."""
    print("Finite Hypothesis Class: Sample Complexity Analysis")
    print("=" * 70)

    # Example 1: Varying hypothesis class size
    print("1. Effect of Hypothesis Class Size |H|")
    print("-" * 50)
    print(f"{'|H|':>12} | {'Realizable':>12} | {'Agnostic':>12} | {'Lower Bound':>12}")
    print("-" * 50)
    epsilon, delta = 0.05, 0.05
    for log_H in range(1, 8):
        H_size = 10 ** log_H
        real = sample_complexity_realizable(epsilon, delta, H_size)
        agn = sample_complexity_agnostic(epsilon, delta, H_size)
        lb = lower_bound_realizable(epsilon, H_size)
        print(f"{H_size:>12,} | {real:>12,} | {agn:>12,} | {lb:>12,}")

    # Example 2: Varying epsilon
    print("2. Effect of Accuracy Parameter ε")
    print("-" * 50)
    print(f"{'ε':>8} | {'Realizable':>12} | {'Agnostic':>12} | {'Ratio':>8}")
    print("-" * 50)
    H_size, delta = 10000, 0.05
    for epsilon in [0.2, 0.1, 0.05, 0.02, 0.01, 0.005]:
        real = sample_complexity_realizable(epsilon, delta, H_size)
        agn = sample_complexity_agnostic(epsilon, delta, H_size)
        ratio = agn / real if real > 0 else 0
        print(f"{epsilon:>8.3f} | {real:>12,} | {agn:>12,} | {ratio:>8.1f}x")
    # Key observation: agnostic/realizable ratio grows with 1/ε

    # Example 3: Concrete concept classes
    print("3. Concrete Concept Classes")
    print("-" * 50)
    n = 20  # 20 Boolean variables
    epsilon, delta = 0.05, 0.05
    classes = [
        ("Conjunctions", 3**n),                   # 3^n (each var: +, -, absent)
        ("k-CNF (k=3)", n**3 * 8),                # Rough: (n choose 3) * 8 clauses
        ("Decision stumps", 2*n),                 # 2n thresholds
        ("Decision trees (d=3)", (2*n)**(2**3)),  # Rough approximation
    ]
    for name, H_size in classes:
        real = sample_complexity_realizable(epsilon, delta, H_size)
        agn = sample_complexity_agnostic(epsilon, delta, H_size)
        print(f"{name:25} |H|={H_size:>15,} | Real={real:>8,} | Agn={agn:>10,}")

analyze_finite_class_bounds()
```

The proof of the finite hypothesis class bound is instructive—it uses the union bound technique, which is fundamental throughout learning theory.
Proof Strategy (Realizable Case):
We want to show: if m ≥ (1/ε)(ln|H| + ln(1/δ)), then with probability ≥ 1-δ, any consistent hypothesis h (err_S(h) = 0) has err_D(h) ≤ ε.
Step 1: Define 'bad' hypotheses
A hypothesis h is ε-bad if err_D(h) > ε. We want to bound the probability that any ε-bad hypothesis is consistent with training data.
Step 2: Bound probability for one bad hypothesis
For a fixed ε-bad hypothesis h, each i.i.d. training example agrees with h with probability at most 1 - ε, so Pr[h is consistent with all m examples] ≤ (1 - ε)^m.
We use the fundamental inequality: (1-x) ≤ e^{-x} for all x. This allows us to convert the probability bound into an exponential form that's easier to manipulate. For x = ε and m examples: (1-ε)^m ≤ e^{-εm}.
Step 3: Apply union bound over all bad hypotheses
Let B ⊆ H be the set of ε-bad hypotheses. Then:
Pr[any bad h is consistent] ≤ Σ_{h ∈ B} Pr[h consistent] ≤ |B| · e^{-εm} ≤ |H| · e^{-εm}
Step 4: Set failure probability ≤ δ
We want |H| · e^{-εm} ≤ δ.
Solving:

e^{-εm} ≤ δ/|H|
-εm ≤ ln(δ/|H|)
m ≥ (1/ε) ln(|H|/δ)
m ≥ (1/ε)(ln|H| + ln(1/δ)) ✓
```python
import numpy as np
from typing import List, Tuple

def verify_pac_bound_empirically(H_size: int, epsilon: float, delta: float,
                                 n_trials: int = 1000) -> dict:
    """
    Empirically verify the PAC bound for finite hypothesis classes.

    Setup:
    - H contains H_size hypotheses (represented as random bit strings)
    - Target h* is one of them (picked randomly)
    - Distribution D is uniform over instance space
    - We verify that the bound holds with probability ≥ 1-δ
    """
    from math import log, ceil

    # Compute required sample size
    m = ceil((1/epsilon) * (log(H_size) + log(1/delta)))

    # Simulate learning
    successes = 0
    for trial in range(n_trials):
        # Create hypotheses as random functions on m+100 points
        # Each hypothesis: random binary string of length m+100
        n_points = m + 100  # Extra for test set
        hypotheses = np.random.randint(0, 2, (H_size, n_points))

        # Pick target (one of the hypotheses)
        target_idx = np.random.randint(0, H_size)
        target = hypotheses[target_idx]

        # Generate training sample (first m points)
        train_labels = target[:m]

        # Find consistent hypotheses
        consistent_mask = np.all(hypotheses[:, :m] == train_labels, axis=1)
        consistent_indices = np.where(consistent_mask)[0]

        # Pick any consistent hypothesis (in practice, ERM picks one)
        if len(consistent_indices) == 0:
            # This shouldn't happen if target is in H
            continue
        picked_idx = np.random.choice(consistent_indices)
        picked = hypotheses[picked_idx]

        # Evaluate on test points (remaining 100 points)
        test_errors = np.sum(picked[m:] != target[m:])
        test_error_rate = test_errors / 100

        # Check if error ≤ epsilon
        if test_error_rate <= epsilon:
            successes += 1

    success_rate = successes / n_trials
    return {
        'H_size': H_size,
        'epsilon': epsilon,
        'delta': delta,
        'm_required': m,
        'empirical_success_rate': success_rate,
        'target_success_rate': 1 - delta,
        'bound_holds': success_rate >= 1 - delta - 0.05  # Allow small slack
    }

def demonstrate_union_bound():
    """
    Step-by-step demonstration of the union bound proof.
    """
    print("Union Bound Proof Walkthrough")
    print("=" * 60)

    # Parameters
    H_size = 1000
    epsilon = 0.1
    delta = 0.05

    # Step 1: One bad hypothesis
    print("Step 1: Probability one ε-bad hypothesis survives m samples")
    for m in [10, 20, 50, 100]:
        p_survive = (1 - epsilon) ** m
        p_survive_exp = np.exp(-epsilon * m)
        print(f"  m={m}: (1-ε)^m = {p_survive:.2e}, e^{{-εm}} = {p_survive_exp:.2e}")

    # Step 2: Union bound
    print(f"Step 2: Union bound over |H| = {H_size} hypotheses")
    for m in [10, 20, 50, 100]:
        p_any_bad = H_size * np.exp(-epsilon * m)
        print(f"  m={m}: Pr[any bad survives] ≤ {p_any_bad:.4f}")

    # Step 3: Solve for m
    print(f"Step 3: Required m for δ = {delta}")
    m_required = int(np.ceil((1/epsilon) * (np.log(H_size) + np.log(1/delta))))
    p_fail = H_size * np.exp(-epsilon * m_required)
    print(f"  m = (1/ε)(ln|H| + ln(1/δ)) = {m_required}")
    print(f"  Pr[failure] ≤ |H| e^{{-εm}} = {p_fail:.4f} ≤ δ = {delta} ✓")

    # Empirical verification
    print("Step 4: Empirical Verification")
    print("-" * 40)
    for H_size in [100, 500, 1000]:
        result = verify_pac_bound_empirically(H_size, 0.15, 0.10, n_trials=500)
        print(f"  |H|={H_size:4}: m={result['m_required']:3}, "
              f"success={result['empirical_success_rate']:.2f} "
              f"(target ≥{result['target_success_rate']:.2f})")

demonstrate_union_bound()
```

The sample complexity bound m = O((1/ε) ln|H|) reveals a deep principle: smaller hypothesis classes require fewer samples. This is the theoretical justification for Occam's Razor in machine learning.
Occam's Razor Principle:
Among hypotheses consistent with the data, prefer the simplest one—not for aesthetic reasons, but because simpler hypothesis classes generalize better with less data.
The Learning-Theoretic Interpretation:
If hypothesis class H₁ ⊂ H₂ (H₁ is simpler), then ln|H₁| < ln|H₂|: a consistent hypothesis found in H₁ comes with a stronger generalization guarantee, or equivalently needs fewer samples for the same guarantee, than one found in H₂.
This isn't about preferring simple hypotheses per se. It's about the search space. If you search over decision trees of depth 3 vs depth 20, the depth-3 space is smaller, so consistent hypotheses have better generalization guarantees. The 'simplicity' comes from constraining the hypothesis class, not from regularization.
Formal Statement: The Occam Learning Algorithm
Definition: An algorithm A is an Occam algorithm for concept class C using hypothesis class H if, given any sample S labeled by a concept c ∈ C, it runs in polynomial time and outputs a hypothesis h ∈ H that is consistent with S and has a succinct description: its length is polynomial in the size of c and sublinear in the number of examples |S|.
Theorem (Occam's Razor for Learning):
If A is an Occam algorithm that outputs hypotheses of length at most k bits, then with probability ≥ 1-δ:
err_D(A(S)) ≤ ε when m ≥ (k + ln(1/δ)) / ε
Key Insight: The effective hypothesis class size is 2^k (hypotheses of length k)—smaller is better!
```python
import numpy as np
from math import log, ceil

def occam_sample_complexity(description_length: int, epsilon: float, delta: float) -> int:
    """
    Sample complexity from Occam's Razor theorem.
    If hypothesis has description length k bits,
    m ≥ (k + ln(1/δ)) / ε suffices.
    """
    return ceil((description_length + log(1/delta)) / epsilon)

def analyze_occam_effect():
    """
    Demonstrate how smaller hypothesis classes generalize better.
    """
    print("Occam's Razor in PAC Learning")
    print("=" * 60)
    epsilon, delta = 0.05, 0.05

    # Compare decision trees of different depths
    print("1. Decision Trees: Depth vs Sample Complexity")
    print("-" * 50)
    n = 20  # features
    for depth in [1, 2, 3, 4, 5, 10]:
        # Decision tree description: O(2^d * log n) bits
        description_length = (2 ** depth) * int(np.ceil(np.log2(n)))
        m = occam_sample_complexity(description_length, epsilon, delta)
        print(f"  Depth {depth:2}: Description ~{description_length:4} bits, "
              f"Samples needed: {m:,}")

    # Polynomial vs exponential separation
    print("2. Linear vs Exponential Hypothesis Classes")
    print("-" * 50)
    for n in [10, 20, 50, 100]:
        # Linear separator: O(n) bits
        lin_bits = n * 32  # 32-bit weights
        m_lin = occam_sample_complexity(lin_bits, epsilon, delta)

        # Decision tree of depth n/4: O(2^(n/4) * log n) bits
        tree_bits = (2 ** (n//4)) * int(np.ceil(np.log2(n)))
        m_tree = occam_sample_complexity(tree_bits, epsilon, delta)

        print(f"  n={n:3}: Linear ({lin_bits} bits) → {m_lin:,} samples")
        print(f"          Tree d={n//4} ({tree_bits} bits) → {m_tree:,} samples")

    # Sparsity effect
    print("3. Effect of Sparsity (k sparse features out of n)")
    print("-" * 50)
    n = 100
    for k in [1, 5, 10, 20, 50, 100]:
        # k-sparse linear: O(k log n + k * 32) bits
        sparse_bits = k * (int(np.ceil(np.log2(n))) + 32)
        m_sparse = occam_sample_complexity(sparse_bits, epsilon, delta)
        print(f"  k={k:3} out of {n}: {sparse_bits:5} bits → {m_sparse:,} samples")

analyze_occam_effect()

# Key insight:
# - Depth-3 tree needs ~1000 samples
# - Depth-10 tree needs ~20000 samples
# - 20x more data for modest increase in expressivity!
```

Let's conduct a detailed analysis of PAC learning for Boolean conjunctions—one of the most well-understood concept classes.
Problem Setup: the instance space is X = {0,1}ⁿ, and the concept class consists of conjunctions of literals over x₁, …, xₙ. Each variable can appear positively, appear negated, or be absent, so |H| = 3ⁿ.
The Learning Algorithm (Find-S for Conjunctions): start with the conjunction of all 2n literals (x₁ ∧ ¬x₁ ∧ … ∧ xₙ ∧ ¬xₙ); for each positive training example, delete every literal the example violates; ignore negative examples; output the conjunction of the surviving literals.
This algorithm finds the most specific consistent conjunction. It never removes a literal needed by the target (target positives satisfy that literal). It might keep extra literals that happen to be satisfied by all observed positives. Errors occur only when these extra literals cause false negatives.
Sample Complexity Analysis:
From the general bound: m ≥ (1/ε)(ln|H| + ln(1/δ)) = (1/ε)(n ln 3 + ln(1/δ))
Tighter Analysis: We can also bound the error by analyzing Find-S directly. Call a literal z bad if z is not in the target but Pr_D[x is a positive example that violates z] ≥ ε/(2n). Any positive example that violates z removes it, so a bad literal survives m i.i.d. examples with probability at most (1 - ε/(2n))^m ≤ e^{-mε/(2n)}. A union bound over the at most 2n candidate literals shows that once m ≥ (2n/ε)(ln(2n) + ln(1/δ)), with probability ≥ 1-δ every bad literal has been removed, and then err_D(h) ≤ 2n · ε/(2n) = ε.

Result: m = O((n/ε) log(n/δ)) samples suffice, matching the general bound up to logarithmic factors.
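For comparison, this small sketch evaluates the generic finite-class bound and the bound from the direct Find-S analysis side by side (constants as derived above, not optimized; the helper names are ours):

```python
from math import log, ceil

def generic_bound(n: int, epsilon: float, delta: float) -> int:
    # Generic finite-class bound with |H| = 3^n: m >= (1/eps)(n ln 3 + ln(1/delta))
    return ceil((1 / epsilon) * (n * log(3) + log(1 / delta)))

def direct_bound(n: int, epsilon: float, delta: float) -> int:
    # Direct Find-S analysis: union bound over the <= 2n candidate literals;
    # each "bad" literal (error contribution >= eps/(2n)) survives m samples
    # with probability <= (1 - eps/(2n))^m, giving m >= (2n/eps)(ln(2n) + ln(1/delta)).
    return ceil((2 * n / epsilon) * (log(2 * n) + log(1 / delta)))

for n in [10, 20, 50, 100]:
    print(f"n={n:4}: generic bound = {generic_bound(n, 0.05, 0.05):8,}, "
          f"direct bound = {direct_bound(n, 0.05, 0.05):8,}")
```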
```python
import numpy as np
from typing import Set, List, Tuple
from math import log, ceil

class ConjunctionLearner:
    """
    PAC learner for Boolean conjunctions using Find-S algorithm.
    """
    def __init__(self, n: int):
        self.n = n
        # Literals: (var_index, is_positive)
        # Start with all possible literals
        self.literals: Set[Tuple[int, bool]] = set()
        for i in range(n):
            self.literals.add((i, True))   # xi
            self.literals.add((i, False))  # ¬xi

    def learn(self, samples: List[Tuple[np.ndarray, int]]):
        """Learn from labeled samples."""
        # Process only positive examples
        for x, label in samples:
            if label == 1:
                for i in range(self.n):
                    if x[i] == 0:
                        # xi = 0, so remove positive literal xi
                        self.literals.discard((i, True))
                    else:
                        # xi = 1, so remove negative literal ¬xi
                        self.literals.discard((i, False))
        return self.literals

    def predict(self, x: np.ndarray) -> int:
        """Predict label for instance x."""
        for var, is_positive in self.literals:
            if is_positive and x[var] == 0:
                return 0
            if not is_positive and x[var] == 1:
                return 0
        return 1

    def to_string(self) -> str:
        """String representation of learned conjunction."""
        if not self.literals:
            return "TRUE (empty conjunction)"
        terms = []
        for var, is_positive in sorted(self.literals):
            terms.append(f"x{var}" if is_positive else f"¬x{var}")
        return " ∧ ".join(terms)

def analyze_conjunction_learning():
    """Comprehensive analysis of conjunction learning."""
    np.random.seed(42)
    print("Conjunction Learning Analysis")
    print("=" * 60)
    n = 10  # 10 Boolean variables

    # Target conjunction: x0 ∧ x3 ∧ ¬x5
    target_literals = {(0, True), (3, True), (5, False)}

    def true_concept(x: np.ndarray) -> int:
        for var, is_positive in target_literals:
            if is_positive and x[var] == 0:
                return 0
            if not is_positive and x[var] == 1:
                return 0
        return 1

    # Learning experiments with varying sample sizes
    print(f"Target: x0 ∧ x3 ∧ ¬x5")
    print(f"Hypothesis class size: |H| = 3^{n} = {3**n:,}")
    print("Sample Size vs Learning Success")
    print("-" * 50)
    sample_sizes = [5, 10, 20, 50, 100, 200, 500]
    n_trials = 100
    for m in sample_sizes:
        successes = 0
        avg_extra_literals = 0
        for trial in range(n_trials):
            # Generate training data
            samples = []
            for _ in range(m):
                x = np.random.randint(0, 2, n)
                label = true_concept(x)
                samples.append((x, label))

            # Learn
            learner = ConjunctionLearner(n)
            learned = learner.learn(samples)

            # Count extra literals (not in target)
            extra = learned - target_literals
            avg_extra_literals += len(extra)

            # Estimate generalization error
            test_errors = 0
            n_test = 1000
            for _ in range(n_test):
                x = np.random.randint(0, 2, n)
                true_label = true_concept(x)
                pred_label = learner.predict(x)
                if true_label != pred_label:
                    test_errors += 1

            if test_errors / n_test <= 0.05:  # ε = 0.05
                successes += 1

        success_rate = successes / n_trials
        avg_extra = avg_extra_literals / n_trials

        # Theoretical bound
        theoretical_m = ceil((1/0.05) * (n * log(3) + log(20)))  # δ = 0.05
        print(f"m={m:4}: Success rate={success_rate:.2f}, "
              f"Avg extra literals={avg_extra:.1f}, "
              f"(Theory: m≥{theoretical_m})")

analyze_conjunction_learning()
```

Beyond conjunctions, several finite concept classes are important for theory and practice.
1. k-DNF Formulas (Fixed k)
A k-DNF is a disjunction (OR) of terms, where each term is a conjunction (AND) of at most k literals.
Example (2-DNF): (x₁ ∧ x₂) ∨ (x₃ ∧ ¬x₄) ∨ x₅
There are O((2n)ᵏ) possible terms, so ln|H| = O(nᵏ): sample complexity is polynomial in n for fixed k, but exponential when k = Θ(n).
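As a back-of-the-envelope check (a sketch with rough constants; the helper names are ours), one can count the possible terms and plug ln|H| ≤ (#terms)·ln 2 into the generic bound:

```python
from math import comb, log, ceil

def num_kdnf_terms(n: int, k: int) -> int:
    # Terms = conjunctions of at most k literals over n variables:
    # choose i variables and a sign for each, for i = 1..k
    return sum(comb(n, i) * (2 ** i) for i in range(1, k + 1))

def kdnf_sample_bound(n: int, k: int, epsilon: float, delta: float) -> int:
    # A k-DNF is a subset of the possible terms, so |H| <= 2^(#terms)
    # and ln|H| <= (#terms) * ln 2
    ln_H = num_kdnf_terms(n, k) * log(2)
    return ceil((1 / epsilon) * (ln_H + log(1 / delta)))

for n in [10, 20, 50]:
    for k in [1, 2, 3]:
        print(f"n={n:3}, k={k}: terms={num_kdnf_terms(n, k):8,}, "
              f"m ≈ {kdnf_sample_bound(n, k, 0.05, 0.05):12,}")
```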
2. Decision Lists
A decision list is an if-then-else structure: "if ℓ₁ then output b₁, else if ℓ₂ then output b₂, …, else output b_{r+1}",
where each ℓᵢ is a literal (xⱼ or ¬xⱼ) and bᵢ ∈ {0,1}.
Key Property: any 1-decision list over n variables can be trimmed so that each literal is tested at most once, giving at most 2n decisions plus a default output. Decision lists strictly generalize both conjunctions and disjunctions, and k-decision lists (whose tests are conjunctions of at most k literals) contain both k-DNF and k-CNF.
Decision lists hit a sweet spot: they are noticeably more expressive than conjunctions and disjunctions, yet have polynomial-sized representations that are efficiently learnable. Rivest's greedy algorithm picks a 'pure' literal at each step (one whose satisfying examples all carry the same label), outputs that label for it, removes the covered examples, and recurses, as sketched below.
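To make the greedy idea concrete, here is a minimal sketch of a Rivest-style learner for 1-decision lists (function and variable names are ours, and it assumes the sample is consistent with some 1-decision list):

```python
import numpy as np
from typing import List, Optional, Tuple

Rule = Tuple[Optional[Tuple[int, int]], int]  # ((variable, required value) or None for default, output)

def learn_decision_list(X: np.ndarray, y: np.ndarray) -> List[Rule]:
    """Greedy (Rivest-style) learner for 1-decision lists over Boolean features."""
    n = X.shape[1]
    remaining = np.ones(len(y), dtype=bool)
    rules: List[Rule] = []
    while remaining.any():
        labels_left = np.unique(y[remaining])
        if len(labels_left) == 1:
            rules.append((None, int(labels_left[0])))  # default rule finishes the list
            return rules
        found = False
        for var in range(n):
            for value in (0, 1):  # value=1 tests literal x_var, value=0 tests ¬x_var
                covered = remaining & (X[:, var] == value)
                if covered.any() and len(np.unique(y[covered])) == 1:
                    # 'Pure' literal: every remaining example it covers shares one label
                    rules.append(((var, value), int(y[covered][0])))
                    remaining &= ~covered  # discard covered examples, recurse on the rest
                    found = True
                    break
            if found:
                break
        if not found:
            raise ValueError("no consistent 1-decision list for this sample")
    rules.append((None, 0))  # arbitrary default if every example was covered
    return rules

def predict(rules: List[Rule], x: np.ndarray) -> int:
    for test, label in rules:
        if test is None or x[test[0]] == test[1]:
            return label
    return 0

# Usage on a target that is itself a decision list: "if x0 then 1, elif ¬x2 then 0, else 1"
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 4))
y = np.where(X[:, 0] == 1, 1, np.where(X[:, 2] == 0, 0, 1))
rules = learn_decision_list(X, y)
print(rules)
print("training accuracy:", np.mean(np.array([predict(rules, x) for x in X]) == y))
```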
3. Finite Automata (DFA)
A Deterministic Finite Automaton with s states over alphabet Σ is specified by a transition function δ: Q × Σ → Q, a start state, and a set of accepting states; it labels a string 1 exactly when processing the string ends in an accepting state. There are at most s^(s|Σ|) · 2^s · s such machines, so ln|H| = O(s log s) for a fixed alphabet.
Famous Results: DFAs are not efficiently PAC-learnable from random examples alone; even finding an approximately minimum consistent DFA is NP-hard, which rules out efficient proper learning. But they ARE learnable with membership and equivalence queries via Angluin's L* algorithm (1987)—the learner can ask 'is string w in the language?' Interactive learning expands what's tractable.
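Information-theoretically the finite-class bound still applies; the obstacle is purely computational. A small sketch (using the rough count of s-state DFAs given above; helper names are ours) makes that concrete:

```python
from math import log, ceil

def dfa_log_class_size(s: int, alphabet_size: int) -> float:
    # Transitions: s^(s*|Sigma|) choices; accepting set: 2^s choices (start state ignored)
    return s * alphabet_size * log(s) + s * log(2)

def dfa_sample_bound(s: int, alphabet_size: int, epsilon: float, delta: float) -> int:
    return ceil((1 / epsilon) * (dfa_log_class_size(s, alphabet_size) + log(1 / delta)))

for s in [5, 10, 50, 100]:
    print(f"s={s:4}: ln|H| ≈ {dfa_log_class_size(s, 2):10.1f}, "
          f"m ≈ {dfa_sample_bound(s, 2, 0.05, 0.05):10,}")
```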
| Class | \|H\| (approx) | Sample Complexity | Efficiently Learnable? |
|---|---|---|---|
| Conjunctions | 3ⁿ | O(n/ε log n) | ✅ Yes |
| k-DNF (fixed k) | 2^(O(n^k)) | O(n^k/ε log n) | ✅ Yes (fixed k) |
| k-CNF (fixed k) | 2^(O(n^k)) | O(n^k/ε log n) | ✅ Yes (fixed k) |
| Decision lists | 2^(O(n log n)) | O(n/ε log n) | ✅ Yes (greedy) |
| Decision trees (depth d) | 2^(2^d) n^(2^d) | O(2^d log n / ε) | ✅ if d small |
| s-state DFA | s^O(s) | O(s/ε log s) | ❌ No (w/o queries) |
```python
import numpy as np
from typing import List, Tuple
from math import log, ceil, comb

def calculate_class_sizes():
    """Calculate hypothesis class sizes for various concept classes."""
    print("Hypothesis Class Size Analysis")
    print("=" * 60)
    n_values = [10, 20, 50, 100]
    for n in n_values:
        print(f"n = {n} Boolean variables:")
        print("-" * 40)

        # Conjunctions: 3^n
        conj_size = 3 ** n
        print(f"  Conjunctions: 3^{n} = 10^{n * np.log10(3):.1f}")

        # 2-DNF: at most n^2 terms, each chosen or not
        # Actually: 2^(O(n^2)) possible formulas
        two_dnf_terms = 4 * comb(n, 2) + 2*n + 1  # pairs with signs + singles
        print(f"  2-DNF terms: {two_dnf_terms} possible terms")

        # Decision lists: ordered subsets of 2n literals
        # Bound: (2n)! = 10^{O(n log n)}
        dl_bound = 2 * n * np.log10(2 * n)
        print(f"  Decision lists: ≤ (2n)! ≈ 10^{dl_bound:.1f}")

        # Decision trees depth d
        for d in [2, 3, 4]:
            # Each internal node: n choices
            # 2^d internal nodes, 2^(d+1) leaves
            dt_size = (n ** (2**d)) * (2 ** (2**(d+1)))
            dt_log = (2**d) * np.log10(n) + (2**(d+1)) * np.log10(2)
            print(f"  Decision trees d={d}: ≈ 10^{dt_log:.1f}")

def sample_complexity_comparison():
    """Compare sample complexity across concept classes."""
    print("=" * 60)
    print("Sample Complexity Comparison (ε=0.05, δ=0.05)")
    print("=" * 60)
    epsilon, delta = 0.05, 0.05
    for n in [10, 20, 50]:
        print(f"n = {n}:")
        print("-" * 40)

        # Conjunctions
        H_conj = 3 ** n
        m_conj = ceil((1/epsilon) * (log(H_conj) + log(1/delta)))
        m_conj_better = ceil((1/epsilon) * (n * log(3) + log(1/delta)))
        print(f"  Conjunctions: m ≈ {m_conj_better:,}")

        # 2-term DNF
        H_2term = (3**n) ** 2  # Crude bound: two conjunctions
        m_2term = ceil((1/epsilon) * (2 * n * log(3) + log(1/delta)))
        print(f"  2-term DNF: m ≈ {m_2term:,}")

        # Decision lists (bound by n log n bits)
        bits_dl = 4 * n * ceil(log(2*n))  # Each entry: position + label
        m_dl = ceil((bits_dl + log(1/delta)) / epsilon)
        print(f"  Decision lists: m ≈ {m_dl:,}")

        # Decision tree depth 3
        d = 3
        bits_tree = (2**d - 1) * ceil(log(n)) + (2**d) * 1  # Internal + leaves
        m_tree = ceil((bits_tree + log(1/delta)) / epsilon)
        print(f"  Trees (d=3): m ≈ {m_tree:,}")

calculate_class_sizes()
sample_complexity_comparison()
```

Finite hypothesis class theory is elegant and complete, but many important hypothesis classes are infinite: intervals on ℝ, rectangles in ℝ², linear classifiers and halfspaces in ℝⁿ, circles in ℝ², and many more.
For infinite classes, |H| = ∞, so the bound m = O(log|H|/ε) is vacuous.
The Question: Can we extend our theory to infinite classes?
Even if H is infinite, when we restrict attention to a finite set of m points, only finitely many distinct labelings are possible. The 'effective size' of H on m points might be much smaller than |H|. VC dimension captures exactly this notion of 'effective complexity.'
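To see this concretely, here is a small sketch (construction and names are ours) that counts the labelings induced on m points by the infinite class of closed intervals on ℝ; the count grows like m², not 2^m:

```python
import numpy as np

def interval_labelings(points: np.ndarray) -> int:
    """Count distinct labelings of the given 1-D points by closed intervals [a, b]."""
    pts = np.sort(points)
    m = len(pts)
    patterns = set()
    patterns.add((0,) * m)  # an interval containing no points labels everything 0
    for i in range(m):
        for j in range(i, m):
            # An interval covering exactly the sorted points i..j
            labeling = tuple(1 if i <= k <= j else 0 for k in range(m))
            patterns.add(labeling)
    return len(patterns)

for m in [3, 5, 8, 12, 16]:
    pts = np.random.rand(m)
    count = interval_labelings(pts)
    print(f"m={m:2}: interval labelings = {count:4}  vs  2^m = {2**m:6}  "
          f"(Sauer bound with d=2: {1 + m + m*(m-1)//2})")
```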
Preview: The Sauer-Shelah Lemma
For a hypothesis class H with VC dimension d, when restricted to any m points:
|H restricted to m points| ≤ Σ_{i=0}^{d} C(m,i) ≤ (em/d)^d
This is polynomial in m (for fixed d), not exponential!
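A quick numerical check of the lemma's two sides against the naive 2^m count (a sketch; the (em/d)^d form assumes m ≥ d):

```python
from math import comb, e

def sauer_sum(m: int, d: int) -> int:
    # Left-hand side of the lemma: sum of binomial coefficients up to d
    return sum(comb(m, i) for i in range(min(d, m) + 1))

for d in [2, 5, 10]:
    for m in [20, 50, 100]:
        poly = (e * m / d) ** d
        print(f"d={d:2}, m={m:3}: Σ C(m,i) = {sauer_sum(m, d):>14,}  "
              f"≤ (em/d)^d ≈ {poly:,.0f}   (2^m = {2**m:.2e})")
```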
The New Bound:
Replacing |H| with this effective size in our analysis:
m = O((d/ε) log(d/ε) + (1/ε) log(1/δ))
This is the sample complexity for learning any class with VC dimension d—finite or infinite!
Examples of VC Dimension:
| Hypothesis Class | \|H\| | VC Dimension | Sample Complexity |
|-----------------|-----|--------------|-------------------|
| Intervals on ℝ | ∞ | 2 | O(1/ε log(1/ε)) |
| Rectangles in ℝ² | ∞ | 4 | O(1/ε log(1/ε)) |
| Linear classifiers ℝⁿ | ∞ | n+1 | O(n/ε log(n/ε)) |
| Halfspaces ℝⁿ | ∞ | n+1 | O(n/ε log(n/ε)) |
| Circles in ℝ² | ∞ | 3 | O(1/ε log(1/ε)) |
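Plugging the VC dimensions from the table into the new bound gives ballpark sample sizes; this sketch sets the unspecified constants in the O(·) to 1, so the absolute numbers are only illustrative:

```python
from math import log, ceil

def vc_sample_bound(d: int, epsilon: float, delta: float) -> int:
    # m = O((d/ε) log(d/ε) + (1/ε) log(1/δ)), with all constants set to 1
    return ceil((d / epsilon) * log(max(d / epsilon, 2)) + (1 / epsilon) * log(1 / delta))

classes = [
    ("Intervals on ℝ", 2),
    ("Circles in ℝ²", 3),
    ("Rectangles in ℝ²", 4),
    ("Linear classifiers in ℝ¹⁰", 11),
    ("Linear classifiers in ℝ¹⁰⁰", 101),
]
for name, d in classes:
    print(f"{name:30} d={d:3}  m ≈ {vc_sample_bound(d, 0.05, 0.05):8,}")
```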
Key Takeaway: VC dimension replaces log|H| as the complexity measure, enabling analysis of infinite classes!
We've developed a complete understanding of PAC learning for finite hypothesis classes. Here are the key takeaways:

- Realizable case: m = Θ((1/ε) ln|H|) samples are both sufficient and (up to the confidence term) necessary.
- Agnostic case: the dependence worsens to O((1/ε²) ln|H|), because error rates must be estimated, not just inconsistent hypotheses eliminated.
- The upper bound follows from a union bound over ε-bad hypotheses combined with (1-ε)^m ≤ e^{-εm}.
- Occam's Razor: constraining the hypothesis class (shorter descriptions, smaller |H|) directly improves generalization from limited data.
- Concrete classes such as conjunctions, k-DNF/k-CNF, and decision lists are both statistically and computationally tractable; DFAs are statistically easy but computationally hard without queries.
- For infinite classes, log|H| is replaced by the VC dimension, previewed above.
You now have a thorough understanding of PAC learning for finite hypothesis classes—tight bounds, proof techniques, and concrete examples. The next page explores computational complexity considerations: when is learning computationally tractable, and what makes some problems hard?