We've established that PAC learning is possible when the hypothesis class has finite VC dimension, and we've derived sample complexity bounds. Now we turn our attention to the special case of finite hypothesis classes—classes with finitely many distinct hypotheses.
Finite hypothesis classes are the 'easy case' of PAC learning theory: the sample complexity bounds are tight, and the proofs require nothing more than a union bound and a simple exponential inequality.
Understanding finite classes deeply will prepare us for the more sophisticated analysis of infinite classes via VC dimension.
By the end of this page, you will understand: (1) precise sample complexity bounds for finite classes, (2) the relationship between hypothesis class size and generalization, (3) concrete analysis of important finite classes like conjunctions and finite automata, (4) the Occam's Razor principle in learning theory, and (5) how finite class analysis motivates VC dimension.
Let's state the tight bounds for learning finite hypothesis classes.
Theorem (Finite Hypothesis Class PAC Learning):
Let H be a finite hypothesis class with |H| hypotheses. Then:
Upper Bound (Sufficiency): For the realizable case (∃ h* ∈ H with err_D(h*) = 0):
m ≥ (1/ε)(ln|H| + ln(1/δ))
samples suffice to learn any target in H to error ε with probability 1-δ.
Lower Bound (Necessity): Any learning algorithm requires:
m ≥ (1 - ε)(ln|H| - 1) / ln(1/(1-ε))
samples in the worst case. For small ε, this is approximately (1/ε) ln|H|.
The gap between upper and lower bounds is O(ln(1/δ))—just the confidence term. The dependence on |H| and ε is TIGHT: Θ((1/ε) ln|H|). This means we fully understand the sample complexity of finite hypothesis classes.
The Agnostic Case:
When we don't assume the target is in H (agnostic learning), the bound becomes:
m ≥ (1/(2ε²))(ln|H| + ln(2/δ))
Note the (1/ε²) instead of (1/ε). This quadratic dependence reflects the harder task: we're estimating error rates precisely, not just eliminating inconsistent hypotheses.
Comparison:
| Setting | Sample Complexity | Key Feature |
|---|---|---|
| Realizable | O((1/ε) ln\|H\|) | Eliminate hypotheses inconsistent with the data |
| Agnostic | O((1/ε²) ln\|H\|) | Estimate every hypothesis's error rate precisely |
```python
import numpy as np
from math import log, ceil

def sample_complexity_realizable(epsilon: float, delta: float, H_size: int) -> int:
    """
    Sample complexity for realizable PAC learning of finite class.
    m ≥ (1/ε)(ln|H| + ln(1/δ))
    """
    return ceil((1/epsilon) * (log(H_size) + log(1/delta)))

def sample_complexity_agnostic(epsilon: float, delta: float, H_size: int) -> int:
    """
    Sample complexity for agnostic PAC learning of finite class.
    m ≥ (1/(2ε²))(ln|H| + ln(2/δ))
    """
    return ceil((1/(2 * epsilon**2)) * (log(H_size) + log(2/delta)))

def lower_bound_realizable(epsilon: float, H_size: int) -> int:
    """
    Lower bound on sample complexity (any algorithm).
    m ≥ (1-ε)(ln|H| - 1) / ln(1/(1-ε))
    For small ε, approximately: m ≥ (1/ε) ln|H| * (1 - O(ε))
    """
    if epsilon >= 1:
        return 0
    return ceil((1 - epsilon) * (log(H_size) - 1) / log(1/(1-epsilon)))

def analyze_finite_class_bounds():
    """Comprehensive analysis of finite class bounds."""
    print("Finite Hypothesis Class: Sample Complexity Analysis")
    print("=" * 70)

    # Example 1: Varying hypothesis class size
    print("1. Effect of Hypothesis Class Size |H|")
    print("-" * 50)
    print(f"{'|H|':>12} | {'Realizable':>12} | {'Agnostic':>12} | {'Lower Bound':>12}")
    print("-" * 50)
    epsilon, delta = 0.05, 0.05
    for log_H in range(1, 8):
        H_size = 10 ** log_H
        real = sample_complexity_realizable(epsilon, delta, H_size)
        agn = sample_complexity_agnostic(epsilon, delta, H_size)
        lb = lower_bound_realizable(epsilon, H_size)
        print(f"{H_size:>12,} | {real:>12,} | {agn:>12,} | {lb:>12,}")

    # Example 2: Varying epsilon
    print("2. Effect of Accuracy Parameter ε")
    print("-" * 50)
    print(f"{'ε':>8} | {'Realizable':>12} | {'Agnostic':>12} | {'Ratio':>8}")
    print("-" * 50)
    H_size, delta = 10000, 0.05
    for epsilon in [0.2, 0.1, 0.05, 0.02, 0.01, 0.005]:
        real = sample_complexity_realizable(epsilon, delta, H_size)
        agn = sample_complexity_agnostic(epsilon, delta, H_size)
        ratio = agn / real if real > 0 else 0
        print(f"{epsilon:>8.3f} | {real:>12,} | {agn:>12,} | {ratio:>8.1f}x")
    # Key observation: agnostic/realizable ratio grows with 1/ε

    # Example 3: Concrete concept classes
    print("3. Concrete Concept Classes")
    print("-" * 50)
    n = 20  # 20 Boolean variables
    epsilon, delta = 0.05, 0.05
    classes = [
        ("Conjunctions", 3**n),                   # 3^n (each var: +, -, absent)
        ("k-CNF (k=3)", n**3 * 8),                # Rough: (n choose 3) * 8 clauses
        ("Decision stumps", 2*n),                 # 2n thresholds
        ("Decision trees (d=3)", (2*n)**(2**3)),  # Rough approximation
    ]
    for name, H_size in classes:
        real = sample_complexity_realizable(epsilon, delta, H_size)
        agn = sample_complexity_agnostic(epsilon, delta, H_size)
        print(f"{name:25} |H|={H_size:>15,} | Real={real:>8,} | Agn={agn:>10,}")

analyze_finite_class_bounds()
```

The proof of the finite hypothesis class bound is instructive—it uses the union bound technique, which is fundamental throughout learning theory.
Proof Strategy (Realizable Case):
We want to show: if m ≥ (1/ε)(ln|H| + ln(1/δ)), then with probability ≥ 1-δ, any consistent hypothesis h (err_S(h) = 0) has err_D(h) ≤ ε.
Step 1: Define 'bad' hypotheses
A hypothesis h is ε-bad if err_D(h) > ε. We want to bound the probability that any ε-bad hypothesis is consistent with training data.
Step 2: Bound probability for one bad hypothesis
For a fixed ε-bad hypothesis h, each i.i.d. training example agrees with h with probability at most 1 - ε, so Pr[h is consistent with all m examples] ≤ (1 - ε)^m.
We use the fundamental inequality: (1-x) ≤ e^{-x} for all x. This allows us to convert the probability bound into an exponential form that's easier to manipulate. For x = ε and m examples: (1-ε)^m ≤ e^{-εm}.
Step 3: Apply union bound over all bad hypotheses
Let B ⊆ H be the set of ε-bad hypotheses. Then:
Pr[any bad h is consistent] ≤ Σ_{h ∈ B} Pr[h consistent] ≤ |B| · e^{-εm} ≤ |H| · e^{-εm}
Step 4: Set failure probability ≤ δ
We want |H| · e^{-εm} ≤ δ.
Solving:

e^{-εm} ≤ δ/|H|
-εm ≤ ln(δ/|H|)
m ≥ (1/ε) ln(|H|/δ)
m ≥ (1/ε)(ln|H| + ln(1/δ)) ✓
```python
import numpy as np
from typing import List, Tuple

def verify_pac_bound_empirically(H_size: int, epsilon: float, delta: float,
                                 n_trials: int = 1000) -> dict:
    """
    Empirically verify the PAC bound for finite hypothesis classes.

    Setup:
    - H contains H_size hypotheses (represented as random bit strings)
    - Target h* is one of them (picked randomly)
    - Distribution D is uniform over instance space
    - We verify that the bound holds with probability ≥ 1-δ
    """
    from math import log, ceil

    # Compute required sample size
    m = ceil((1/epsilon) * (log(H_size) + log(1/delta)))

    # Simulate learning
    successes = 0
    for trial in range(n_trials):
        # Create hypotheses as random functions on m+100 points
        # Each hypothesis: random binary string of length m+100
        n_points = m + 100  # Extra for test set
        hypotheses = np.random.randint(0, 2, (H_size, n_points))

        # Pick target (one of the hypotheses)
        target_idx = np.random.randint(0, H_size)
        target = hypotheses[target_idx]

        # Generate training sample (first m points)
        train_labels = target[:m]

        # Find consistent hypotheses
        consistent_mask = np.all(hypotheses[:, :m] == train_labels, axis=1)
        consistent_indices = np.where(consistent_mask)[0]

        # Pick any consistent hypothesis (in practice, ERM picks one)
        if len(consistent_indices) == 0:
            # This shouldn't happen if target is in H
            continue
        picked_idx = np.random.choice(consistent_indices)
        picked = hypotheses[picked_idx]

        # Evaluate on test points (remaining 100 points)
        test_errors = np.sum(picked[m:] != target[m:])
        test_error_rate = test_errors / 100

        # Check if error ≤ epsilon
        if test_error_rate <= epsilon:
            successes += 1

    success_rate = successes / n_trials
    return {
        'H_size': H_size,
        'epsilon': epsilon,
        'delta': delta,
        'm_required': m,
        'empirical_success_rate': success_rate,
        'target_success_rate': 1 - delta,
        'bound_holds': success_rate >= 1 - delta - 0.05  # Allow small slack
    }

def demonstrate_union_bound():
    """
    Step-by-step demonstration of the union bound proof.
    """
    print("Union Bound Proof Walkthrough")
    print("=" * 60)

    # Parameters
    H_size = 1000
    epsilon = 0.1
    delta = 0.05

    # Step 1: One bad hypothesis
    print("Step 1: Probability one ε-bad hypothesis survives m samples")
    for m in [10, 20, 50, 100]:
        p_survive = (1 - epsilon) ** m
        p_survive_exp = np.exp(-epsilon * m)
        print(f"  m={m}: (1-ε)^m = {p_survive:.2e}, e^{{-εm}} = {p_survive_exp:.2e}")

    # Step 2: Union bound
    print(f"Step 2: Union bound over |H| = {H_size} hypotheses")
    for m in [10, 20, 50, 100]:
        p_any_bad = H_size * np.exp(-epsilon * m)
        print(f"  m={m}: Pr[any bad survives] ≤ {p_any_bad:.4f}")

    # Step 3: Solve for m
    print(f"Step 3: Required m for δ = {delta}")
    m_required = int(np.ceil((1/epsilon) * (np.log(H_size) + np.log(1/delta))))
    p_fail = H_size * np.exp(-epsilon * m_required)
    print(f"  m = (1/ε)(ln|H| + ln(1/δ)) = {m_required}")
    print(f"  Pr[failure] ≤ |H| e^{{-εm}} = {p_fail:.4f} ≤ δ = {delta} ✓")

    # Empirical verification
    print("Step 4: Empirical Verification")
    print("-" * 40)
    for H_size in [100, 500, 1000]:
        result = verify_pac_bound_empirically(H_size, 0.15, 0.10, n_trials=500)
        print(f"  |H|={H_size:4}: m={result['m_required']:3}, "
              f"success={result['empirical_success_rate']:.2f} "
              f"(target ≥{result['target_success_rate']:.2f})")

demonstrate_union_bound()
```

The sample complexity bound m = O((1/ε) ln|H|) reveals a deep principle: smaller hypothesis classes require fewer samples. This is the theoretical justification for Occam's Razor in machine learning.
Occam's Razor Principle:
Among hypotheses consistent with the data, prefer the simplest one—not for aesthetic reasons, but because simpler hypothesis classes generalize better with less data.
The Learning-Theoretic Interpretation:
If hypothesis class H₁ ⊂ H₂ (H₁ is simpler), then ln|H₁| < ln|H₂|: a consistent hypothesis found in H₁ comes with a stronger generalization guarantee, or equivalently needs fewer samples for the same guarantee, than one found in H₂.
This isn't about preferring simple hypotheses per se. It's about the search space. If you search over decision trees of depth 3 vs depth 20, the depth-3 space is smaller, so consistent hypotheses have better generalization guarantees. The 'simplicity' comes from constraining the hypothesis class, not from regularization.
Formal Statement: The Occam Learning Algorithm
Definition: An algorithm A is an Occam algorithm for concept class C using hypothesis class H if, given any sample S labeled by a concept c ∈ C, it runs in polynomial time and outputs a hypothesis h ∈ H that is consistent with S and has a succinct description: its length is polynomial in the size of c and sublinear in the number of examples |S|.
Theorem (Occam's Razor for Learning):
If A is an Occam algorithm that outputs hypotheses of length at most k bits, then with probability ≥ 1-δ:
err_D(A(S)) ≤ ε when m ≥ (k + ln(1/δ)) / ε
Key Insight: The effective hypothesis class size is 2^k (hypotheses of length k)—smaller is better!
```python
import numpy as np
from math import log, ceil

def occam_sample_complexity(description_length: int, epsilon: float, delta: float) -> int:
    """
    Sample complexity from Occam's Razor theorem.
    If hypothesis has description length k bits,
    m ≥ (k + ln(1/δ)) / ε suffices.
    """
    return ceil((description_length + log(1/delta)) / epsilon)

def analyze_occam_effect():
    """
    Demonstrate how smaller hypothesis classes generalize better.
    """
    print("Occam's Razor in PAC Learning")
    print("=" * 60)
    epsilon, delta = 0.05, 0.05

    # Compare decision trees of different depths
    print("1. Decision Trees: Depth vs Sample Complexity")
    print("-" * 50)
    n = 20  # features
    for depth in [1, 2, 3, 4, 5, 10]:
        # Decision tree description: O(2^d * log n) bits
        description_length = (2 ** depth) * int(np.ceil(np.log2(n)))
        m = occam_sample_complexity(description_length, epsilon, delta)
        print(f"  Depth {depth:2}: Description ~{description_length:4} bits, "
              f"Samples needed: {m:,}")

    # Polynomial vs exponential separation
    print("2. Linear vs Exponential Hypothesis Classes")
    print("-" * 50)
    for n in [10, 20, 50, 100]:
        # Linear separator: O(n) bits
        lin_bits = n * 32  # 32-bit weights
        m_lin = occam_sample_complexity(lin_bits, epsilon, delta)

        # Decision tree of depth n/4: O(2^(n/4) * log n) bits
        tree_bits = (2 ** (n//4)) * int(np.ceil(np.log2(n)))
        m_tree = occam_sample_complexity(tree_bits, epsilon, delta)

        print(f"  n={n:3}: Linear ({lin_bits} bits) → {m_lin:,} samples")
        print(f"          Tree d={n//4} ({tree_bits} bits) → {m_tree:,} samples")

    # Sparsity effect
    print("3. Effect of Sparsity (k sparse features out of n)")
    print("-" * 50)
    n = 100
    for k in [1, 5, 10, 20, 50, 100]:
        # k-sparse linear: O(k log n + k * 32) bits
        sparse_bits = k * (int(np.ceil(np.log2(n))) + 32)
        m_sparse = occam_sample_complexity(sparse_bits, epsilon, delta)
        print(f"  k={k:3} out of {n}: {sparse_bits:5} bits → {m_sparse:,} samples")

analyze_occam_effect()

# Key insight:
# - Depth-3 tree needs ~1000 samples
# - Depth-10 tree needs ~20000 samples
# - 20x more data for modest increase in expressivity!
```

Let's conduct a detailed analysis of PAC learning for Boolean conjunctions—one of the most well-understood concept classes.
Problem Setup: the instance space is X = {0,1}ⁿ, and the concept class consists of conjunctions of literals over x₁, …, xₙ. Each variable can appear positively, appear negated, or be absent, so |H| = 3ⁿ.
The Learning Algorithm (Find-S for Conjunctions): start with the conjunction of all 2n literals (x₁ ∧ ¬x₁ ∧ … ∧ xₙ ∧ ¬xₙ); for each positive training example, delete every literal the example violates; ignore negative examples; output the conjunction of the surviving literals.
This algorithm finds the most specific consistent conjunction. It never removes a literal needed by the target (target positives satisfy that literal). It might keep extra literals that happen to be satisfied by all observed positives. Errors occur only when these extra literals cause false negatives.
Sample Complexity Analysis:
From the general bound: m ≥ (1/ε)(ln|H| + ln(1/δ)) = (1/ε)(n ln 3 + ln(1/δ))
Tighter Analysis: We can also bound the error by analyzing Find-S directly. Call a literal z bad if z is not in the target but Pr_D[x is a positive example that violates z] ≥ ε/(2n). Any positive example that violates z removes it, so a bad literal survives m i.i.d. examples with probability at most (1 - ε/(2n))^m ≤ e^{-mε/(2n)}. A union bound over the at most 2n candidate literals shows that once m ≥ (2n/ε)(ln(2n) + ln(1/δ)), with probability ≥ 1-δ every bad literal has been removed, and then err_D(h) ≤ 2n · ε/(2n) = ε.

Result: m = O((n/ε) log(n/δ)) samples suffice, matching the general bound up to logarithmic factors.
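For comparison, this small sketch evaluates the generic finite-class bound and the bound from the direct Find-S analysis side by side (constants as derived above, not optimized; the helper names are ours):

```python
from math import log, ceil

def generic_bound(n: int, epsilon: float, delta: float) -> int:
    # Generic finite-class bound with |H| = 3^n: m >= (1/eps)(n ln 3 + ln(1/delta))
    return ceil((1 / epsilon) * (n * log(3) + log(1 / delta)))

def direct_bound(n: int, epsilon: float, delta: float) -> int:
    # Direct Find-S analysis: union bound over the <= 2n candidate literals;
    # each "bad" literal (error contribution >= eps/(2n)) survives m samples
    # with probability <= (1 - eps/(2n))^m, giving m >= (2n/eps)(ln(2n) + ln(1/delta)).
    return ceil((2 * n / epsilon) * (log(2 * n) + log(1 / delta)))

for n in [10, 20, 50, 100]:
    print(f"n={n:4}: generic bound = {generic_bound(n, 0.05, 0.05):8,}, "
          f"direct bound = {direct_bound(n, 0.05, 0.05):8,}")
```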
```python
import numpy as np
from typing import Set, List, Tuple
from math import log, ceil

class ConjunctionLearner:
    """
    PAC learner for Boolean conjunctions using Find-S algorithm.
    """
    def __init__(self, n: int):
        self.n = n
        # Literals: (var_index, is_positive)
        # Start with all possible literals
        self.literals: Set[Tuple[int, bool]] = set()
        for i in range(n):
            self.literals.add((i, True))   # xi
            self.literals.add((i, False))  # ¬xi

    def learn(self, samples: List[Tuple[np.ndarray, int]]):
        """Learn from labeled samples."""
        # Process only positive examples
        for x, label in samples:
            if label == 1:
                for i in range(self.n):
                    if x[i] == 0:
                        # xi = 0, so remove positive literal xi
                        self.literals.discard((i, True))
                    else:
                        # xi = 1, so remove negative literal ¬xi
                        self.literals.discard((i, False))
        return self.literals

    def predict(self, x: np.ndarray) -> int:
        """Predict label for instance x."""
        for var, is_positive in self.literals:
            if is_positive and x[var] == 0:
                return 0
            if not is_positive and x[var] == 1:
                return 0
        return 1

    def to_string(self) -> str:
        """String representation of learned conjunction."""
        if not self.literals:
            return "TRUE (empty conjunction)"
        terms = []
        for var, is_positive in sorted(self.literals):
            terms.append(f"x{var}" if is_positive else f"¬x{var}")
        return " ∧ ".join(terms)

def analyze_conjunction_learning():
    """Comprehensive analysis of conjunction learning."""
    np.random.seed(42)
    print("Conjunction Learning Analysis")
    print("=" * 60)
    n = 10  # 10 Boolean variables

    # Target conjunction: x0 ∧ x3 ∧ ¬x5
    target_literals = {(0, True), (3, True), (5, False)}

    def true_concept(x: np.ndarray) -> int:
        for var, is_positive in target_literals:
            if is_positive and x[var] == 0:
                return 0
            if not is_positive and x[var] == 1:
                return 0
        return 1

    # Learning experiments with varying sample sizes
    print(f"Target: x0 ∧ x3 ∧ ¬x5")
    print(f"Hypothesis class size: |H| = 3^{n} = {3**n:,}")
    print("Sample Size vs Learning Success")
    print("-" * 50)
    sample_sizes = [5, 10, 20, 50, 100, 200, 500]
    n_trials = 100
    for m in sample_sizes:
        successes = 0
        avg_extra_literals = 0
        for trial in range(n_trials):
            # Generate training data
            samples = []
            for _ in range(m):
                x = np.random.randint(0, 2, n)
                label = true_concept(x)
                samples.append((x, label))

            # Learn
            learner = ConjunctionLearner(n)
            learned = learner.learn(samples)

            # Count extra literals (not in target)
            extra = learned - target_literals
            avg_extra_literals += len(extra)

            # Estimate generalization error
            test_errors = 0
            n_test = 1000
            for _ in range(n_test):
                x = np.random.randint(0, 2, n)
                true_label = true_concept(x)
                pred_label = learner.predict(x)
                if true_label != pred_label:
                    test_errors += 1

            if test_errors / n_test <= 0.05:  # ε = 0.05
                successes += 1

        success_rate = successes / n_trials
        avg_extra = avg_extra_literals / n_trials

        # Theoretical bound
        theoretical_m = ceil((1/0.05) * (n * log(3) + log(20)))  # δ = 0.05
        print(f"m={m:4}: Success rate={success_rate:.2f}, "
              f"Avg extra literals={avg_extra:.1f}, "
              f"(Theory: m≥{theoretical_m})")

analyze_conjunction_learning()
```

Beyond conjunctions, several finite concept classes are important for theory and practice.
1. k-DNF Formulas (Fixed k)
A k-DNF is a disjunction (OR) of terms, where each term is a conjunction (AND) of at most k literals.
Example (2-DNF): (x₁ ∧ x₂) ∨ (x₃ ∧ ¬x₄) ∨ x₅
There are O((2n)ᵏ) possible terms, so ln|H| = O(nᵏ): sample complexity is polynomial in n for fixed k, but exponential when k = Θ(n).
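As a back-of-the-envelope check (a sketch with rough constants; the helper names are ours), one can count the possible terms and plug ln|H| ≤ (#terms)·ln 2 into the generic bound:

```python
from math import comb, log, ceil

def num_kdnf_terms(n: int, k: int) -> int:
    # Terms = conjunctions of at most k literals over n variables:
    # choose i variables and a sign for each, for i = 1..k
    return sum(comb(n, i) * (2 ** i) for i in range(1, k + 1))

def kdnf_sample_bound(n: int, k: int, epsilon: float, delta: float) -> int:
    # A k-DNF is a subset of the possible terms, so |H| <= 2^(#terms)
    # and ln|H| <= (#terms) * ln 2
    ln_H = num_kdnf_terms(n, k) * log(2)
    return ceil((1 / epsilon) * (ln_H + log(1 / delta)))

for n in [10, 20, 50]:
    for k in [1, 2, 3]:
        print(f"n={n:3}, k={k}: terms={num_kdnf_terms(n, k):8,}, "
              f"m ≈ {kdnf_sample_bound(n, k, 0.05, 0.05):12,}")
```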
2. Decision Lists
A decision list is an if-then-else structure: "if ℓ₁ then output b₁, else if ℓ₂ then output b₂, …, else output b_{r+1}",
where each ℓᵢ is a literal (xⱼ or ¬xⱼ) and bᵢ ∈ {0,1}.
Key Property: any 1-decision list over n variables can be trimmed so that each literal is tested at most once, giving at most 2n decisions plus a default output. Decision lists strictly generalize both conjunctions and disjunctions, and k-decision lists (whose tests are conjunctions of at most k literals) contain both k-DNF and k-CNF.
Decision lists hit a sweet spot: they are noticeably more expressive than conjunctions and disjunctions, yet have polynomial-sized representations that are efficiently learnable. Rivest's greedy algorithm picks a 'pure' literal at each step (one whose satisfying examples all carry the same label), outputs that label for it, removes the covered examples, and recurses, as sketched below.
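To make the greedy idea concrete, here is a minimal sketch of a Rivest-style learner for 1-decision lists (function and variable names are ours, and it assumes the sample is consistent with some 1-decision list):

```python
import numpy as np
from typing import List, Optional, Tuple

Rule = Tuple[Optional[Tuple[int, int]], int]  # ((variable, required value) or None for default, output)

def learn_decision_list(X: np.ndarray, y: np.ndarray) -> List[Rule]:
    """Greedy (Rivest-style) learner for 1-decision lists over Boolean features."""
    n = X.shape[1]
    remaining = np.ones(len(y), dtype=bool)
    rules: List[Rule] = []
    while remaining.any():
        labels_left = np.unique(y[remaining])
        if len(labels_left) == 1:
            rules.append((None, int(labels_left[0])))  # default rule finishes the list
            return rules
        found = False
        for var in range(n):
            for value in (0, 1):  # value=1 tests literal x_var, value=0 tests ¬x_var
                covered = remaining & (X[:, var] == value)
                if covered.any() and len(np.unique(y[covered])) == 1:
                    # 'Pure' literal: every remaining example it covers shares one label
                    rules.append(((var, value), int(y[covered][0])))
                    remaining &= ~covered  # discard covered examples, recurse on the rest
                    found = True
                    break
            if found:
                break
        if not found:
            raise ValueError("no consistent 1-decision list for this sample")
    rules.append((None, 0))  # arbitrary default if every example was covered
    return rules

def predict(rules: List[Rule], x: np.ndarray) -> int:
    for test, label in rules:
        if test is None or x[test[0]] == test[1]:
            return label
    return 0

# Usage on a target that is itself a decision list: "if x0 then 1, elif ¬x2 then 0, else 1"
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 4))
y = np.where(X[:, 0] == 1, 1, np.where(X[:, 2] == 0, 0, 1))
rules = learn_decision_list(X, y)
print(rules)
print("training accuracy:", np.mean(np.array([predict(rules, x) for x in X]) == y))
```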
3. Finite Automata (DFA)
A Deterministic Finite Automaton with s states over alphabet Σ is specified by a transition function δ: Q × Σ → Q, a start state, and a set of accepting states; it labels a string 1 exactly when processing the string ends in an accepting state. There are at most s^(s|Σ|) · 2^s · s such machines, so ln|H| = O(s log s) for a fixed alphabet.
Famous Results: DFAs are not efficiently PAC-learnable from random examples alone; even finding an approximately minimum consistent DFA is NP-hard, which rules out efficient proper learning. But they ARE learnable with membership and equivalence queries via Angluin's L* algorithm (1987)—the learner can ask 'is string w in the language?' Interactive learning expands what's tractable.
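Information-theoretically the finite-class bound still applies; the obstacle is purely computational. A small sketch (using the rough count of s-state DFAs given above; helper names are ours) makes that concrete:

```python
from math import log, ceil

def dfa_log_class_size(s: int, alphabet_size: int) -> float:
    # Transitions: s^(s*|Sigma|) choices; accepting set: 2^s choices (start state ignored)
    return s * alphabet_size * log(s) + s * log(2)

def dfa_sample_bound(s: int, alphabet_size: int, epsilon: float, delta: float) -> int:
    return ceil((1 / epsilon) * (dfa_log_class_size(s, alphabet_size) + log(1 / delta)))

for s in [5, 10, 50, 100]:
    print(f"s={s:4}: ln|H| ≈ {dfa_log_class_size(s, 2):10.1f}, "
          f"m ≈ {dfa_sample_bound(s, 2, 0.05, 0.05):10,}")
```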
| Class | \|H\| (approx) | Sample Complexity | Efficiently Learnable? |
|---|---|---|---|
| Conjunctions | 3ⁿ | O(n/ε log n) | ✅ Yes |
| k-DNF (fixed k) | 2^(O(n^k)) | O(n^k/ε log n) | ✅ Yes (fixed k) |
| k-CNF (fixed k) | 2^(O(n^k)) | O(n^k/ε log n) | ✅ Yes (fixed k) |
| Decision lists | 2^(O(n log n)) | O(n/ε log n) | ✅ Yes (greedy) |
| Decision trees (depth d) | 2^(2^d) n^(2^d) | O(2^d log n / ε) | ✅ if d small |
| s-state DFA | s^O(s) | O(s/ε log s) | ❌ No (w/o queries) |
```python
import numpy as np
from typing import List, Tuple
from math import log, ceil, comb

def calculate_class_sizes():
    """Calculate hypothesis class sizes for various concept classes."""
    print("Hypothesis Class Size Analysis")
    print("=" * 60)
    n_values = [10, 20, 50, 100]
    for n in n_values:
        print(f"n = {n} Boolean variables:")
        print("-" * 40)

        # Conjunctions: 3^n
        conj_size = 3 ** n
        print(f"  Conjunctions: 3^{n} = 10^{n * np.log10(3):.1f}")

        # 2-DNF: at most n^2 terms, each chosen or not
        # Actually: 2^(O(n^2)) possible formulas
        two_dnf_terms = 4 * comb(n, 2) + 2*n + 1  # pairs with signs + singles
        print(f"  2-DNF terms: {two_dnf_terms} possible terms")

        # Decision lists: ordered subsets of 2n literals
        # Bound: (2n)! = 10^{O(n log n)}
        dl_bound = 2 * n * np.log10(2 * n)
        print(f"  Decision lists: ≤ (2n)! ≈ 10^{dl_bound:.1f}")

        # Decision trees depth d
        for d in [2, 3, 4]:
            # Each internal node: n choices
            # 2^d internal nodes, 2^(d+1) leaves
            dt_size = (n ** (2**d)) * (2 ** (2**(d+1)))
            dt_log = (2**d) * np.log10(n) + (2**(d+1)) * np.log10(2)
            print(f"  Decision trees d={d}: ≈ 10^{dt_log:.1f}")

def sample_complexity_comparison():
    """Compare sample complexity across concept classes."""
    print("=" * 60)
    print("Sample Complexity Comparison (ε=0.05, δ=0.05)")
    print("=" * 60)
    epsilon, delta = 0.05, 0.05
    for n in [10, 20, 50]:
        print(f"n = {n}:")
        print("-" * 40)

        # Conjunctions
        H_conj = 3 ** n
        m_conj = ceil((1/epsilon) * (log(H_conj) + log(1/delta)))
        m_conj_better = ceil((1/epsilon) * (n * log(3) + log(1/delta)))
        print(f"  Conjunctions: m ≈ {m_conj_better:,}")

        # 2-term DNF
        H_2term = (3**n) ** 2  # Crude bound: two conjunctions
        m_2term = ceil((1/epsilon) * (2 * n * log(3) + log(1/delta)))
        print(f"  2-term DNF: m ≈ {m_2term:,}")

        # Decision lists (bound by n log n bits)
        bits_dl = 4 * n * ceil(log(2*n))  # Each entry: position + label
        m_dl = ceil((bits_dl + log(1/delta)) / epsilon)
        print(f"  Decision lists: m ≈ {m_dl:,}")

        # Decision tree depth 3
        d = 3
        bits_tree = (2**d - 1) * ceil(log(n)) + (2**d) * 1  # Internal + leaves
        m_tree = ceil((bits_tree + log(1/delta)) / epsilon)
        print(f"  Trees (d=3): m ≈ {m_tree:,}")

calculate_class_sizes()
sample_complexity_comparison()
```

Finite hypothesis class theory is elegant and complete, but many important hypothesis classes are infinite: intervals on ℝ, rectangles in ℝ², linear classifiers and halfspaces in ℝⁿ, circles in ℝ², and many more.
For infinite classes, |H| = ∞, so the bound m = O(log|H|/ε) is vacuous.
The Question: Can we extend our theory to infinite classes?
Even if H is infinite, when we restrict attention to a finite set of m points, only finitely many distinct labelings are possible. The 'effective size' of H on m points might be much smaller than |H|. VC dimension captures exactly this notion of 'effective complexity.'
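To see this concretely, here is a small sketch (construction and names are ours) that counts the labelings induced on m points by the infinite class of closed intervals on ℝ; the count grows like m², not 2^m:

```python
import numpy as np

def interval_labelings(points: np.ndarray) -> int:
    """Count distinct labelings of the given 1-D points by closed intervals [a, b]."""
    pts = np.sort(points)
    m = len(pts)
    patterns = set()
    patterns.add((0,) * m)  # an interval containing no points labels everything 0
    for i in range(m):
        for j in range(i, m):
            # An interval covering exactly the sorted points i..j
            labeling = tuple(1 if i <= k <= j else 0 for k in range(m))
            patterns.add(labeling)
    return len(patterns)

for m in [3, 5, 8, 12, 16]:
    pts = np.random.rand(m)
    count = interval_labelings(pts)
    print(f"m={m:2}: interval labelings = {count:4}  vs  2^m = {2**m:6}  "
          f"(Sauer bound with d=2: {1 + m + m*(m-1)//2})")
```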
Preview: The Sauer-Shelah Lemma
For a hypothesis class H with VC dimension d, when restricted to any m points:
|H restricted to m points| ≤ Σ_{i=0}^{d} C(m,i) ≤ (em/d)^d
This is polynomial in m (for fixed d), not exponential!
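A quick numerical check of the lemma's two sides against the naive 2^m count (a sketch; the (em/d)^d form assumes m ≥ d):

```python
from math import comb, e

def sauer_sum(m: int, d: int) -> int:
    # Left-hand side of the lemma: sum of binomial coefficients up to d
    return sum(comb(m, i) for i in range(min(d, m) + 1))

for d in [2, 5, 10]:
    for m in [20, 50, 100]:
        poly = (e * m / d) ** d
        print(f"d={d:2}, m={m:3}: Σ C(m,i) = {sauer_sum(m, d):>14,}  "
              f"≤ (em/d)^d ≈ {poly:,.0f}   (2^m = {2**m:.2e})")
```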
The New Bound:
Replacing |H| with this effective size in our analysis:
m = O((d/ε) log(d/ε) + (1/ε) log(1/δ))
This is the sample complexity for learning any class with VC dimension d—finite or infinite!
Examples of VC Dimension:
| Hypothesis Class | \|H\| | VC Dimension | Sample Complexity |
|-----------------|-----|--------------|-------------------|
| Intervals on ℝ | ∞ | 2 | O(1/ε log(1/ε)) |
| Rectangles in ℝ² | ∞ | 4 | O(1/ε log(1/ε)) |
| Linear classifiers ℝⁿ | ∞ | n+1 | O(n/ε log(n/ε)) |
| Halfspaces ℝⁿ | ∞ | n+1 | O(n/ε log(n/ε)) |
| Circles in ℝ² | ∞ | 3 | O(1/ε log(1/ε)) |
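Plugging the VC dimensions from the table into the new bound gives ballpark sample sizes; this sketch sets the unspecified constants in the O(·) to 1, so the absolute numbers are only illustrative:

```python
from math import log, ceil

def vc_sample_bound(d: int, epsilon: float, delta: float) -> int:
    # m = O((d/ε) log(d/ε) + (1/ε) log(1/δ)), with all constants set to 1
    return ceil((d / epsilon) * log(max(d / epsilon, 2)) + (1 / epsilon) * log(1 / delta))

classes = [
    ("Intervals on ℝ", 2),
    ("Circles in ℝ²", 3),
    ("Rectangles in ℝ²", 4),
    ("Linear classifiers in ℝ¹⁰", 11),
    ("Linear classifiers in ℝ¹⁰⁰", 101),
]
for name, d in classes:
    print(f"{name:30} d={d:3}  m ≈ {vc_sample_bound(d, 0.05, 0.05):8,}")
```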
Key Takeaway: VC dimension replaces log|H| as the complexity measure, enabling analysis of infinite classes!
We've developed a complete understanding of PAC learning for finite hypothesis classes. Here are the key takeaways:

- Realizable case: m = Θ((1/ε) ln|H|) samples are both sufficient and (up to the confidence term) necessary.
- Agnostic case: the dependence worsens to O((1/ε²) ln|H|), because error rates must be estimated, not just inconsistent hypotheses eliminated.
- The upper bound follows from a union bound over ε-bad hypotheses combined with (1-ε)^m ≤ e^{-εm}.
- Occam's Razor: constraining the hypothesis class (shorter descriptions, smaller |H|) directly improves generalization from limited data.
- Concrete classes such as conjunctions, k-DNF/k-CNF, and decision lists are both statistically and computationally tractable; DFAs are statistically easy but computationally hard without queries.
- For infinite classes, log|H| is replaced by the VC dimension, previewed above.
You now have a thorough understanding of PAC learning for finite hypothesis classes—tight bounds, proof techniques, and concrete examples. The next page explores computational complexity considerations: when is learning computationally tractable, and what makes some problems hard?