When we train a machine learning model on a dataset, we're implicitly making a profound leap of faith. We observe the model's performance on training data—a finite sample from some unknown distribution—and then trust that this performance will generalize to unseen examples drawn from the same distribution.
But how can we justify this leap? Why should low training error imply low error on new data? Is there a principled way to quantify our confidence in a learned model's predictions?
These are not merely philosophical questions. They strike at the heart of what makes machine learning possible—or impossible—in any given setting. The Probably Approximately Correct (PAC) learning framework, introduced by Leslie Valiant in 1984, provides the first rigorous mathematical answer to these fundamental questions.
By the end of this page, you will understand the PAC learning framework's core components: the formal learning model, the meanings of 'probably' and 'approximately,' the role of the unknown target distribution, and how these elements combine to provide the first theoretical guarantees for machine learning. This framework will become your lens for understanding all subsequent generalization theory.
Before PAC learning, machine learning existed primarily as an empirical discipline. Researchers developed algorithms—perceptrons, decision trees, early neural networks—and demonstrated their effectiveness through experiments. But there was no rigorous framework to answer fundamental questions: How much data does learning require? Which concept classes are learnable at all? When does low training error imply low error on new data?
These questions matter immensely in practice. Without theoretical grounding, we have no way to distinguish between problems that are hard due to insufficient data, hard due to computational limitations, or fundamentally impossible to learn.
Leslie Valiant, working at Harvard, recognized that learning could be formalized as a computational problem. His 1984 paper 'A Theory of the Learnable' introduced PAC learning and earned him the Turing Award in 2010. Valiant's key insight was that learning should be judged not just by whether it succeeds, but by whether it succeeds efficiently—both in terms of sample size and computation time.
The pre-PAC landscape:
Perceptron Convergence Theorem (1962): Rosenblatt proved that perceptrons converge when data is linearly separable—but said nothing about generalization to unseen data.
Statistical Decision Theory: Provided tools for analyzing estimation problems, but typically assumed the model class was known to contain the true distribution—too strong an assumption for learning.
Computational Learning (ad hoc): Various algorithms were analyzed individually, but there was no unifying framework to compare them or understand their limitations.
PAC learning unified these threads, providing a single framework in which generalization, sample size, and computational efficiency can all be analyzed together.
To make precise statements about learning, we must define our terms rigorously. The PAC framework introduces a formal model with several key components:
The Instance Space X:
We denote by X the set of all possible instances (examples, inputs). In a spam classification problem, X might be the set of all possible emails. In image recognition, X could be the set of all possible images of a fixed resolution. Formally, X is often taken to be {0, 1}ⁿ (binary vectors of length n) or ℝⁿ (real-valued vectors), but the theory extends to arbitrary instance spaces.
The original PAC framework focused on X = {0, 1}ⁿ because it allows clean combinatorial analysis. Each instance is a bit vector of length n, and concepts are subsets of {0, 1}ⁿ. This makes bounds on hypothesis class size and sample complexity concrete and computable. The framework extends naturally to continuous domains.
The Concept Class C:
A concept is a function c: X → {0, 1} that assigns a binary label to every instance. We can equivalently think of a concept as a subset of X—the set of instances labeled 1 (positive examples).
A concept class C is a collection of concepts we're trying to learn—for example, conjunctions over Boolean variables, axis-aligned rectangles in the plane, or linear separators in ℝⁿ (all three appear in the code below).
The choice of concept class C is crucial—it determines what can be learned and with what sample complexity.
```python
# Example concept classes and instances
import numpy as np

# Instance space: binary vectors of length n = 4
# Each instance is a 4-bit vector
n = 4
instance_space_size = 2**n  # 16 possible instances

# Example concept: conjunction (x1 AND NOT x2)
# Returns 1 iff x[0] = 1 and x[1] = 0
def conjunction_concept(x):
    """A concept from the class of conjunctions."""
    return int(x[0] == 1 and x[1] == 0)

# Example concept: axis-aligned rectangle in R^2
# Returns 1 iff point (x, y) is inside the rectangle [a, b] x [c, d]
class AxisAlignedRectangle:
    def __init__(self, a, b, c, d):
        """Rectangle with x in [a, b] and y in [c, d]."""
        self.a, self.b, self.c, self.d = a, b, c, d

    def __call__(self, point):
        x, y = point
        return int(self.a <= x <= self.b and self.c <= y <= self.d)

# Example concept: linear separator in R^n
# Returns 1 iff w·x + b >= 0
class LinearSeparator:
    def __init__(self, weights, bias):
        """Hyperplane defined by w·x + b >= 0."""
        self.w = np.array(weights)
        self.b = bias

    def __call__(self, x):
        return int(np.dot(self.w, x) + self.b >= 0)

# The concept class C is the SET of all such concepts:
# |C_conjunctions| = 3^n (each variable: positive, negative, or absent)
# |C_rectangles| = infinite (but VC dimension = 4)
# |C_linear_separators| = infinite (VC dimension = n + 1)
```

The Target Distribution D:
In PAC learning, we assume there exists an unknown probability distribution D over the instance space X. This distribution governs how training examples are generated and how the learned hypothesis is ultimately evaluated.
Crucially, the same distribution D is used for both training and testing. This is a fundamental assumption—if the test distribution differs from training, all bets are off. This assumption is sometimes called the i.i.d. (independent and identically distributed) assumption.
The distribution D is unknown to the learner. We cannot assume D is uniform, or normal, or any particular form. The theory must work for any distribution D—this is what makes PAC learning 'distribution-free.'
Distribution-free learning is both a strength and a limitation. It means PAC bounds hold for ANY distribution—no assumptions needed. But it also means bounds can be loose for 'nice' distributions. This is why PAC bounds are often worst-case bounds. In practice, when we know something about the distribution, tighter analysis may be possible.
The Target Concept c* ∈ C:
We assume there exists a target concept c* that belongs to the concept class C. This concept perfectly labels all instances—it represents the 'ground truth' we're trying to learn.
The assumption that c* ∈ C is called the realizability assumption or consistent hypothesis assumption. It means the target concept is expressible within our chosen concept class.
Note: This is a strong assumption. In reality, our concept class C might not contain the true labeling function. Later, we'll discuss agnostic PAC learning, which relaxes this assumption.
With our components defined, we can now specify exactly how learning proceeds:
Step 1: Sampling
The learner receives a training set S = {(x₁, c*(x₁)), (x₂, c*(x₂)), ..., (xₘ, c*(xₘ))}
where each xᵢ is drawn independently from the distribution D, and labels come from the target concept c*. The number of samples m is a key parameter—we want to understand how m must grow with our requirements.
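A minimal sketch of this sampling step (the helper `draw_sample`, the uniform distribution, and the toy target below are illustrative assumptions, not part of the formal model):

```python
import numpy as np

def draw_sample(m, sample_from_D, c_star):
    """Draw m i.i.d. instances from D and label each one with the target c*."""
    return [(x, c_star(x)) for x in (sample_from_D() for _ in range(m))]

# Example usage: D uniform over {0,1}^5, target c* = x1 AND x3 (indices 0 and 2)
rng = np.random.default_rng(0)

def sample_from_D():
    return rng.integers(0, 2, size=5)

def c_star(x):
    return int(x[0] == 1 and x[2] == 1)

S = draw_sample(10, sample_from_D, c_star)
print(S[0])  # e.g. (array([1, 0, 1, 1, 0]), 1)
```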
Step 2: Hypothesis Selection
The learner outputs a hypothesis h: X → {0, 1} from some hypothesis class H. In the basic PAC model, H = C (the learner chooses from the same class it's learning). In general, H can differ from C.
Step 3: Evaluation
The hypothesis h is evaluated by its generalization error (also called true error or risk):
err_D(h) = Pr_{x~D}[h(x) ≠ c*(x)]
This is the probability that h misclassifies a random example drawn from D. Note: we measure error with respect to D, the same distribution training examples came from.
The Gap Between Training and Test Error:
The learner can observe the training error (empirical error):
err_S(h) = (1/m) |{i : h(xᵢ) ≠ c*(xᵢ)}|
But the learner wants to minimize the generalization error err_D(h), which cannot be directly computed (since D is unknown).
The central question of learning theory: Under what conditions does low training error imply low generalization error?
PAC learning provides precise answers to this question.
err_S(h) is observable; err_D(h) is what we care about. The gap between them is the 'generalization gap.' PAC learning bounds this gap, telling us when we can trust training performance. A learner that memorizes training data may have err_S(h) = 0 but terrible err_D(h)—this is overfitting.
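A toy illustration of that gap—a "learner" that memorizes its training set and predicts 0 everywhere else (the target concept, the uniform distribution over {0,1}¹⁰, and all names here are assumptions for the demo):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10

def c_star(x):
    # Assumed toy target: x0 AND x1
    return int(x[0] == 1 and x[1] == 1)

# "Learner" that memorizes training pairs and predicts 0 on everything unseen
train = [tuple(rng.integers(0, 2, n)) for _ in range(50)]
memory = {x: c_star(x) for x in train}

def h(x):
    return memory.get(tuple(x), 0)

# Training error is 0 by construction: every training point is memorized
err_S = sum(h(x) != c_star(x) for x in train) / len(train)

# Estimate err_D on fresh draws from the same D (uniform here)
test = [tuple(rng.integers(0, 2, n)) for _ in range(10_000)]
err_D = sum(h(x) != c_star(x) for x in test) / len(test)

print(err_S, err_D)  # err_S = 0.0, err_D ≈ 0.24 (close to Pr[c*(x) = 1])
```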
We now come to the central definition. A learning algorithm is PAC (Probably Approximately Correct) if it can output a hypothesis that is:
Formally:
A concept class C is PAC-learnable by a learning algorithm A if there exists a function m₀: (0,1) × (0,1) → ℕ such that:
For every ε > 0 (accuracy parameter) and δ > 0 (confidence parameter), for every distribution D over X, and for every target concept c* ∈ C,
When algorithm A is given a sample S of m ≥ m₀(ε, δ) examples drawn i.i.d. from D and labeled by c*:
Pr_S[err_D(h) ≤ ε] ≥ 1 - δ
where h is the hypothesis output by A(S).
Unpacking the Definition:
The ε Parameter (Accuracy): ε is the maximum generalization error we're willing to tolerate. Setting ε = 0.05 means we want a hypothesis that misclassifies at most 5% of random examples. Smaller ε requires more samples.
The δ Parameter (Confidence): δ is the probability of failure. Setting δ = 0.01 means we want our guarantee to hold at least 99% of the time. The randomness is over the choice of training set S—some 'unlucky' training sets might lead to bad hypotheses. Smaller δ requires more samples.
Why 'Probably Approximately'?
We cannot guarantee exact correctness (err_D(h) = 0) for two reasons: a finite sample cannot distinguish hypotheses that differ only on a low-probability region of D, so some error ε must be tolerated; and the sample itself is random, so there is always some chance it is unrepresentative and misleads the learner.
Thus, we settle for approximate correctness (err_D(h) ≤ ε) that holds probably (with probability ≥ 1-δ).
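To make the dependence on ε and δ concrete, here is the standard sufficient sample size for a consistent learner over a finite hypothesis class in the realizable case—quoted as a known result, not derived in this section: m₀(ε, δ) = ⌈(ln|H| + ln(1/δ)) / ε⌉.

```python
import math

def m0_finite_class(epsilon: float, delta: float, H_size: int) -> int:
    """Sufficient sample size for a consistent learner over a finite
    hypothesis class H in the realizable case:
    m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / epsilon)

# Monotone conjunctions over n = 10 Boolean variables: |H| = 2**10
print(m0_finite_class(epsilon=0.05, delta=0.01, H_size=2**10))  # 231
```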
```python
import numpy as np
from typing import Callable, List, Tuple

# Simulating PAC learning for axis-aligned rectangles

class PACRectangleLearner:
    """
    PAC learner for axis-aligned rectangles in R^2.

    Target concept: rectangle defined by a <= x <= b and c <= y <= d
    Hypothesis class: same as concept class (rectangles)

    This is a canonical example because:
    - It's simple enough to analyze completely
    - It demonstrates key PAC concepts
    - The sample complexity is O(1/ε * log(1/δ))
    """

    def __init__(self):
        self.hypothesis = None

    def learn(self, samples: List[Tuple[float, float, int]]):
        """
        Learn the tightest rectangle containing all positive examples.

        This is the 'consistent learner' strategy: find the smallest
        hypothesis consistent with the training data.

        samples: List of (x, y, label) where label ∈ {0, 1}
        """
        positive_samples = [(x, y) for x, y, label in samples if label == 1]

        if not positive_samples:
            # No positive examples: output empty rectangle (always predict 0)
            self.hypothesis = (float('inf'), float('-inf'),
                               float('inf'), float('-inf'))
            return

        # Tightest axis-aligned rectangle containing positive examples
        xs = [p[0] for p in positive_samples]
        ys = [p[1] for p in positive_samples]
        a = min(xs)  # left boundary
        b = max(xs)  # right boundary
        c = min(ys)  # bottom boundary
        d = max(ys)  # top boundary
        self.hypothesis = (a, b, c, d)

    def predict(self, x: float, y: float) -> int:
        """Predict label for point (x, y)."""
        a, b, c, d = self.hypothesis
        return int(a <= x <= b and c <= y <= d)

    def generalization_error(self,
                             true_rect: Tuple[float, float, float, float],
                             distribution: Callable):
        """
        Estimate generalization error by Monte Carlo sampling.

        true_rect: (a*, b*, c*, d*) defining the target concept
        distribution: function that samples from D
        """
        n_samples = 100000
        errors = 0
        a_t, b_t, c_t, d_t = true_rect
        for _ in range(n_samples):
            x, y = distribution()
            true_label = int(a_t <= x <= b_t and c_t <= y <= d_t)
            pred_label = self.predict(x, y)
            errors += (true_label != pred_label)
        return errors / n_samples

# Demonstration of PAC learning
def demonstrate_pac_learning(m: int, epsilon: float, delta: float):
    """
    Demonstrate PAC learning with varying sample sizes.

    m: number of training samples
    epsilon: desired accuracy (max generalization error)
    delta: confidence (probability of failure)
    """
    # True rectangle (target concept)
    true_rect = (0.2, 0.8, 0.3, 0.7)  # [0.2, 0.8] x [0.3, 0.7]

    # Distribution D: uniform over [0, 1]^2
    def sample_distribution():
        return np.random.uniform(0, 1), np.random.uniform(0, 1)

    # Generate training samples
    samples = []
    a_t, b_t, c_t, d_t = true_rect
    for _ in range(m):
        x, y = sample_distribution()
        label = int(a_t <= x <= b_t and c_t <= y <= d_t)
        samples.append((x, y, label))

    # Learn hypothesis
    learner = PACRectangleLearner()
    learner.learn(samples)

    # Compute generalization error
    gen_error = learner.generalization_error(true_rect, sample_distribution)
    return gen_error

# Run multiple trials to estimate probability of success
def pac_experiment(m: int, epsilon: float, n_trials: int = 100):
    """Run n_trials and count how often gen_error <= epsilon."""
    successes = sum(
        1 for _ in range(n_trials)
        if demonstrate_pac_learning(m, epsilon, 0.05) <= epsilon
    )
    return successes / n_trials

# Example output (conceptual):
# m=50,  ε=0.10: success rate ≈ 0.85
# m=100, ε=0.10: success rate ≈ 0.95
# m=200, ε=0.10: success rate ≈ 0.99
# m=500, ε=0.05: success rate ≈ 0.99
```

Abstract definitions become clearer with concrete examples. Let's examine PAC learning for one of the simplest concept classes: conjunctions over Boolean variables.
Concept Class: Monotone Conjunctions
Consider the instance space X = {0,1}ⁿ (Boolean vectors of length n). A monotone conjunction is an AND of some subset of the variables, with no negations—for example, c(x) = x₂ ∧ x₅ ∧ x₇.
There are 2ⁿ monotone conjunctions (each variable is either included or not).
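A quick sanity check of that count, enumerating the monotone conjunctions for a small n (a throwaway sketch; each conjunction is represented by the subset of variables it includes):

```python
from itertools import combinations

n = 3
variables = range(n)

# Each monotone conjunction corresponds to a subset of variables ANDed together
# (the empty subset is the conjunction that is always true)
conjunctions = [s for k in range(n + 1) for s in combinations(variables, k)]

print(len(conjunctions), 2**n)  # 8 8: one conjunction per subset of variables
```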
A positive example (labeled 1) proves that any variable with value 0 in that example cannot be in the target conjunction—otherwise the example would be labeled 0. The algorithm therefore starts with the conjunction of all n variables and eliminates every variable that appears as 0 in some positive example. Variables that belong to the target conjunction are 1 in every positive example, so they always survive: the hypothesis retains all of the target's variables, possibly along with extras.
Analyzing the Algorithm:
Let the target concept be c* = x_{i₁} ∧ x_{i₂} ∧ ... ∧ x_{iₖ} (a conjunction of k variables).
After seeing all positive examples, our hypothesis h is the conjunction of the surviving variables. Viewed as a set of variables, h is a superset of c*, so h can err in only one direction: it predicts 0 on an instance that c* labels 1. This happens when x has a 0 for some variable in h that's not in c*—a 'ghost' variable we didn't eliminate.
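The union-bound argument behind the O((n/ε) · log(n/δ)) sample complexity can be sketched as follows (this is the standard textbook analysis, restated here as a sketch rather than a full proof):

```latex
% For each ghost variable i \in h \setminus c^*, let
p_i = \Pr_{x \sim D}\bigl[c^*(x) = 1 \wedge x_i = 0\bigr]
% be the probability that a random example eliminates i. Since h errs only
% when c^*(x) = 1 and some ghost variable is 0, the union bound gives
\mathrm{err}_D(h) \;\le\; \sum_{i \in h \setminus c^*} p_i .
% Call i "dangerous" if p_i \ge \varepsilon/n. A dangerous variable survives
% m i.i.d. samples with probability at most
(1 - \varepsilon/n)^m \;\le\; e^{-m\varepsilon/n} .
% Union bound over at most n dangerous variables: all are eliminated with
% probability at least 1 - \delta once
n \, e^{-m\varepsilon/n} \le \delta
\quad\Longleftrightarrow\quad
m \;\ge\; \frac{n}{\varepsilon} \ln\frac{n}{\delta} .
% If only non-dangerous ghosts survive, then
\mathrm{err}_D(h) \;<\; n \cdot \frac{\varepsilon}{n} \;=\; \varepsilon .
```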
```python
import numpy as np
from typing import Set, Tuple, List

class ConjunctionLearner:
    """
    PAC learner for monotone conjunctions over {0,1}^n.

    This is the classic 'consistent hypothesis' algorithm:
    - Start with all variables in the conjunction
    - Remove any variable that appears as 0 in a positive example
    - The remaining variables form the hypothesis
    """

    def __init__(self, n: int):
        """Initialize learner for n Boolean variables."""
        self.n = n
        self.hypothesis: Set[int] = set()

    def learn(self, samples: List[Tuple[np.ndarray, int]]):
        """
        Learn from labeled samples.

        samples: List of (x, label) where x ∈ {0,1}^n, label ∈ {0,1}
        """
        # Start with all variables in the hypothesis
        self.hypothesis = set(range(self.n))

        # Process positive examples only
        for x, label in samples:
            if label == 1:
                # Remove variables that are 0 in this positive example
                for i in range(self.n):
                    if x[i] == 0 and i in self.hypothesis:
                        self.hypothesis.remove(i)

    def predict(self, x: np.ndarray) -> int:
        """Predict label for instance x: 1 iff all hypothesis variables are 1."""
        for i in self.hypothesis:
            if x[i] == 0:
                return 0
        return 1

    def error_region_analysis(self, target: Set[int], distribution) -> float:
        """
        Analyze where hypothesis h differs from target c*.

        Error occurs when:
        - Target c*(x) = 1 (all target vars are 1)
        - Hypothesis h(x) = 0 (some hypothesis var is 0)

        This happens for 'ghost variables': vars in h but not in c*.
        For each ghost variable i, the error region contains points where
        all target variables are 1 and variable i is 0.
        """
        ghost_vars = self.hypothesis - target  # Ghost variables cause false negatives

        # Monte Carlo estimation of the error probability
        n_samples = 50000
        errors = 0
        for _ in range(n_samples):
            x = distribution()
            true_label = int(all(x[i] == 1 for i in target)) if target else 1
            pred_label = self.predict(x)
            errors += (true_label != pred_label)
        return errors / n_samples

def sample_complexity_for_conjunctions(n: int, epsilon: float, delta: float):
    """
    Calculate sample complexity for PAC learning conjunctions.

    Theorem: m = O((n/ε) * log(n/δ)) samples suffice.

    Proof sketch:
    - There are at most n 'ghost variable' risks
    - A ghost variable whose elimination probability is at least ε/n survives
      m samples with probability at most e^(-mε/n)
    - By the union bound over n variables, m >= (n/ε) * ln(n/δ) makes the
      total failure probability at most δ; the surviving (non-dangerous)
      ghosts then contribute error less than n * (ε/n) = ε
    """
    # Bound: m = (n/ε) * ln(n/δ)
    m = int(np.ceil((n / epsilon) * np.log(n / delta)))
    return m

# Example calculation
n = 10          # 10 Boolean variables
epsilon = 0.05  # 5% error tolerance
delta = 0.01    # 99% confidence

required_samples = sample_complexity_for_conjunctions(n, epsilon, delta)
print(f"For n={n}, ε={epsilon}, δ={delta}:")
print(f"Required samples: m ≥ {required_samples}")  # 1382 for these parameters
```

The parameters ε and δ are not just technical details—they are the knobs that control the tradeoff between sample size and learning guarantees.
The Accuracy Parameter ε:
ε bounds the generalization error we tolerate. The relationship between m (sample size) and ε is typically:
m = O(1/ε)
To halve ε (double accuracy), we roughly need to double m. This makes intuitive sense: to distinguish between hypotheses that differ by only a small margin, we need more evidence.
The Confidence Parameter δ:
δ bounds the probability of failure. The relationship is typically:
m = O(log(1/δ))
To reduce δ by a factor of 10 (e.g., from 1% to 0.1% failure rate), we only need to increase m by a constant factor. This logarithmic dependence is good news—high confidence is 'cheap.'
| ε (accuracy) | δ (failure prob) | Approximate m needed | Interpretation |
|---|---|---|---|
| 0.10 (10%) | 0.10 (10%) | ~230 | Rough estimate, moderate confidence |
| 0.10 (10%) | 0.01 (1%) | ~460 | Same accuracy, higher confidence |
| 0.05 (5%) | 0.01 (1%) | ~920 | Higher accuracy, high confidence |
| 0.01 (1%) | 0.01 (1%) | ~4600 | Very high accuracy, high confidence |
| 0.01 (1%) | 0.001 (0.1%) | ~6900 | Very high accuracy, very high confidence |
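For reference, the table's entries are consistent with an estimate of the form m ≈ C · (1/ε) · ln(1/δ) with C = 10—a constant chosen here purely to match the numbers above, not a universal value:

```python
import math

def approx_m(epsilon: float, delta: float, C: float = 10.0) -> int:
    """Illustrative sample-size estimate: m ≈ C * (1/eps) * ln(1/delta)."""
    return round(C * math.log(1 / delta) / epsilon)

for eps, delta in [(0.10, 0.10), (0.10, 0.01), (0.05, 0.01),
                   (0.01, 0.01), (0.01, 0.001)]:
    print(f"ε={eps}, δ={delta}: m ≈ {approx_m(eps, delta)}")
# ≈ 230, 461, 921, 4605, 6908 — matching the table up to rounding
```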
Practical Interpretation:
When you train a machine learning model, you're implicitly choosing ε and δ:
ε: 'How accurate do I need my model to be?' For medical diagnosis, you might need ε = 0.01. For content recommendation, ε = 0.10 might be acceptable.
δ: 'How confident do I need to be in this guarantee?' If model retraining is cheap, high δ might be acceptable. If deployment is expensive, you want very low δ.
The key insight is that both parameters affect sample complexity, but in different ways: ε enters polynomially (m grows like 1/ε), while δ enters only logarithmically (m grows like log(1/δ)).
This asymmetry suggests: once you have enough samples for your desired accuracy, getting very high confidence 'comes for free' (relatively speaking).
In practice, the dominant factor is usually 1/ε. The logarithmic dependence on δ means you can typically set δ very small (like 0.001) without dramatically increasing sample requirements. The main sample size driver is how accurate you need to be.
A remarkable feature of PAC learning is that it is distribution-free. The guarantees hold for ANY distribution D over the instance space—we don't need to know or assume anything about D.
Why is this important?
In real-world machine learning, we rarely know the true distribution of the data: the process generating the emails, images, or user behavior we observe is far too complex to model explicitly.
Distribution-free guarantees mean our bounds hold regardless of these unknowns. We don't need to guess or model the distribution—the theory works for whatever the true distribution happens to be.
Distribution-free bounds are worst-case bounds. They hold for the adversarially chosen distribution that makes learning hardest. If you know something about your distribution (e.g., it's smooth, or has certain symmetries), you can often do better. But when you don't know, distribution-free bounds give you a safe lower bar.
What 'Distribution-Free' Really Means:
The PAC definition says 'for every distribution D.' This means the sample bound m₀(ε, δ) must hold simultaneously for every possible distribution—even a worst-case one chosen adversarially after seeing our algorithm.
Example: How Different D's Affect Learning
Consider learning a linear separator in 2D. If D is uniform over the unit square, a moderate sample pins down the boundary reasonably well. If D concentrates its mass near the decision boundary, nearby separators are hard to distinguish and many samples are needed. If D puts its mass far from the boundary, almost any consistent separator is accurate and learning is easy.
A distribution-free bound must handle all these cases, so it's calibrated for the adversarial case. This makes bounds sometimes loose for 'nice' distributions, but reliably safe.
```python
import numpy as np

def visualize_distribution_effects():
    """
    Illustrate how different distributions affect learning difficulty
    for the same concept (a linear separator).
    """
    np.random.seed(42)

    # True separator: x + y = 1 (positive when x + y >= 1)
    def true_concept(x, y):
        return int(x + y >= 1)

    # Distribution 1: uniform over [0, 1]^2
    def uniform_distribution(n):
        points = np.random.uniform(0, 1, (n, 2))
        labels = np.array([true_concept(x, y) for x, y in points])
        return points, labels

    # Distribution 2: concentrated near the boundary (hard)
    def boundary_distribution(n):
        # Points clustered near the line x + y = 1
        t = np.random.uniform(-0.2, 0.2, n)
        u = np.random.uniform(0, 1, n)
        x = np.clip(u + t / 2, 0, 1)
        y = np.clip(1 - u + t / 2, 0, 1)
        points = np.column_stack([x, y])
        labels = np.array([true_concept(xi, yi) for xi, yi in points])
        return points, labels

    # Distribution 3: concentrated far from the boundary (easy)
    def easy_distribution(n):
        # Half the points in the bottom-left, half in the top-right
        points = np.zeros((n, 2))
        labels = np.zeros(n, dtype=int)
        for i in range(n):
            if np.random.random() < 0.5:
                # Negative region (far from the boundary)
                points[i] = np.random.uniform(0, 0.3, 2)
                labels[i] = 0
            else:
                # Positive region (far from the boundary)
                points[i] = np.random.uniform(0.7, 1, 2)
                labels[i] = 1
        return points, labels

    # The PAC guarantee holds for ALL three distributions!
    # But actual learning difficulty varies significantly.
    # With m = 20 samples from each distribution:
    # - Uniform: good chance of finding an approximate separator
    # - Boundary: hard to distinguish separators, need more samples
    # - Easy: very easy, few samples needed

    # Key insight: distribution-free theory bounds the WORST case
    # (which is like the boundary distribution)
    return {
        'uniform': uniform_distribution(100),
        'boundary': boundary_distribution(100),
        'easy': easy_distribution(100)
    }
```

We've now established the foundational concepts of PAC learning. Let's consolidate the key ideas before moving forward:
| Component | Symbol | Role | Key Constraint |
|---|---|---|---|
| Instance Space | X | Set of all possible examples | Fixed in advance |
| Concept Class | C | Hypotheses we're learning over | Determines complexity |
| Target Concept | c* ∈ C | True labeling function | Unknown, but in C |
| Distribution | D | Probability over instances | Unknown, but fixed |
| Training Set | S | m i.i.d. samples from D | Our only information |
| Hypothesis | h | Learner's output | Goal: err_D(h) ≤ ε |
| Accuracy | ε | Maximum tolerable error | m grows as 1/ε |
| Confidence | 1-δ | Probability of success | m grows as log(1/δ) |
What's Next:
Now that we understand the PAC framework, we'll explore sample complexity—exactly how many samples are needed to learn a given concept class. We'll see that sample complexity depends on the 'size' of the concept class and derive concrete bounds for important cases.
The fundamental question becomes: For a concept class C, what is m(ε, δ)? The answer reveals deep connections between learning, combinatorics, and computational complexity.
You now understand the PAC learning framework—the gold standard for theoretical analysis of machine learning. The key insight is that learning is about generalization, and PAC provides rigorous, distribution-free guarantees about when generalization is possible. Next, we'll quantify exactly how many samples are needed through the lens of sample complexity.