When we train a machine learning model on a dataset, we're implicitly making a profound leap of faith. We observe the model's performance on training data—a finite sample from some unknown distribution—and then trust that this performance will generalize to unseen examples drawn from the same distribution.
But how can we justify this leap? Why should low training error imply low error on new data? Is there a principled way to quantify our confidence in a learned model's predictions?
These are not merely philosophical questions. They strike at the heart of what makes machine learning possible—or impossible—in any given setting. The Probably Approximately Correct (PAC) learning framework, introduced by Leslie Valiant in 1984, provides the first rigorous mathematical answer to these fundamental questions.
By the end of this page, you will understand the PAC learning framework's core components: the formal learning model, the meanings of 'probably' and 'approximately,' the role of the unknown target distribution, and how these elements combine to provide the first theoretical guarantees for machine learning. This framework will become your lens for understanding all subsequent generalization theory.
Before PAC learning, machine learning existed primarily as an empirical discipline. Researchers developed algorithms—perceptrons, decision trees, early neural networks—and demonstrated their effectiveness through experiments. But there was no rigorous framework to answer fundamental questions: How much data does learning require? Which concept classes are learnable at all? When does low training error imply low error on new data?
These questions matter immensely in practice. Without theoretical grounding, we have no way to distinguish between problems that are hard due to insufficient data, hard due to computational limitations, or fundamentally impossible to learn.
Leslie Valiant, working at Harvard, recognized that learning could be formalized as a computational problem. His 1984 paper 'A Theory of the Learnable' introduced PAC learning and earned him the Turing Award in 2010. Valiant's key insight was that learning should be judged not just by whether it succeeds, but by whether it succeeds efficiently—both in terms of sample size and computation time.
The pre-PAC landscape:
Perceptron Convergence Theorem (1962): Rosenblatt proved that perceptrons converge when data is linearly separable—but said nothing about generalization to unseen data.
Statistical Decision Theory: Provided tools for analyzing estimation problems, but typically assumed the model class was known to contain the true distribution—too strong an assumption for learning.
Computational Learning (ad hoc): Various algorithms were analyzed individually, but there was no unifying framework to compare them or understand their limitations.
PAC learning unified these threads, providing a single framework in which generalization, sample size, and computational efficiency can all be analyzed together.
To make precise statements about learning, we must define our terms rigorously. The PAC framework introduces a formal model with several key components:
The Instance Space X:
We denote by X the set of all possible instances (examples, inputs). In a spam classification problem, X might be the set of all possible emails. In image recognition, X could be the set of all possible images of a fixed resolution. Formally, X is often taken to be {0, 1}ⁿ (binary vectors of length n) or ℝⁿ (real-valued vectors), but the theory extends to arbitrary instance spaces.
The original PAC framework focused on X = {0, 1}ⁿ because it allows clean combinatorial analysis. Each instance is a bit vector of length n, and concepts are subsets of {0, 1}ⁿ. This makes bounds on hypothesis class size and sample complexity concrete and computable. The framework extends naturally to continuous domains.
The Concept Class C:
A concept is a function c: X → {0, 1} that assigns a binary label to every instance. We can equivalently think of a concept as a subset of X—the set of instances labeled 1 (positive examples).
A concept class C is a collection of concepts we're trying to learn—for example, conjunctions over Boolean variables, axis-aligned rectangles in the plane, or linear separators in ℝⁿ (all three appear in the code below).
The choice of concept class C is crucial—it determines what can be learned and with what sample complexity.
```python
# Example concept classes and instances
import numpy as np

# Instance space: binary vectors of length n = 4
# Each instance is a 4-bit vector
n = 4
instance_space_size = 2**n  # 16 possible instances

# Example concept: conjunction (x1 AND NOT x2)
# Returns 1 iff x[0] = 1 and x[1] = 0
def conjunction_concept(x):
    """A concept from the class of conjunctions."""
    return int(x[0] == 1 and x[1] == 0)

# Example concept: axis-aligned rectangle in R^2
# Returns 1 iff point (x, y) is inside the rectangle [a, b] x [c, d]
class AxisAlignedRectangle:
    def __init__(self, a, b, c, d):
        """Rectangle with x in [a, b] and y in [c, d]."""
        self.a, self.b, self.c, self.d = a, b, c, d

    def __call__(self, point):
        x, y = point
        return int(self.a <= x <= self.b and self.c <= y <= self.d)

# Example concept: linear separator in R^n
# Returns 1 iff w·x + b >= 0
class LinearSeparator:
    def __init__(self, weights, bias):
        """Hyperplane defined by w·x + b >= 0."""
        self.w = np.array(weights)
        self.b = bias

    def __call__(self, x):
        return int(np.dot(self.w, x) + self.b >= 0)

# The concept class C is the SET of all such concepts:
# |C_conjunctions| = 3^n (each variable: positive, negative, or absent)
# |C_rectangles| = infinite (but VC dimension = 4)
# |C_linear_separators| = infinite (VC dimension = n + 1)
```

The Target Distribution D:
In PAC learning, we assume there exists an unknown probability distribution D over the instance space X. This distribution governs how training examples are generated and how the learned hypothesis is ultimately evaluated.
Crucially, the same distribution D is used for both training and testing. This is a fundamental assumption—if the test distribution differs from training, all bets are off. This assumption is sometimes called the i.i.d. (independent and identically distributed) assumption.
The distribution D is unknown to the learner. We cannot assume D is uniform, or normal, or any particular form. The theory must work for any distribution D—this is what makes PAC learning 'distribution-free.'
Distribution-free learning is both a strength and a limitation. It means PAC bounds hold for ANY distribution—no assumptions needed. But it also means bounds can be loose for 'nice' distributions. This is why PAC bounds are often worst-case bounds. In practice, when we know something about the distribution, tighter analysis may be possible.
The Target Concept c* ∈ C:
We assume there exists a target concept c* that belongs to the concept class C. This concept perfectly labels all instances—it represents the 'ground truth' we're trying to learn.
The assumption that c* ∈ C is called the realizability assumption or consistent hypothesis assumption. It means the target concept is expressible within our chosen concept class.
Note: This is a strong assumption. In reality, our concept class C might not contain the true labeling function. Later, we'll discuss agnostic PAC learning, which relaxes this assumption.
With our components defined, we can now specify exactly how learning proceeds:
Step 1: Sampling
The learner receives a training set S = {(x₁, c*(x₁)), (x₂, c*(x₂)), ..., (xₘ, c*(xₘ))}
where each xᵢ is drawn independently from the distribution D, and labels come from the target concept c*. The number of samples m is a key parameter—we want to understand how m must grow with our requirements.
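A minimal sketch of this sampling step (the helper `draw_sample`, the uniform distribution, and the toy target below are illustrative assumptions, not part of the formal model):

```python
import numpy as np

def draw_sample(m, sample_from_D, c_star):
    """Draw m i.i.d. instances from D and label each one with the target c*."""
    return [(x, c_star(x)) for x in (sample_from_D() for _ in range(m))]

# Example usage: D uniform over {0,1}^5, target c* = x1 AND x3 (indices 0 and 2)
rng = np.random.default_rng(0)

def sample_from_D():
    return rng.integers(0, 2, size=5)

def c_star(x):
    return int(x[0] == 1 and x[2] == 1)

S = draw_sample(10, sample_from_D, c_star)
print(S[0])  # e.g. (array([1, 0, 1, 1, 0]), 1)
```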
Step 2: Hypothesis Selection
The learner outputs a hypothesis h: X → {0, 1} from some hypothesis class H. In the basic PAC model, H = C (the learner chooses from the same class it's learning). In general, H can differ from C.
Step 3: Evaluation
The hypothesis h is evaluated by its generalization error (also called true error or risk):
err_D(h) = Pr_{x~D}[h(x) ≠ c*(x)]
This is the probability that h misclassifies a random example drawn from D. Note: we measure error with respect to D, the same distribution training examples came from.
The Gap Between Training and Test Error:
The learner can observe the training error (empirical error):
err_S(h) = (1/m) |{i : h(xᵢ) ≠ c*(xᵢ)}|
But the learner wants to minimize the generalization error err_D(h), which cannot be directly computed (since D is unknown).
The central question of learning theory: Under what conditions does low training error imply low generalization error?
PAC learning provides precise answers to this question.
err_S(h) is observable; err_D(h) is what we care about. The gap between them is the 'generalization gap.' PAC learning bounds this gap, telling us when we can trust training performance. A learner that memorizes training data may have err_S(h) = 0 but terrible err_D(h)—this is overfitting.
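A toy illustration of that gap—a "learner" that memorizes its training set and predicts 0 everywhere else (the target concept, the uniform distribution over {0,1}¹⁰, and all names here are assumptions for the demo):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10

def c_star(x):
    # Assumed toy target: x0 AND x1
    return int(x[0] == 1 and x[1] == 1)

# "Learner" that memorizes training pairs and predicts 0 on everything unseen
train = [tuple(rng.integers(0, 2, n)) for _ in range(50)]
memory = {x: c_star(x) for x in train}

def h(x):
    return memory.get(tuple(x), 0)

# Training error is 0 by construction: every training point is memorized
err_S = sum(h(x) != c_star(x) for x in train) / len(train)

# Estimate err_D on fresh draws from the same D (uniform here)
test = [tuple(rng.integers(0, 2, n)) for _ in range(10_000)]
err_D = sum(h(x) != c_star(x) for x in test) / len(test)

print(err_S, err_D)  # err_S = 0.0, err_D ≈ 0.24 (close to Pr[c*(x) = 1])
```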
We now come to the central definition. A learning algorithm is PAC (Probably Approximately Correct) if it can output a hypothesis that is:
Formally:
A concept class C is PAC-learnable by a learning algorithm A if there exists a function m₀: (0,1) × (0,1) → ℕ such that:
For every ε > 0 (accuracy parameter) and δ > 0 (confidence parameter), for every distribution D over X, and for every target concept c* ∈ C,
When algorithm A is given a sample S of m ≥ m₀(ε, δ) examples drawn i.i.d. from D and labeled by c*:
Pr_S[err_D(h) ≤ ε] ≥ 1 - δ
where h is the hypothesis output by A(S).
Unpacking the Definition:
The ε Parameter (Accuracy): ε is the maximum generalization error we're willing to tolerate. Setting ε = 0.05 means we want a hypothesis that misclassifies at most 5% of random examples. Smaller ε requires more samples.
The δ Parameter (Confidence): δ is the probability of failure. Setting δ = 0.01 means we want our guarantee to hold at least 99% of the time. The randomness is over the choice of training set S—some 'unlucky' training sets might lead to bad hypotheses. Smaller δ requires more samples.
Why 'Probably Approximately'?
We cannot guarantee exact correctness (err_D(h) = 0) for two reasons: a finite sample cannot distinguish hypotheses that differ only on a low-probability region of D, so some error ε must be tolerated; and the sample itself is random, so there is always some chance it is unrepresentative and misleads the learner.
Thus, we settle for approximate correctness (err_D(h) ≤ ε) that holds probably (with probability ≥ 1-δ).
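To make the dependence on ε and δ concrete, here is the standard sufficient sample size for a consistent learner over a finite hypothesis class in the realizable case—quoted as a known result, not derived in this section: m₀(ε, δ) = ⌈(ln|H| + ln(1/δ)) / ε⌉.

```python
import math

def m0_finite_class(epsilon: float, delta: float, H_size: int) -> int:
    """Sufficient sample size for a consistent learner over a finite
    hypothesis class H in the realizable case:
    m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / epsilon)

# Monotone conjunctions over n = 10 Boolean variables: |H| = 2**10
print(m0_finite_class(epsilon=0.05, delta=0.01, H_size=2**10))  # 231
```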
```python
import numpy as np
from typing import Callable, List, Tuple

# Simulating PAC learning for axis-aligned rectangles

class PACRectangleLearner:
    """
    PAC learner for axis-aligned rectangles in R^2.

    Target concept: rectangle defined by a <= x <= b and c <= y <= d
    Hypothesis class: same as concept class (rectangles)

    This is a canonical example because:
    - It's simple enough to analyze completely
    - It demonstrates key PAC concepts
    - The sample complexity is O(1/ε * log(1/δ))
    """

    def __init__(self):
        self.hypothesis = None

    def learn(self, samples: List[Tuple[float, float, int]]):
        """
        Learn the tightest rectangle containing all positive examples.

        This is the 'consistent learner' strategy: find the smallest
        hypothesis consistent with the training data.

        samples: List of (x, y, label) where label ∈ {0, 1}
        """
        positive_samples = [(x, y) for x, y, label in samples if label == 1]

        if not positive_samples:
            # No positive examples: output empty rectangle (always predict 0)
            self.hypothesis = (float('inf'), float('-inf'),
                               float('inf'), float('-inf'))
            return

        # Tightest axis-aligned rectangle containing positive examples
        xs = [p[0] for p in positive_samples]
        ys = [p[1] for p in positive_samples]
        a = min(xs)  # left boundary
        b = max(xs)  # right boundary
        c = min(ys)  # bottom boundary
        d = max(ys)  # top boundary
        self.hypothesis = (a, b, c, d)

    def predict(self, x: float, y: float) -> int:
        """Predict label for point (x, y)."""
        a, b, c, d = self.hypothesis
        return int(a <= x <= b and c <= y <= d)

    def generalization_error(self,
                             true_rect: Tuple[float, float, float, float],
                             distribution: Callable):
        """
        Estimate generalization error by Monte Carlo sampling.

        true_rect: (a*, b*, c*, d*) defining the target concept
        distribution: function that samples from D
        """
        n_samples = 100000
        errors = 0
        a_t, b_t, c_t, d_t = true_rect
        for _ in range(n_samples):
            x, y = distribution()
            true_label = int(a_t <= x <= b_t and c_t <= y <= d_t)
            pred_label = self.predict(x, y)
            errors += (true_label != pred_label)
        return errors / n_samples

# Demonstration of PAC learning
def demonstrate_pac_learning(m: int, epsilon: float, delta: float):
    """
    Demonstrate PAC learning with varying sample sizes.

    m: number of training samples
    epsilon: desired accuracy (max generalization error)
    delta: confidence (probability of failure)
    """
    # True rectangle (target concept)
    true_rect = (0.2, 0.8, 0.3, 0.7)  # [0.2, 0.8] x [0.3, 0.7]

    # Distribution D: uniform over [0, 1]^2
    def sample_distribution():
        return np.random.uniform(0, 1), np.random.uniform(0, 1)

    # Generate training samples
    samples = []
    a_t, b_t, c_t, d_t = true_rect
    for _ in range(m):
        x, y = sample_distribution()
        label = int(a_t <= x <= b_t and c_t <= y <= d_t)
        samples.append((x, y, label))

    # Learn hypothesis
    learner = PACRectangleLearner()
    learner.learn(samples)

    # Compute generalization error
    gen_error = learner.generalization_error(true_rect, sample_distribution)
    return gen_error

# Run multiple trials to estimate probability of success
def pac_experiment(m: int, epsilon: float, n_trials: int = 100):
    """Run n_trials and count how often gen_error <= epsilon."""
    successes = sum(
        1 for _ in range(n_trials)
        if demonstrate_pac_learning(m, epsilon, 0.05) <= epsilon
    )
    return successes / n_trials

# Example output (conceptual):
# m=50,  ε=0.10: success rate ≈ 0.85
# m=100, ε=0.10: success rate ≈ 0.95
# m=200, ε=0.10: success rate ≈ 0.99
# m=500, ε=0.05: success rate ≈ 0.99
```

Abstract definitions become clearer with concrete examples. Let's examine PAC learning for one of the simplest concept classes: conjunctions over Boolean variables.
Concept Class: Monotone Conjunctions
Consider the instance space X = {0,1}ⁿ (Boolean vectors of length n). A monotone conjunction is an AND of some subset of the variables, with no negations—for example, c(x) = x₂ ∧ x₅ ∧ x₇.
There are 2ⁿ monotone conjunctions (each variable is either included or not).
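A quick sanity check of that count, enumerating the monotone conjunctions for a small n (a throwaway sketch; each conjunction is represented by the subset of variables it includes):

```python
from itertools import combinations

n = 3
variables = range(n)

# Each monotone conjunction corresponds to a subset of variables ANDed together
# (the empty subset is the conjunction that is always true)
conjunctions = [s for k in range(n + 1) for s in combinations(variables, k)]

print(len(conjunctions), 2**n)  # 8 8: one conjunction per subset of variables
```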
A positive example (labeled 1) proves that any variable with value 0 in that example cannot be in the target conjunction—otherwise the example would be labeled 0. The algorithm therefore starts with the conjunction of all n variables and eliminates every variable that appears as 0 in some positive example. Variables that belong to the target conjunction are 1 in every positive example, so they always survive: the hypothesis retains all of the target's variables, possibly along with extras.
Analyzing the Algorithm:
Let the target concept be c* = x_{i₁} ∧ x_{i₂} ∧ ... ∧ x_{iₖ} (a conjunction of k variables).
After seeing all positive examples, our hypothesis h is the conjunction of the surviving variables. Viewed as a set of variables, h is a superset of c*, so h can err in only one direction: it predicts 0 on an instance that c* labels 1. This happens when x has a 0 for some variable in h that's not in c*—a 'ghost' variable we didn't eliminate.
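The union-bound argument behind the O((n/ε) · log(n/δ)) sample complexity can be sketched as follows (this is the standard textbook analysis, restated here as a sketch rather than a full proof):

```latex
% For each ghost variable i \in h \setminus c^*, let
p_i = \Pr_{x \sim D}\bigl[c^*(x) = 1 \wedge x_i = 0\bigr]
% be the probability that a random example eliminates i. Since h errs only
% when c^*(x) = 1 and some ghost variable is 0, the union bound gives
\mathrm{err}_D(h) \;\le\; \sum_{i \in h \setminus c^*} p_i .
% Call i "dangerous" if p_i \ge \varepsilon/n. A dangerous variable survives
% m i.i.d. samples with probability at most
(1 - \varepsilon/n)^m \;\le\; e^{-m\varepsilon/n} .
% Union bound over at most n dangerous variables: all are eliminated with
% probability at least 1 - \delta once
n \, e^{-m\varepsilon/n} \le \delta
\quad\Longleftrightarrow\quad
m \;\ge\; \frac{n}{\varepsilon} \ln\frac{n}{\delta} .
% If only non-dangerous ghosts survive, then
\mathrm{err}_D(h) \;<\; n \cdot \frac{\varepsilon}{n} \;=\; \varepsilon .
```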
```python
import numpy as np
from typing import Set, Tuple, List

class ConjunctionLearner:
    """
    PAC learner for monotone conjunctions over {0,1}^n.

    This is the classic 'consistent hypothesis' algorithm:
    - Start with all variables in the conjunction
    - Remove any variable that appears as 0 in a positive example
    - The remaining variables form the hypothesis
    """

    def __init__(self, n: int):
        """Initialize learner for n Boolean variables."""
        self.n = n
        self.hypothesis: Set[int] = set()

    def learn(self, samples: List[Tuple[np.ndarray, int]]):
        """
        Learn from labeled samples.

        samples: List of (x, label) where x ∈ {0,1}^n, label ∈ {0,1}
        """
        # Start with all variables in the hypothesis
        self.hypothesis = set(range(self.n))

        # Process positive examples only
        for x, label in samples:
            if label == 1:
                # Remove variables that are 0 in this positive example
                for i in range(self.n):
                    if x[i] == 0 and i in self.hypothesis:
                        self.hypothesis.remove(i)

    def predict(self, x: np.ndarray) -> int:
        """Predict label for instance x: 1 iff all hypothesis variables are 1."""
        for i in self.hypothesis:
            if x[i] == 0:
                return 0
        return 1

    def error_region_analysis(self, target: Set[int], distribution) -> float:
        """
        Analyze where hypothesis h differs from target c*.

        Error occurs when:
        - Target c*(x) = 1 (all target vars are 1)
        - Hypothesis h(x) = 0 (some hypothesis var is 0)

        This happens for 'ghost variables': vars in h but not in c*.
        For each ghost variable i, the error region contains points where
        all target variables are 1 and variable i is 0.
        """
        ghost_vars = self.hypothesis - target  # Ghost variables cause false negatives

        # Monte Carlo estimation of the error probability
        n_samples = 50000
        errors = 0
        for _ in range(n_samples):
            x = distribution()
            true_label = int(all(x[i] == 1 for i in target)) if target else 1
            pred_label = self.predict(x)
            errors += (true_label != pred_label)
        return errors / n_samples

def sample_complexity_for_conjunctions(n: int, epsilon: float, delta: float):
    """
    Calculate sample complexity for PAC learning conjunctions.

    Theorem: m = O((n/ε) * log(n/δ)) samples suffice.

    Proof sketch:
    - There are at most n 'ghost variable' risks
    - A ghost variable whose elimination probability is at least ε/n survives
      m samples with probability at most e^(-mε/n)
    - By the union bound over n variables, m >= (n/ε) * ln(n/δ) makes the
      total failure probability at most δ; the surviving (non-dangerous)
      ghosts then contribute error less than n * (ε/n) = ε
    """
    # Bound: m = (n/ε) * ln(n/δ)
    m = int(np.ceil((n / epsilon) * np.log(n / delta)))
    return m

# Example calculation
n = 10          # 10 Boolean variables
epsilon = 0.05  # 5% error tolerance
delta = 0.01    # 99% confidence

required_samples = sample_complexity_for_conjunctions(n, epsilon, delta)
print(f"For n={n}, ε={epsilon}, δ={delta}:")
print(f"Required samples: m ≥ {required_samples}")  # 1382 for these parameters
```

The parameters ε and δ are not just technical details—they are the knobs that control the tradeoff between sample size and learning guarantees.
The Accuracy Parameter ε:
ε bounds the generalization error we tolerate. The relationship between m (sample size) and ε is typically:
m = O(1/ε)
To halve ε (double accuracy), we roughly need to double m. This makes intuitive sense: to distinguish between hypotheses that differ by only a small margin, we need more evidence.
The Confidence Parameter δ:
δ bounds the probability of failure. The relationship is typically:
m = O(log(1/δ))
To reduce δ by a factor of 10 (e.g., from 1% to 0.1% failure rate), we only need to increase m by a constant factor. This logarithmic dependence is good news—high confidence is 'cheap.'
| ε (accuracy) | δ (failure prob) | Approximate m needed | Interpretation |
|---|---|---|---|
| 0.10 (10%) | 0.10 (10%) | ~230 | Rough estimate, moderate confidence |
| 0.10 (10%) | 0.01 (1%) | ~460 | Same accuracy, higher confidence |
| 0.05 (5%) | 0.01 (1%) | ~920 | Higher accuracy, high confidence |
| 0.01 (1%) | 0.01 (1%) | ~4600 | Very high accuracy, high confidence |
| 0.01 (1%) | 0.001 (0.1%) | ~6900 | Very high accuracy, very high confidence |
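For reference, the table's entries are consistent with an estimate of the form m ≈ C · (1/ε) · ln(1/δ) with C = 10—a constant chosen here purely to match the numbers above, not a universal value:

```python
import math

def approx_m(epsilon: float, delta: float, C: float = 10.0) -> int:
    """Illustrative sample-size estimate: m ≈ C * (1/eps) * ln(1/delta)."""
    return round(C * math.log(1 / delta) / epsilon)

for eps, delta in [(0.10, 0.10), (0.10, 0.01), (0.05, 0.01),
                   (0.01, 0.01), (0.01, 0.001)]:
    print(f"ε={eps}, δ={delta}: m ≈ {approx_m(eps, delta)}")
# ≈ 230, 461, 921, 4605, 6908 — matching the table up to rounding
```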
Practical Interpretation:
When you train a machine learning model, you're implicitly choosing ε and δ:
ε: 'How accurate do I need my model to be?' For medical diagnosis, you might need ε = 0.01. For content recommendation, ε = 0.10 might be acceptable.
δ: 'How confident do I need to be in this guarantee?' If model retraining is cheap, high δ might be acceptable. If deployment is expensive, you want very low δ.
The key insight is that both parameters affect sample complexity, but in different ways: ε enters polynomially (m grows like 1/ε), while δ enters only logarithmically (m grows like log(1/δ)).
This asymmetry suggests: once you have enough samples for your desired accuracy, getting very high confidence 'comes for free' (relatively speaking).
In practice, the dominant factor is usually 1/ε. The logarithmic dependence on δ means you can typically set δ very small (like 0.001) without dramatically increasing sample requirements. The main sample size driver is how accurate you need to be.
A remarkable feature of PAC learning is that it is distribution-free. The guarantees hold for ANY distribution D over the instance space—we don't need to know or assume anything about D.
Why is this important?
In real-world machine learning, we rarely know the true distribution of the data: the process generating the emails, images, or user behavior we observe is far too complex to model explicitly.
Distribution-free guarantees mean our bounds hold regardless of these unknowns. We don't need to guess or model the distribution—the theory works for whatever the true distribution happens to be.
Distribution-free bounds are worst-case bounds. They hold for the adversarially chosen distribution that makes learning hardest. If you know something about your distribution (e.g., it's smooth, or has certain symmetries), you can often do better. But when you don't know, distribution-free bounds give you a safe lower bar.
What 'Distribution-Free' Really Means:
The PAC definition says 'for every distribution D.' This means the sample bound m₀(ε, δ) must hold simultaneously for every possible distribution—even a worst-case one chosen adversarially after seeing our algorithm.
Example: How Different D's Affect Learning
Consider learning a linear separator in 2D. If D is uniform over the unit square, a moderate sample pins down the boundary reasonably well. If D concentrates its mass near the decision boundary, nearby separators are hard to distinguish and many samples are needed. If D puts its mass far from the boundary, almost any consistent separator is accurate and learning is easy.
A distribution-free bound must handle all these cases, so it's calibrated for the adversarial case. This makes bounds sometimes loose for 'nice' distributions, but reliably safe.
```python
import numpy as np

def visualize_distribution_effects():
    """
    Illustrate how different distributions affect learning difficulty
    for the same concept (a linear separator).
    """
    np.random.seed(42)

    # True separator: x + y = 1 (positive when x + y >= 1)
    def true_concept(x, y):
        return int(x + y >= 1)

    # Distribution 1: uniform over [0, 1]^2
    def uniform_distribution(n):
        points = np.random.uniform(0, 1, (n, 2))
        labels = np.array([true_concept(x, y) for x, y in points])
        return points, labels

    # Distribution 2: concentrated near the boundary (hard)
    def boundary_distribution(n):
        # Points clustered near the line x + y = 1
        t = np.random.uniform(-0.2, 0.2, n)
        u = np.random.uniform(0, 1, n)
        x = np.clip(u + t / 2, 0, 1)
        y = np.clip(1 - u + t / 2, 0, 1)
        points = np.column_stack([x, y])
        labels = np.array([true_concept(xi, yi) for xi, yi in points])
        return points, labels

    # Distribution 3: concentrated far from the boundary (easy)
    def easy_distribution(n):
        # Half the points in the bottom-left, half in the top-right
        points = np.zeros((n, 2))
        labels = np.zeros(n, dtype=int)
        for i in range(n):
            if np.random.random() < 0.5:
                # Negative region (far from the boundary)
                points[i] = np.random.uniform(0, 0.3, 2)
                labels[i] = 0
            else:
                # Positive region (far from the boundary)
                points[i] = np.random.uniform(0.7, 1, 2)
                labels[i] = 1
        return points, labels

    # The PAC guarantee holds for ALL three distributions!
    # But actual learning difficulty varies significantly.
    # With m = 20 samples from each distribution:
    # - Uniform: good chance of finding an approximate separator
    # - Boundary: hard to distinguish separators, need more samples
    # - Easy: very easy, few samples needed

    # Key insight: distribution-free theory bounds the WORST case
    # (which is like the boundary distribution)
    return {
        'uniform': uniform_distribution(100),
        'boundary': boundary_distribution(100),
        'easy': easy_distribution(100)
    }
```

We've now established the foundational concepts of PAC learning. Let's consolidate the key ideas before moving forward:
| Component | Symbol | Role | Key Constraint |
|---|---|---|---|
| Instance Space | X | Set of all possible examples | Fixed in advance |
| Concept Class | C | Hypotheses we're learning over | Determines complexity |
| Target Concept | c* ∈ C | True labeling function | Unknown, but in C |
| Distribution | D | Probability over instances | Unknown, but fixed |
| Training Set | S | m i.i.d. samples from D | Our only information |
| Hypothesis | h | Learner's output | Goal: err_D(h) ≤ ε |
| Accuracy | ε | Maximum tolerable error | m grows as 1/ε |
| Confidence | 1-δ | Probability of success | m grows as log(1/δ) |
What's Next:
Now that we understand the PAC framework, we'll explore sample complexity—exactly how many samples are needed to learn a given concept class. We'll see that sample complexity depends on the 'size' of the concept class and derive concrete bounds for important cases.
The fundamental question becomes: For a concept class C, what is m(ε, δ)? The answer reveals deep connections between learning, combinatorics, and computational complexity.
You now understand the PAC learning framework—the gold standard for theoretical analysis of machine learning. The key insight is that learning is about generalization, and PAC provides rigorous, distribution-free guarantees about when generalization is possible. Next, we'll quantify exactly how many samples are needed through the lens of sample complexity.