So far, we've established when PAC learning is possible—when sample complexity is polynomial. But possibility doesn't mean practicality. A learning problem might require only O(n/ε) samples yet demand exponential computation time.
This gap between statistical and computational efficiency is a central theme in modern learning theory:
Understanding computational complexity in learning is essential for choosing tractable problem formulations and recognizing when approximations or additional assumptions are necessary.
By the end of this page, you will understand: (1) the definition of efficient PAC learning, (2) computational hardness results for learning (NP-hardness, cryptographic hardness), (3) the relationship between proper and improper learning complexity, (4) concrete examples of computationally hard learning problems, and (5) how assumptions enable tractability.
Let's formalize what it means for learning to be computationally efficient.
Definition: Efficiently PAC-Learnable
A concept class C is efficiently PAC-learnable by hypothesis class H if there exists a learning algorithm A such that:
Sample efficiency: The sample complexity m(n, ε, δ) is polynomial in n, 1/ε, and 1/δ
Time efficiency: The running time of A is polynomial in the total input size, i.e., in n, 1/ε, 1/δ, and the representation size of the target concept
Hypothesis efficiency: Evaluating h(x) for the output hypothesis takes polynomial time in n
Note: All three conditions must hold. An algorithm that takes exponential time isn't useful even if it needs few samples.
Condition 3 is sometimes overlooked. If the learned hypothesis h is so complex that evaluating h(x) takes exponential time, we can't use it for prediction! For example, memorizing all training examples as a lookup table is 'efficient' learning but results in slow prediction.
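Compactly, in the notation used above (this restatement follows the standard textbook definition; size(c) denotes the representation size of the target concept):

```latex
m(n,\varepsilon,\delta)\ \le\ \mathrm{poly}\!\left(n,\ \tfrac{1}{\varepsilon},\ \tfrac{1}{\delta}\right),
\qquad
\mathrm{time}(A)\ \le\ \mathrm{poly}\!\left(n,\ \tfrac{1}{\varepsilon},\ \tfrac{1}{\delta},\ \mathrm{size}(c)\right),
\qquad
\mathrm{time}\big(h(x)\big)\ \le\ \mathrm{poly}(n).
```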
The Computational Hierarchy:
Efficiently PAC-learnable (polynomial time, polynomial samples)
PAC-learnable but not efficiently (polynomial samples, exponential time)
Not PAC-learnable (infinite VC dimension)
The Central Complexity Question:
Given a concept class C and hypothesis class H: is there an algorithm that PAC-learns C by H and runs in time polynomial in n, 1/ε, and 1/δ? In particular, can ERM over H be implemented efficiently?
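As a concrete positive answer for one class, here is a minimal sketch (illustrative code, not from the original materials) of the elimination algorithm that implements ERM for Boolean conjunctions in O(m·n) time, the same algorithm analyzed as Case 1 in the block below:

```python
from typing import List, Tuple

def erm_conjunction(examples: List[Tuple[List[int], int]], n: int) -> List[str]:
    """Elimination algorithm: ERM for Boolean conjunctions in O(m * n) time.

    Start with all 2n literals (x_i and ~x_i); every positive example
    deletes the literals it falsifies. In the realizable case the
    surviving literals form a consistent (most specific) conjunction.
    """
    literals = {(i, v) for i in range(n) for v in (0, 1)}  # (index, required value)
    for x, y in examples:
        if y == 1:
            literals = {(i, v) for (i, v) in literals if x[i] == v}
    return [("x%d" % i) if v == 1 else ("~x%d" % i) for (i, v) in sorted(literals)]

# Illustrative data labeled by the target concept x0 AND ~x2
data = [
    ([1, 0, 0], 1), ([1, 1, 0], 1),   # positives
    ([0, 1, 0], 0), ([1, 1, 1], 0),   # negatives (not needed by elimination)
]
print(erm_conjunction(data, n=3))      # -> ['x0', '~x2']
```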
```python
def analyze_erm_complexity():
    """
    Analyze the computational complexity of ERM for different classes.

    ERM = Find h ∈ H that minimizes training error.
    In the realizable case this is equivalent to finding a hypothesis
    consistent with the training data.
    """
    print("ERM Computational Complexity Analysis")
    print("=" * 60)

    # Case 1: Conjunctions
    print("1. Conjunctions (n Boolean variables)")
    print("-" * 40)
    print("   Hypothesis space: 3^n (each var: +, -, absent)")
    print("   ERM algorithm: Start with all literals, eliminate inconsistent")
    print("   Time complexity: O(m * n)")
    print("   ✅ POLYNOMIAL - Check each literal against positive examples")

    # Case 2: Linear separators (realizable)
    print("2. Linear Separators - Realizable (n dimensions)")
    print("-" * 40)
    print("   Hypothesis space: Infinite (all hyperplanes)")
    print("   ERM algorithm: Linear Programming feasibility")
    print("   Time complexity: O(n^3 + m*n) for interior point methods")
    print("   ✅ POLYNOMIAL - LP is polynomial-time solvable")

    # Case 3: Linear separators (agnostic)
    print("3. Linear Separators - Agnostic")
    print("-" * 40)
    print("   Problem: Minimize misclassification error")
    print("   This is: min over w of |{i : sign(w·xi) ≠ yi}|")
    print("   Time complexity: NP-hard in general!")
    print("   ❌ NP-HARD - Minimizing 0-1 loss is hard when n is part of the input")

    # Case 4: 3-term DNF (proper learning)
    print("4. 3-term DNF (proper learning)")
    print("-" * 40)
    print("   Hypothesis space: O((2n)^(3k)) for k literals per term")
    print("   Problem: Find a 3-term DNF consistent with data")
    print("   Time complexity: NP-hard")
    print("   ❌ NP-HARD - Reduces from graph 3-coloring")

    # Case 5: 3-term DNF (improper with 3-CNF)
    print("5. 3-term DNF → 3-CNF (improper learning)")
    print("-" * 40)
    print("   Every 3-term DNF is equivalent to some 3-CNF")
    print("   3-CNF is efficiently learnable (reduce to conjunctions over O(n^3) clause features)")
    print("   Time complexity: poly(n, m)")
    print("   ✅ POLYNOMIAL - Improper learning bypasses hardness!")

    # Case 6: Intersection of halfspaces
    print("6. Intersection of k halfspaces (k ≥ 2)")
    print("-" * 40)
    print("   Problem: Find k halfspaces whose intersection separates data")
    print("   For k=2: Proper learning is already NP-hard")
    print("   For larger k: believed hard even improperly (under cryptographic assumptions)")
    print("   ⚠️ HARD - Difficulty grows with k")


analyze_erm_complexity()
```

Many natural learning problems are NP-hard, meaning (under P ≠ NP) no polynomial-time algorithm exists. Understanding these hardness results helps us recognize intractable problems.
What Does NP-Hardness Mean for Learning?
A learning problem is NP-hard if a polynomial-time algorithm for it would yield polynomial-time solutions to every problem in NP. In particular, unless P = NP, there is no polynomial-time algorithm that takes a labeled sample and outputs a hypothesis in H meeting the required error guarantee.
Key Result: Consistency Checking is Often NP-Hard
For many concept classes, even deciding whether a consistent hypothesis exists is NP-hard!
NP-hardness of learning is often surprising. A concept class might have polynomial VC dimension (few samples needed) yet ERM is NP-hard. Statistical tractability does NOT imply computational tractability!
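For concreteness, the consistency problem can be written as a decision problem (standard formulation; the notation here is illustrative):

```latex
\textsc{Cons}(C):\ \text{given a sample } S=\{(x_1,y_1),\dots,(x_m,y_m)\},\ 
\text{decide whether } \exists\, c\in C \ \text{such that } c(x_i)=y_i \ \text{for all } i\in\{1,\dots,m\}.
```

If Cons(C) is NP-hard, then no polynomial-time proper ERM algorithm for C exists unless P = NP.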
Classic Hardness Results:
1. Learning 3-term DNF is NP-hard (proper learning)
Reduction from: Graph 3-Coloring
2. Learning Intersections of 2 Halfspaces is NP-hard (proper learning)
3. Minimum Disagreement with a Halfspace is NP-hard
```python
def three_term_dnf_hardness():
    """
    Illustrate the NP-hardness of learning 3-term DNF.

    The reduction from Graph 3-Coloring:
      Input: Graph G = (V, E) with n vertices
      Question: Can vertices be colored with 3 colors such that
                no edge connects same-colored vertices?

    Reduction to 3-term DNF learning:
      - Create n Boolean variables (one per vertex)
      - Create positive examples for valid vertex colorings
      - Create negative examples from edges
      - A consistent 3-term DNF exists ⟺ G is 3-colorable

    Key insight: Each term corresponds to one color class.
    """
    print("NP-Hardness: 3-term DNF (Proper Learning)")
    print("=" * 60)
    print("""Reduction from Graph 3-Coloring:

Given: Graph G = (V, E)
Goal: 3-color vertices so no edge is monochromatic

Construct learning problem:
- Variables: x_v for each vertex v
- Positive examples: Encode valid 3-colorings
- Negative examples: Encode edge constraints

Result:
- G is 3-colorable ⟺ ∃ consistent 3-term DNF
- Each DNF term = one color class
- Variables in term = vertices of that color

Since 3-Coloring is NP-complete, proper 3-term DNF learning is NP-hard.""")

    # Concrete example
    print("Concrete Example:")
    print("-" * 40)
    print("Graph: Triangle (K3) with vertices {A, B, C}")
    print("Edges: (A,B), (B,C), (A,C)")
    print()
    print("3-Coloring exists: Yes (each vertex different color)")
    print("Corresponding 3-term DNF:")
    print("  Term 1 (Red): Contains A")
    print("  Term 2 (Blue): Contains B")
    print("  Term 3 (Green): Contains C")


def min_disagreement_hardness():
    """
    Illustrate the NP-hardness of minimizing 0-1 loss.

    Problem: Given labeled data {(x_i, y_i)}, find a halfspace
    h(x) = sign(w·x) that minimizes:
        Error = |{i : h(x_i) ≠ y_i}|

    This is NP-hard when the dimension is part of the input.
    """
    print("\n" + "=" * 60)
    print("NP-Hardness: Minimum Disagreement Halfspace")
    print("=" * 60)
    print("""Problem: Given labeled points, find the halfspace that
minimizes the number of misclassifications.

Why it's hard:
- The 0-1 loss is discontinuous and non-convex
- Unlike convex losses, can't use gradient descent
- Brute force search over halfspaces is exponential

Reduction from:
- MAX-2-SAT (maximizing satisfied clauses)
- Construct points encoding clauses
- Optimal halfspace ⟺ maximum satisfying assignment

Practical implication:
- We CAN'T efficiently find the best classifier under 0-1 loss
- This is why we use SURROGATE losses:
  - Hinge loss (SVM): convex upper bound on 0-1
  - Logistic loss: smooth approximation
  - Cross-entropy: probabilistic interpretation""")


def improper_learning_escape():
    """
    Show how improper learning can bypass NP-hardness.
    """
    print("\n" + "=" * 60)
    print("Escaping Hardness via Improper Learning")
    print("=" * 60)
    print("""Key observation:

NP-hardness proofs are often specific to PROPER learning
(outputting a hypothesis from the same class being learned).

Example: 3-term DNF
- Proper learning (output 3-term DNF): NP-hard
- Improper learning (output 3-CNF): Polynomial!

Why it works:
1. Every 3-term DNF can be converted to an equivalent 3-CNF
2. 3-CNF has efficient learning algorithms
3. The learned 3-CNF represents the same concept

Trade-off:
- Hypothesis might be larger (3-CNF has more clauses)
- But polynomial-time learning is achieved!

General principle:
- If C is hard to learn properly
- Find a larger class H ⊇ C that's easy to learn
- Learn improperly: output h ∈ H that approximates c ∈ C""")


three_term_dnf_hardness()
min_disagreement_hardness()
improper_learning_escape()
```

Beyond NP-hardness, some learning problems are hard under cryptographic assumptions—even stronger evidence of intractability.
Why Cryptographic Hardness?
NP-hardness is a worst-case result: there exist hard instances, but most instances might be easy. Cryptographic hardness provides average-case hardness: random instances are hard with high probability.
The Connection: Learning and Cryptography
Many cryptographic primitives are based on the assumption that certain functions are hard to learn:
Cryptography assumes certain computational problems are hard. If we could PAC-learn efficiently, we could solve these problems—breaking the cryptosystem! Thus, under cryptographic assumptions, certain learning problems MUST be hard.
The Parity Learning Problem:
Setup: A hidden subset S ⊆ {1, …, n} defines the target; each example x ∈ {0,1}ⁿ is labeled by the parity of the bits indexed by S.
Statistical Complexity: Parities over n bits have VC dimension n, so O(n/ε) samples suffice.
Computational Complexity: Without noise, the samples form a linear system over GF(2); Gaussian elimination recovers S in O(n³) time, so noiseless parity is efficiently learnable.
But with noise...
The LPN problem: Learn parity when each label is flipped with probability η. This is believed to be computationally hard—no polynomial-time algorithm known! LPN is used in lightweight cryptography. If LPN is hard, then learning parity with noise is NOT efficiently PAC-learnable.
LPN Hardness Implications:
The best known algorithm for LPN runs in time 2^{O(n/log n)}—subexponential but not polynomial.
If LPN is hard, then:
Parities cannot be learned efficiently in the presence of random classification noise.
Cryptosystems built on LPN, such as HB-style authentication protocols, remain secure.
Other Cryptographic Connections:
Kearns–Valiant: efficiently learning small Boolean formulas or finite automata would break cryptosystems based on factoring and related assumptions.
Learning With Errors (LWE): a generalization of LPN to larger moduli that underlies lattice-based (post-quantum) cryptography.
```python
import numpy as np


def demonstrate_parity_learning():
    """
    Demonstrate parity learning without noise (easy case).
    Uses Gaussian elimination over GF(2) to find the hidden parity.
    """
    np.random.seed(42)
    n = 8  # 8 variables

    # Hidden parity subset S (unknown to learner)
    S = {1, 3, 5}  # Parity of x_1, x_3, x_5

    def parity_label(x: np.ndarray) -> int:
        """Compute parity for hidden subset S."""
        return sum(x[i] for i in S) % 2

    # Generate training data
    m = n + 5  # Slightly more than n samples
    X = np.random.randint(0, 2, (m, n))
    y = np.array([parity_label(x) for x in X])

    print("Parity Learning (Noiseless)")
    print("=" * 60)
    print(f"Variables: n = {n}")
    print(f"Hidden subset S = {S}")
    print(f"Samples: m = {m}")
    print()

    # Gauss-Jordan elimination over GF(2) on the augmented matrix [X | y]
    A = np.hstack([X, y.reshape(-1, 1)]).astype(int)
    pivot_row = 0
    for col in range(n):
        for row in range(pivot_row, m):
            if A[row, col] == 1:
                A[[pivot_row, row]] = A[[row, pivot_row]]   # swap pivot into place
                for r in range(m):                          # eliminate this column elsewhere
                    if r != pivot_row and A[r, col] == 1:
                        A[r] = (A[r] + A[pivot_row]) % 2
                pivot_row += 1
                break

    # Extract solution: a pivot row with right-hand side 1 puts its variable in S
    learned_S = set()
    for row in range(pivot_row):
        for col in range(n):
            if A[row, col] == 1:
                if A[row, n] == 1:
                    learned_S.add(col)
                break

    print(f"Learned subset: {learned_S}")
    print(f"Correct: {learned_S == S}")
    print()
    print("Time complexity: O(n³) - Polynomial!")
    print("✅ Noiseless parity is efficiently learnable")


def demonstrate_lpn_hardness():
    """
    Demonstrate why Learning Parity with Noise (LPN) is hard:
    with noisy labels, Gaussian elimination no longer works.
    """
    np.random.seed(42)
    n = 8
    noise_rate = 0.1  # 10% label noise
    S = {1, 3, 5}

    def noisy_parity_label(x: np.ndarray) -> int:
        """Compute parity, then flip the label with probability noise_rate."""
        true_label = sum(x[i] for i in S) % 2
        if np.random.random() < noise_rate:
            return 1 - true_label  # Flip
        return true_label

    print("\n" + "=" * 60)
    print("Learning Parity with Noise (LPN)")
    print("=" * 60)
    print(f"Noise rate: {noise_rate * 100}%")
    print()

    # Try to learn with noisy data
    m = 100
    X = np.random.randint(0, 2, (m, n))
    y_noisy = np.array([noisy_parity_label(x) for x in X])

    # Count how many labels are flipped
    y_true = np.array([sum(x[i] for i in S) % 2 for x in X])
    flipped = np.sum(y_noisy != y_true)

    print(f"Samples: m = {m}")
    print(f"Flipped labels: {flipped} ({100 * flipped / m:.1f}%)")
    print()
    print("Why LPN is hard:")
    print("  1. Gaussian elimination assumes exact labels")
    print("  2. With noise, linear algebra gives wrong answer")
    print("  3. Need to find S that explains MOST labels (not all)")
    print("  4. This is related to decoding random linear codes (hard!)")
    print()
    print("Best known algorithm: 2^{O(n/log n)} time")
    print("❌ LPN is believed to be computationally hard")
    print()
    print("Cryptographic implication:")
    print("  If LPN is hard → HB authentication protocol is secure")
    print("  If we could PAC-learn noisy parities → break lightweight crypto!")


def statistical_vs_computational():
    """
    Summarize the statistical vs computational gap.
    """
    print("\n" + "=" * 60)
    print("Statistical vs Computational Complexity")
    print("=" * 60)

    problems = [
        ("Noiseless Parity", "O(n)", "O(n³)", "Efficient"),
        ("Noisy Parity (LPN)", "O(n)", "2^{O(n/log n)}", "Hard!"),
        ("3-term DNF", "O(n log n)", "NP-hard", "Hard (proper)"),
        ("Linear sep (agnostic)", "O(n)", "NP-hard", "Hard"),
        ("Planted clique", "O(log n)", "No efficient known", "Conjectured hard"),
    ]

    print(f"{'Problem':<25} {'Statistical':<15} {'Computational':<20} {'Status':<15}")
    print("-" * 75)
    for prob, stat, comp, status in problems:
        print(f"{prob:<25} {stat:<15} {comp:<20} {status:<15}")


demonstrate_parity_learning()
demonstrate_lpn_hardness()
statistical_vs_computational()
```

A recurring theme in computational learning theory: improper learning is often easier than proper learning.
Definitions Revisited:
Proper learning: the algorithm must output a hypothesis from the concept class being learned (H = C).
Improper learning: the algorithm may output any polynomially evaluatable hypothesis (H ⊇ C), as long as it achieves the PAC error guarantee.
Why Does This Matter?
NP-hardness proofs for learning are typically for proper learning. The reduction requires the output to have a specific form. Improper learning breaks this structure.
Classic Example: 3-term DNF
| Learning Type | Hypothesis Class | Complexity | Result |
|---|---|---|---|
| Proper | 3-term DNF | NP-hard | ❌ Hard |
| Improper | 3-CNF | Polynomial | ✅ Easy |
| Improper | Decision trees | Polynomial | ✅ Easy |
How Improper Learning Helps:
The NP-hardness proof for 3-DNF exploits the constraint that output must have exactly 3 terms. If we allow output to be 3-CNF (which is semantically equivalent but syntactically different), the hardness argument breaks down.
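The conversion itself is mechanical. Below is a minimal sketch (the helper `dnf_to_cnf` and the example formula are illustrative, not part of the original code) that distributes a k-term DNF into an equivalent k-CNF by picking one literal from each term:

```python
from itertools import product
from typing import List

def dnf_to_cnf(dnf_terms: List[List[str]]) -> List[List[str]]:
    """Convert a DNF (list of terms, each a list of literals) into an
    equivalent CNF: one clause per way of choosing a literal from each term.
    A k-term DNF with at most t literals per term yields at most t^k clauses,
    each with at most k literals (i.e. a k-CNF)."""
    return [sorted(set(choice)) for choice in product(*dnf_terms)]

# f = (x1 AND x2) OR (x3 AND ~x4)
dnf = [["x1", "x2"], ["x3", "~x4"]]
for clause in dnf_to_cnf(dnf):
    print(" OR ".join(clause))
# prints the four clauses of the equivalent 2-CNF: x1∨x3, x1∨¬x4, x2∨x3, x2∨¬x4
```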
Practical Implications:
Don't insist on the representation you believe generated the data: a richer hypothesis class (CNFs, decision lists, kernels, ensembles) can turn an intractable search into a tractable one.
The usual price is a larger hypothesis and somewhat more samples, which is often a worthwhile trade.
```python
import numpy as np


def classify_learning_problems():
    """
    Classify learning problems by proper/improper complexity.
    """
    print("Proper vs Improper Learning Complexity")
    print("=" * 60)
    print("""
+------------------+----------------+------------------+
| Concept Class    | Proper Learning| Improper Learning|
+------------------+----------------+------------------+
| Conjunctions     | O(mn)          | O(mn)            |
| k-DNF (fixed k)  | O(m·n^k)       | O(m·n^k)         |
| 3-term DNF       | NP-hard        | poly (3-CNF)     |
| k-term DNF       | NP-hard        | poly (k-CNF)     |
| Decision trees   | NP-hard        | quasi-poly       |
| Intersect 2 HS   | NP-hard        | ??? (open)       |
| Linear sep (agn) | NP-hard        | poly (surrogate) |
+------------------+----------------+------------------+

Key insight: Improper learning often escapes NP-hardness!""")


def dnf_to_cnf_conversion():
    """
    Demonstrate how DNF converts to CNF (enabling improper learning).
    Every k-term DNF can be written as a k-CNF; this is the key to
    escaping hardness.
    """
    print("\n" + "=" * 60)
    print("DNF to CNF Conversion (Why Improper Works)")
    print("=" * 60)
    print("""Example: 2-term DNF

f = (x₁ ∧ x₂) ∨ (x₃ ∧ ¬x₄)

Convert to CNF using distribution:

f = (x₁ ∨ x₃) ∧ (x₁ ∨ ¬x₄) ∧ (x₂ ∨ x₃) ∧ (x₂ ∨ ¬x₄)

Result: 4 clauses (each has 2 literals = 2-CNF)

General case:
- k-term DNF with at most t literals per term
- Converts to: at most t^k clauses (polynomial for fixed k)
- Each clause has at most k literals (k-CNF)

Algorithm:
1. Take one literal from each DNF term
2. All combinations form CNF clauses
3. Number of clauses: (max term size)^(# terms)

Learning implication:
- Learn a 3-CNF from data labeled by a 3-term DNF
- 3-CNF is learnable in polynomial time
- The output 3-CNF is the improper hypothesis for the 3-term DNF""")


def improper_learning_example():
    """
    Demonstrate improper learning: fit a decision stump to DNF-labeled data.
    """
    print("\n" + "=" * 60)
    print("Improper Learning: DNF via Decision Tree")
    print("=" * 60)

    n = 6

    # Target: 2-term DNF  f(x) = (x0 ∧ x1) ∨ (x2 ∧ x3)
    def target_dnf(x):
        term1 = x[0] and x[1]
        term2 = x[2] and x[3]
        return int(term1 or term2)

    # Generate training data
    np.random.seed(42)
    m = 50
    X = np.random.randint(0, 2, (m, n))
    y = np.array([target_dnf(x) for x in X])

    # Build the root split of a decision tree (greedy, improper hypothesis class)
    def build_stump(X, y):
        """Find the single-variable split that classifies the most points correctly."""
        best_score = -1
        best_var = 0
        for var in range(X.shape[1]):
            mask = X[:, var] == 1          # split on x[var] = 0 vs x[var] = 1
            if mask.sum() == 0 or mask.sum() == len(mask):
                continue
            y_left, y_right = y[~mask], y[mask]
            # Score: points classified correctly if we predict 0 left, 1 right
            score = np.sum(y_left == 0) + np.sum(y_right == 1)
            if score > best_score:
                best_score = score
                best_var = var
        return best_var

    best_var = build_stump(X, y)

    print("Target 2-term DNF: (x0 ∧ x1) ∨ (x2 ∧ x3)")
    print(f"Training samples: {m}")
    print(f"Best first split: Variable x{best_var}")
    print()
    print("A decision tree improperly learns the DNF concept:")
    print("the tree has a different structure but can compute the same labels,")
    print("bypassing the NP-hardness of proper DNF learning.")


classify_learning_problems()
dnf_to_cnf_conversion()
improper_learning_example()
```

When learning is computationally hard in the worst case, what can we do? Several types of assumptions can restore tractability:
1. Distributional Assumptions
The hardness results are often worst-case over distributions. With a 'nice' distribution, learning may become easy.
Example: Learning DNF is hard in the worst case, but under the uniform distribution, there are quasi-polynomial algorithms.
PAC learning requires distribution-free guarantees—hardest case over all D. By allowing distribution-dependent algorithms, we can often achieve polynomial-time learning. This is realistic: in practice, we may know something about our data distribution.
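To make the distributional escape hatch concrete, here is a minimal sketch (an illustration, not from the original materials) of the low-degree Fourier approach that uniform-distribution algorithms build on: under the uniform distribution every Fourier coefficient is a simple average, so it can be estimated from samples. The function names and the example target are assumptions made for illustration.

```python
import numpy as np
from itertools import combinations

def estimate_fourier_coeffs(f, n: int, degree: int, m: int = 20000, seed: int = 0):
    """Estimate Fourier coefficients f_hat(S) = E[f(x) * prod_{i in S} x_i]
    for all subsets S with |S| <= degree, from m uniform samples x in {-1,+1}^n.
    Under the uniform distribution each coefficient is an empirical average,
    so it is estimable to accuracy eps with O(log(1/delta)/eps^2) samples."""
    rng = np.random.default_rng(seed)
    X = rng.choice([-1, 1], size=(m, n))
    y = np.array([f(x) for x in X])
    coeffs = {}
    for d in range(degree + 1):
        for S in combinations(range(n), d):
            chi = np.prod(X[:, list(S)], axis=1) if S else np.ones(m)
            coeffs[S] = float(np.mean(y * chi))
    return coeffs

# Example target: majority of the first 3 coordinates (values in {-1,+1})
maj3 = lambda x: int(np.sign(x[0] + x[1] + x[2]))
for S, c in estimate_fourier_coeffs(maj3, n=5, degree=1).items():
    print(S, round(c, 3))   # weight concentrates on S = {0}, {1}, {2} (each ~0.5)
```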
2. Margin/Separability Assumptions
Instead of learning any consistent hypothesis, assume data is well-separated.
Example: Learning halfspaces is NP-hard for 0-1 loss, but with margin γ we can achieve: polynomial-time learning via the Perceptron (at most (R/γ)² updates) or via max-margin convex programs such as the SVM (see the bound and code below).
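For reference, the classical Perceptron mistake bound behind this claim (a standard result, stated in the notation used in the code below):

```latex
\text{If } \|x_i\|\le R \text{ for all } i \text{ and some unit vector } w^{*} \text{ satisfies } y_i\,\langle w^{*}, x_i\rangle \ge \gamma \text{ for all } i,
\text{ then the Perceptron makes at most } \left(\tfrac{R}{\gamma}\right)^{2} \text{ mistakes.}
```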
3. Structural Assumptions
Assume the target has exploitable structure, e.g. sparsity (few relevant features), low rank, or bounded norm. Structure shrinks the effective search space and often admits convex relaxations such as the LASSO (see the code below).
```python
import numpy as np


def margin_based_learning():
    """
    Demonstrate how margin assumptions enable tractability.

    Without margin: agnostic halfspace learning is NP-hard.
    With margin γ:  the Perceptron makes at most O(R²/γ²) mistakes.
    """
    print("Margin Assumption: Enabling Tractability")
    print("=" * 60)
    np.random.seed(42)

    def generate_margin_data(n: int, m: int, margin: float):
        """Generate linearly separable data with margin γ from the boundary.

        True separator: w = [1, 0, ..., 0], b = 0; every point is pushed
        at least γ away from the hyperplane along the first coordinate.
        """
        X = np.random.randn(m, n)
        X[:, 0] = np.sign(X[:, 0]) * (margin + np.abs(X[:, 0]))
        y = np.sign(X[:, 0])
        return X, y

    n = 10
    m = 100

    for margin in [0.1, 0.5, 1.0, 2.0]:
        X, y = generate_margin_data(n, m, margin)

        # Perceptron algorithm: count updates (mistakes) until a full clean pass
        w = np.zeros(n)
        updates = 0
        max_updates = 10000
        while updates < max_updates:
            mistake = False
            for i in range(m):
                if y[i] * np.dot(w, X[i]) <= 0:
                    w += y[i] * X[i]
                    updates += 1
                    mistake = True
            if not mistake:
                break

        print(f"Margin γ={margin:.1f}: Perceptron converged after {updates} updates")

    print()
    print("Key insight: Larger margin → fewer updates")
    print("Perceptron mistake bound: O(R²/γ²) where R = max||x||")
    print("With a margin, the hard problem becomes tractable!")


def sparsity_assumption():
    """
    Demonstrate how sparsity enables tractable learning.
    """
    print("\n" + "=" * 60)
    print("Sparsity Assumption: Learning in High Dimensions")
    print("=" * 60)
    print("""Problem: Linear regression in n = 10,000 dimensions,
but only k = 10 features are relevant (sparse).

Without sparsity assumption:
- Need m = Ω(n) = Ω(10,000) samples
- Time: O(n³) = O(10^12) operations

With sparsity assumption (k-sparse):
- Use L1 regularization (LASSO)
- Need m = O(k log n) = O(10 × 13) ≈ 130 samples!
- Time: O(m × n) = O(1.3 million) — tractable!

The key: Restricted Isometry Property (RIP)
- For k-sparse signals, random matrices preserve distances
- With RIP, LASSO exactly recovers the sparse solution
- Sample complexity: O(k log(n/k)) instead of O(n)""")


def surrogate_loss_assumption():
    """
    Demonstrate surrogate losses for making optimization tractable.
    """
    print("\n" + "=" * 60)
    print("Surrogate Losses: Convexifying Hard Problems")
    print("=" * 60)
    print("""Problem: Minimize 0-1 classification loss

    min_w |{i : sign(w·xi) ≠ yi}|

Issue: 0-1 loss is non-convex, discontinuous → NP-hard

Solution: Replace with a convex surrogate loss

+------------------+-------------------+--------------------------+
| Surrogate        | Formula           | Properties               |
+------------------+-------------------+--------------------------+
| Hinge (SVM)      | max(0, 1-y·f(x))  | Convex, sparse           |
| Logistic         | log(1+e^{-y·f(x)})| Convex, smooth           |
| Exponential      | e^{-y·f(x)}       | Convex, used in AdaBoost |
| Squared          | (1-y·f(x))²       | Convex, not ideal        |
+------------------+-------------------+--------------------------+

Key result: Minimizing a calibrated surrogate also drives down 0-1 loss.

This converts an NP-hard problem into convex optimization:
- Gradient descent works
- Polynomial-time algorithms
- Practical machine learning!""")


margin_based_learning()
sparsity_assumption()
surrogate_loss_assumption()
```

We've explored the computational landscape of PAC learning—when learning is tractable and what makes it hard. Here are the essential insights:
| Problem | Statistical | Computational | Assumption for Tractability |
|---|---|---|---|
| Halfspace (realizable) | O(n/ε) | Poly (LP) | None needed |
| Halfspace (agnostic) | O(n/ε²) | NP-hard | Margin, surrogate loss |
| 3-term DNF (proper) | O(n/ε) | NP-hard | Use 3-CNF (improper) |
| Noisy parity | O(n/ε) | 2^{O(n/log n)} (best known) | None known! |
| Sparse regression | O(k log n/ε) | Poly (LASSO) | k-sparsity |
You've completed the PAC Learning module! You now understand: the PAC framework and its guarantees, sample complexity bounds for finite and infinite classes, what makes concept classes learnable, and the computational barriers and how to overcome them. This foundational knowledge prepares you for the VC Dimension module, which provides the tools to analyze infinite hypothesis classes.