One of AdaBoost's most remarkable properties is its provable training error bound. When Freund and Schapire introduced AdaBoost in 1995, they proved something extraordinary: the training error decreases exponentially with the number of boosting rounds.
This isn't just an empirical observation—it's a mathematical theorem with a clean, elegant proof. The bound states:
$$\text{TrainErr}(H_T) \leq \exp\left(-2\sum_{t=1}^{T} \gamma_t^2\right)$$
where (\gamma_t = \frac{1}{2} - \varepsilon_t) is the "edge" of the (t)-th weak classifier over random guessing, and (\varepsilon_t) is its weighted training error.
If each weak learner has at least (\gamma) edge, then after (T) rounds:
$$\text{TrainErr}(H_T) \leq e^{-2\gamma^2 T}$$
This bound was revolutionary. It meant that any learning algorithm with predictions slightly better than random could be amplified to arbitrary accuracy. In this page, we'll derive this bound rigorously, understand its tightness, and explore what it means in practice.
By completing this page, you will: (1) Derive AdaBoost's training error bound from first principles; (2) Understand the role of the normalization constant Z_t; (3) Analyze when the bound is tight vs. loose; (4) Connect bounds to sample complexity; (5) Explore implications for generalization.
Let's state AdaBoost's training error bound precisely.
Theorem (Freund & Schapire, 1995):
Let (H_T(x) = \text{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)) be the AdaBoost classifier after (T) rounds. Then:
$$\text{TrainErr}(H_T) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}[H_T(x_i) \neq y_i] \leq \prod_{t=1}^{T} Z_t$$
where (Z_t = 2\sqrt{\varepsilon_t(1-\varepsilon_t)}) and (\varepsilon_t) is the weighted error of (h_t).
Corollary (Edge Form):
If (\gamma_t = \frac{1}{2} - \varepsilon_t \geq \gamma > 0) for all (t), then:
$$\text{TrainErr}(H_T) \leq \left(1 - 4\gamma^2\right)^{T/2} \leq e^{-2\gamma^2 T}$$
What This Means:
This theorem establishes that weak and strong learning are equivalent. Any algorithm that can consistently beat random guessing (even by 0.1%) can be boosted to achieve any desired accuracy. This settled a major open question in computational learning theory.
| Edge γ | Weak learner accuracy | T for TrainErr < 1% | T for TrainErr < 0.1% |
|---|---|---|---|
| 0.01 | 51% | 23,026 | 34,539 |
| 0.05 | 55% | 921 | 1,382 |
| 0.10 | 60% | 230 | 346 |
| 0.15 | 65% | 102 | 154 |
| 0.20 | 70% | 58 | 86 |
| 0.30 | 80% | 26 | 38 |
| 0.40 | 90% | 14 | 22 |
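These figures follow directly from solving (e^{-2\gamma^2 T} \leq \delta) for (T). Below is a minimal sketch (the helper name is ours, not from any library) that reproduces the table; entries may differ by one round depending on the rounding convention.

```python
import math

def rounds_for_target_error(gamma: float, delta: float) -> int:
    """Smallest integer T guaranteeing exp(-2 * gamma**2 * T) <= delta."""
    return math.ceil(math.log(1.0 / delta) / (2.0 * gamma ** 2))

for gamma in (0.01, 0.05, 0.10, 0.20, 0.40):
    print(f"edge {gamma:.2f}: "
          f"T for 1% error = {rounds_for_target_error(gamma, 0.01):>6,}, "
          f"T for 0.1% error = {rounds_for_target_error(gamma, 0.001):>6,}")
```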
The proof is elegant and relies on tracking how the normalization constants (Z_t) accumulate.
Step 1: Express Final Weights
Recall the weight update: $$D_{t+1}(i) = \frac{D_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t}$$
Unrolling from (t=1) to (T): $$D_{T+1}(i) = \frac{D_1(i) \prod_{t=1}^{T} \exp(-\alpha_t y_i h_t(x_i))}{\prod_{t=1}^{T} Z_t}$$
Since (D_1(i) = \frac{1}{n}): $$D_{T+1}(i) = \frac{\exp\left(-y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)\right)}{n \prod_{t=1}^{T} Z_t}$$
$$= \frac{\exp(-y_i F_T(x_i))}{n \prod_{t=1}^{T} Z_t}$$
Step 2: Sum Over All Examples
Since (D_{T+1}) is a distribution, it sums to 1: $$1 = \sum_{i=1}^{n} D_{T+1}(i) = \sum_{i=1}^{n} \frac{\exp(-y_i F_T(x_i))}{n \prod_{t=1}^{T} Z_t}$$
Therefore: $$\prod_{t=1}^{T} Z_t = \frac{1}{n} \sum_{i=1}^{n} \exp(-y_i F_T(x_i))$$
Step 3: Bound Training Error
A training error occurs when (H_T(x_i) \neq y_i), which means (y_i F_T(x_i) \leq 0).
For such examples: $$\exp(-y_i F_T(x_i)) \geq \exp(0) = 1$$
Therefore: $$\sum_{i=1}^{n} \exp(-y_i F_T(x_i)) \geq \sum_{i: H_T(x_i) \neq y_i} 1 = n \cdot \text{TrainErr}(H_T)$$
Combining: $$\prod_{t=1}^{T} Z_t = \frac{1}{n} \sum_{i=1}^{n} \exp(-y_i F_T(x_i)) \geq \text{TrainErr}(H_T)$$
$$\boxed{\text{TrainErr}(H_T) \leq \prod_{t=1}^{T} Z_t}$$
Step 4: Express Z_t in Terms of Error
Recall: $$Z_t = \sum_{i=1}^{n} D_t(i) \exp(-\alpha_t y_i h_t(x_i))$$
Split into correct and incorrect: $$Z_t = (1-\varepsilon_t) e^{-\alpha_t} + \varepsilon_t e^{+\alpha_t}$$
With (\alpha_t = \frac{1}{2}\ln\left(\frac{1-\varepsilon_t}{\varepsilon_t}\right)):
$$e^{-\alpha_t} = \sqrt{\frac{\varepsilon_t}{1-\varepsilon_t}}, \quad e^{+\alpha_t} = \sqrt{\frac{1-\varepsilon_t}{\varepsilon_t}}$$
$$Z_t = (1-\varepsilon_t)\sqrt{\frac{\varepsilon_t}{1-\varepsilon_t}} + \varepsilon_t\sqrt{\frac{1-\varepsilon_t}{\varepsilon_t}}$$
$$= \sqrt{\varepsilon_t(1-\varepsilon_t)} + \sqrt{\varepsilon_t(1-\varepsilon_t)} = 2\sqrt{\varepsilon_t(1-\varepsilon_t)}$$
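As a quick sanity check on this closed form, the short sketch below evaluates (Z(\alpha) = (1-\varepsilon)e^{-\alpha} + \varepsilon e^{\alpha}) on a grid for one illustrative value of (\varepsilon) and confirms the minimum sits at (\alpha = \frac{1}{2}\ln\frac{1-\varepsilon}{\varepsilon}) with value (2\sqrt{\varepsilon(1-\varepsilon)}).

```python
import numpy as np

eps = 0.3                                    # illustrative weighted error
alphas = np.linspace(0.01, 2.0, 2000)        # grid of candidate weights
Z = (1 - eps) * np.exp(-alphas) + eps * np.exp(alphas)

alpha_star = 0.5 * np.log((1 - eps) / eps)   # AdaBoost's closed-form choice
print(f"grid minimizer ≈ {alphas[Z.argmin()]:.3f}, closed form = {alpha_star:.3f}")
print(f"min Z on grid  ≈ {Z.min():.4f}, 2*sqrt(eps*(1-eps)) = {2 * np.sqrt(eps * (1 - eps)):.4f}")
```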
Step 5: Convert to Exponential Form
Using (\gamma_t = \frac{1}{2} - \varepsilon_t): $$Z_t = 2\sqrt{\left(\frac{1}{2}-\gamma_t\right)\left(\frac{1}{2}+\gamma_t\right)} = 2\sqrt{\frac{1}{4}-\gamma_t^2} = \sqrt{1-4\gamma_t^2}$$
Since (1 - x \leq e^{-x}) for every real (x), taking square roots with (x = 4\gamma_t^2) gives: $$Z_t = \sqrt{1-4\gamma_t^2} \leq e^{-2\gamma_t^2}$$
Therefore: $$\prod_{t=1}^{T} Z_t \leq \prod_{t=1}^{T} e^{-2\gamma_t^2} = \exp\left(-2\sum_{t=1}^{T}\gamma_t^2\right)$$
If (\gamma_t \geq \gamma) for all (t): $$\text{TrainErr}(H_T) \leq e^{-2\gamma^2 T}$$ ∎
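The identities behind the proof are easy to check numerically. The sketch below runs a bare-bones AdaBoost with exhaustive decision stumps on a small synthetic dataset (all names and data are local to this example) and verifies that the final training error is below (\prod_t Z_t) and that (\prod_t Z_t) matches the average exponential loss from Step 2.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
y = np.where(X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=n) > 0, 1, -1)

def best_stump(X, y, D):
    """Exhaustively pick the threshold stump with the lowest weighted error."""
    best = (0, 0.0, 1, 1.0)                      # (feature, threshold, polarity, error)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = pol * np.where(X[:, j] > thr, 1, -1)
                err = D @ (pred != y)
                if err < best[3]:
                    best = (j, thr, pol, err)
    return best

D = np.full(n, 1.0 / n)       # D_1: uniform weights
F = np.zeros(n)               # running scores F_T(x_i)
prod_Z = 1.0
for t in range(30):
    j, thr, pol, eps = best_stump(X, y, D)
    eps = np.clip(eps, 1e-12, 1 - 1e-12)
    alpha = 0.5 * np.log((1 - eps) / eps)
    pred = pol * np.where(X[:, j] > thr, 1, -1)
    Z = np.sum(D * np.exp(-alpha * y * pred))
    D = D * np.exp(-alpha * y * pred) / Z        # weight update
    F += alpha * pred
    prod_Z *= Z

train_err = np.mean(np.sign(F) != y)
avg_exp_loss = np.mean(np.exp(-y * F))
print(f"training error {train_err:.3f} <= product of Z_t {prod_Z:.3f}")
print(f"product of Z_t {prod_Z:.4f} vs average exponential loss {avg_exp_loss:.4f}")
```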
The proof elegantly connects: (1) The weight update mechanism, (2) The normalization constants Z_t, (3) The exponential loss of misclassified examples, (4) The product bound on training error. Each piece plays an essential role in the final result.
A natural question: Is the bound (\prod_t Z_t) tight, or is it a loose upper bound?
Sources of Looseness:
Margin Gap: The bound replaces the 0–1 loss (\mathbb{1}[y_i F_T(x_i) \leq 0]) with the exponential loss (\exp(-y_i F_T(x_i))). Correctly classified examples with small margins still contribute nearly 1 to this surrogate, and badly misclassified examples contribute far more than 1, so the surrogate can substantially overestimate the true error count.
Product Approximation: The inequality (\sqrt{1-4\gamma^2} \leq e^{-2\gamma^2}) introduces additional looseness for large (\gamma).
Uniform (\gamma) Assumption: The corollary assumes all (\gamma_t \geq \gamma), but in practice (\gamma_t) varies.
When the Bound Is Tight:
The bound is closest to tight when every weak learner barely beats random guessing (each (\gamma_t) is small and roughly constant) and the ensemble's margins stay concentrated near zero, so the exponential loss closely tracks the 0–1 loss it upper-bounds.
When the Bound Is Loose:
The bound can be very loose when the weak learners are stronger than the worst case assumed in the analysis, when the data is nearly separable, or when margins grow quickly, so the actual training error collapses to zero long before (\prod_t Z_t) does.
Empirical Observation:
In practice, AdaBoost often achieves zero training error well before the bound predicts.
This gap arises because the bound assumes worst-case behavior, while AdaBoost's adaptive reweighting often does much better.
```python
import numpy as np
import matplotlib.pyplot as plt

def compute_training_error_bound(errors: np.ndarray) -> np.ndarray:
    """
    Compute cumulative training error bound.

    Args:
        errors: Array of weighted errors ε_t for each round

    Returns:
        Array of cumulative bounds after each round
    """
    Z_t = 2 * np.sqrt(errors * (1 - errors))
    cumulative_bound = np.cumprod(Z_t)
    return cumulative_bound

def compute_exponential_bound(errors: np.ndarray) -> np.ndarray:
    """Compute looser exponential form of bound."""
    gammas = 0.5 - errors
    cumulative_gamma_sq = np.cumsum(gammas ** 2)
    return np.exp(-2 * cumulative_gamma_sq)

def visualize_bound_vs_actual(T: int = 100, gamma: float = 0.15):
    """
    Compare theoretical bound to typical actual training error.

    Creates visualization showing the gap.
    """
    # Generate typical error sequence
    # Errors tend to increase as training progresses
    base_errors = 0.5 - gamma + 0.3 * gamma * np.arange(T) / T
    errors = np.clip(base_errors + 0.03 * np.random.randn(T), 0.01, 0.49)

    # Compute bounds
    product_bound = compute_training_error_bound(errors)
    exp_bound = compute_exponential_bound(errors)

    # Simulate actual training error (typically much lower)
    # This is illustrative - real behavior depends on data
    actual_errors = np.maximum(0, 0.3 - 0.015 * np.arange(T) + 0.01 * np.random.randn(T))
    actual_errors = np.minimum(actual_errors, np.ones(T))
    actual_errors[70:] = 0  # Zero training error after ~70 rounds

    # Plot
    fig, ax = plt.subplots(figsize=(10, 6))
    rounds = np.arange(1, T + 1)

    ax.semilogy(rounds, exp_bound, 'b-', linewidth=2,
                label='Exponential bound: exp(-2γ²T)')
    ax.semilogy(rounds, product_bound, 'g-', linewidth=2,
                label='Product bound: Π Z_t')
    ax.semilogy(rounds, actual_errors + 1e-10, 'r-', linewidth=2,
                label='Typical actual error')

    ax.set_xlabel('Boosting Rounds (T)')
    ax.set_ylabel('Training Error (log scale)')
    ax.set_title('Training Error: Bounds vs. Actual')
    ax.legend()
    ax.grid(True, alpha=0.3)
    ax.set_ylim(1e-4, 1)

    plt.tight_layout()
    plt.savefig('bound_tightness.png', dpi=150)
    plt.close()

    # Compute how much looser the exp bound is
    tightness_ratio = exp_bound / product_bound
    print(f"Average tightness ratio (exp/product): {tightness_ratio.mean():.2f}")
    print(f"Final bound (product): {product_bound[-1]:.4e}")
    print(f"Final bound (exp): {exp_bound[-1]:.4e}")

    return product_bound, exp_bound, actual_errors

product, exp, actual = visualize_bound_vs_actual(T=100, gamma=0.15)
```

Don't use the theoretical bound to predict when training error will hit zero—it's almost always pessimistic. Instead, monitor actual training error and use the bound as a worst-case guarantee that confirms convergence properties.
The training error bound has important implications for sample complexity—how many examples are needed to learn effectively.
Rounds for Target Training Error:
To achieve training error (\delta), we need: $$e^{-2\gamma^2 T} \leq \delta$$ $$T \geq \frac{\ln(1/\delta)}{2\gamma^2}$$
Key Observations:
Inverse Dependence on (\gamma^2): Rounds scale as (O(1/\gamma^2)). Halving the edge quadruples the required rounds.
Logarithmic in (\delta): Reducing the target error from 1% to 0.01% (100× smaller) only doubles the required rounds, since (T) scales with (\ln(1/\delta)) and (\ln(10^4) = 2\ln(10^2)).
No Dependence on (n): The bound is independent of sample size! This is because it's a training error bound, not a generalization bound.
Combining with Generalization Bounds:
For generalization, we need additional analysis. The VC-dimension of AdaBoost ensembles provides:
$$\text{TestErr} \leq \text{TrainErr} + O\left(\sqrt{\frac{T \cdot d_{\text{weak}}}{n} \ln(n)}\right)$$
where (d_{\text{weak}}) is the VC-dimension of the weak learner class.
The Tradeoff:
Increasing (T) drives the training error term down exponentially, but the complexity term grows roughly like (\sqrt{T}). This suggests an optimal number of rounds exists, balancing training error reduction against capacity control.
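To make the tradeoff concrete, the sketch below plugs illustrative numbers into the two terms ((\gamma = 0.1), (d_{\text{weak}} = 3), (n = 100{,}000), and a unit constant hidden in the (O(\cdot))) and finds the (T) that minimizes their sum; these values are assumptions chosen for illustration only.

```python
import numpy as np

gamma, d_weak, n = 0.10, 3, 100_000        # illustrative values only
T = np.arange(1, 2001)

train_term = np.exp(-2 * gamma**2 * T)                  # training error bound
capacity_term = np.sqrt(T * d_weak * np.log(n) / n)     # complexity term, constant taken as 1
total = train_term + capacity_term

T_opt = T[np.argmin(total)]
print(f"bound-minimizing number of rounds ≈ {T_opt} "
      f"(train term {train_term[T_opt - 1]:.3f}, capacity term {capacity_term[T_opt - 1]:.3f})")
```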
| Target δ | γ = 0.05 | γ = 0.10 | γ = 0.20 | γ = 0.30 |
|---|---|---|---|---|
| 10% | 461 | 115 | 29 | 13 |
| 1% | 921 | 230 | 58 | 26 |
| 0.1% | 1382 | 346 | 86 | 38 |
| 0.01% | 1842 | 461 | 115 | 51 |
| 0.001% | 2303 | 576 | 144 | 64 |
The training error bound guarantees convergence but says nothing about generalization. Zero training error doesn't mean zero test error! In the next section, we'll see how margin theory bridges this gap, explaining why AdaBoost often generalizes well despite zero training error.
The training error bound tells only part of the story. The deeper insight comes from margin theory, which explains why AdaBoost continues to improve generalization even after training error hits zero.
The Margin:
For example ((x_i, y_i)), the margin is: $$\rho_i = \frac{y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)}{\sum_{t=1}^{T} \alpha_t} = \frac{y_i F_T(x_i)}{|\alpha|_1}$$
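In code, the normalized margin is just the signed ensemble score divided by the (\ell_1) norm of the round weights. The sketch below assumes you already have a matrix (H) of weak learner predictions (rows are examples, columns are rounds, entries (\pm 1)) and the vector of (\alpha_t); both names are placeholders for this example.

```python
import numpy as np

def normalized_margins(H: np.ndarray, alphas: np.ndarray, y: np.ndarray) -> np.ndarray:
    """rho_i = y_i * sum_t alpha_t h_t(x_i) / sum_t alpha_t; always lies in [-1, 1]."""
    scores = H @ alphas                    # F_T(x_i) for every example
    return y * scores / np.sum(alphas)

# Toy check: three examples, two rounds of +/-1 predictions
H = np.array([[1, 1], [1, -1], [-1, -1]])
alphas = np.array([0.8, 0.4])
y = np.array([1, 1, 1])
print(normalized_margins(H, alphas, y))    # -> approximately [1.0, 0.333, -1.0]
```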
Margin Distribution Bound:
Schapire et al. (1998) proved:
$$P\left(y F_T(x) \leq \theta \cdot |\alpha|_1\right) \leq \prod_{t=1}^{T} Z_t(\theta)$$
where (Z_t(\theta) = e^{\alpha_t\theta} Z_t = (1-\varepsilon_t)e^{-\alpha_t(1-\theta)} + \varepsilon_t e^{\alpha_t(1+\theta)}); at (\theta = 0) this recovers the training error bound (\prod_t Z_t).
For (\theta > 0), this bounds the fraction of examples with margin less than (\theta).
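Given the per-round errors and weights, this bound is a simple product over rounds. The sketch below is a direct transcription of the formula above with illustrative inputs; note that for (\theta) well beyond the typical edge the per-round factor exceeds 1 and the bound becomes vacuous.

```python
import numpy as np

def margin_distribution_bound(eps: np.ndarray, alphas: np.ndarray, theta: float) -> float:
    """prod_t [(1 - eps_t) e^{-alpha_t (1 - theta)} + eps_t e^{alpha_t (1 + theta)}]."""
    Z_theta = (1 - eps) * np.exp(-alphas * (1 - theta)) + eps * np.exp(alphas * (1 + theta))
    return float(np.prod(Z_theta))

# Illustrative run: 50 rounds, each with weighted error 0.35 (edge 0.15)
eps = np.full(50, 0.35)
alphas = 0.5 * np.log((1 - eps) / eps)
for theta in (0.0, 0.05, 0.10, 0.20):
    print(f"theta = {theta:.2f}: bound on fraction with margin <= theta "
          f"= {margin_distribution_bound(eps, alphas, theta):.3f}")
```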
Key Insight:
AdaBoost doesn't just minimize errors—it maximizes margins. Even after training error reaches zero, the exponential loss keeps decreasing, later rounds push low-margin examples toward larger margins, and the whole margin distribution continues to shift toward higher confidence.
The Margin Bound:
A tighter generalization bound involves the minimum margin: $$\rho_{\min} = \min_{i} \frac{y_i F_T(x_i)}{\sum_t \alpha_t}$$
$$\text{TestErr} \leq \text{TrainErr}_{\leq \rho_{\min}} + O\left(\sqrt{\frac{d_{\text{weak}}}{n \cdot \rho_{\min}^2}}\right)$$
where (\text{TrainErr}_{\leq \rho_{\min}}) denotes the fraction of training examples with margin at most (\rho_{\min}).
Larger margin → better bound → better generalization.
Why Margins Matter:
Consider two classifiers that achieve 0% training error: Classifier A separates every training example only barely, with margins close to zero, while Classifier B classifies every example with a comfortably large margin.
Classifier B will generalize much better because its decisions are robust to small perturbations of the inputs, a slightly different training sample would produce nearly the same predictions, and large-margin separators correspond to lower effective complexity in the margin-based bound above.
AdaBoost's Behavior:
Empirically, AdaBoost exhibits distinct phases:
| Phase | Training Error | Margin Behavior | Effect on Generalization |
|---|---|---|---|
| Initial | Decreasing rapidly | Margins forming | Generalization improving |
| Middle | Approaching zero | Margins growing | Generalization improving |
| Late | Zero | Margins still growing | May still improve or plateau |
| Very late | Zero | Margins saturating | Risk of overfitting increases |
The "very late" phase is where AdaBoost can overfit—when margins stop improving but model complexity continues to grow.
This explains a surprising empirical observation: AdaBoost often continues to improve test accuracy for many rounds after training error hits zero. The secret is margin maximization—later rounds make correct predictions more confident, improving robustness to noise and sampling variation.
AdaBoost's training error bound differs fundamentally from bounds for other learning algorithms. Let's compare:
AdaBoost: $$\text{TrainErr}(H_T) \leq e^{-2\gamma^2 T}$$
Perceptron: $$\text{Mistakes} \leq \left(\frac{R}{\gamma}\right)^2$$
SVM (Soft Margin): $$\text{TrainErr} \leq \frac{1}{n}\sum_i \xi_i$$ where (\xi_i) are the slack variables (each training error forces (\xi_i \geq 1)).
Gradient Boosting: convergence guarantees come from optimization theory and depend on the loss function, its curvature, and the learning rate; there is no single closed-form bound analogous to AdaBoost's (see the comparison table below).
| Algorithm | Bound Type | Key Parameters | Convergence |
|---|---|---|---|
| AdaBoost | Exponential | Edge γ, Rounds T | Exponential in T |
| Perceptron | Mistake bound | Margin γ, Radius R | Finite steps |
| SVM | Slack-based | C, ε | Fixed (optimization) |
| Gradient Boosting | Optimization-theoretic | Learning rate, Rounds | Depends on loss |
| Neural Networks | Implicit regularization | Architecture, Training | No simple bound |
What Makes AdaBoost's Bound Special:
Clean Dependence on Weak Learner Quality: The bound explicitly shows how weak learner accuracy (through (\gamma)) affects convergence.
Product Structure: The (\prod_t Z_t) form reveals that each round multiplies progress, not just adds.
Exponential Decay: Guarantees arbitrarily small training error with enough rounds.
Connection to Optimization: The bound is simultaneously an upper bound on the 0–1 training error and the exact value of the exponential loss that AdaBoost greedily minimizes, since (\prod_{t} Z_t = \frac{1}{n}\sum_{i} \exp(-y_i F_T(x_i))).
Limitations:
The AdaBoost bound is more theoretically significant than practically useful. It establishes that boosting works in principle, but actual convergence is usually much faster than the bound suggests. Practitioners should focus on monitoring actual training/validation error rather than computing the bound.
The training error bound, while theoretical, has concrete practical implications.
Implication 1: Weak Learner Selection
The bound (e^{-2\gamma^2 T}) shows quadratic dependence on (\gamma). This means a weak learner with twice the edge needs only a quarter as many rounds, so a small improvement in weak learner quality often beats simply running more iterations.
Rule of Thumb: If your weak learner achieves < 55% accuracy, consider improving it rather than adding more boosting rounds.
Implication 2: Stopping Criteria
The bound suggests natural stopping criteria: stop when the edge (\gamma_t) falls near zero (then (Z_t \approx 1) and further rounds barely shrink the bound), or when (\prod_t Z_t) drops below (1/n), at which point the bound certifies that the training error is exactly zero (it must be a multiple of (1/n)).
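A minimal way to encode these two criteria (the function name and threshold are illustrative, not from any library):

```python
import numpy as np

def should_stop(errors: list, n_samples: int, min_edge: float = 1e-3) -> bool:
    """Stop when the latest edge is negligible or the bound already certifies zero training error."""
    errors = np.asarray(errors)
    gammas = 0.5 - errors
    bound = np.prod(2 * np.sqrt(errors * (1 - errors)))   # product of Z_t
    return gammas[-1] < min_edge or bound < 1.0 / n_samples

print(should_stop([0.30, 0.35, 0.40], n_samples=1000))    # False: edge healthy, bound still large
print(should_stop([0.30] * 80, n_samples=1000))           # True: product bound has fallen below 1/n
```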
Implication 3: Debugging Poor Performance
If AdaBoost isn't converging, inspect the per-round edges: errors (\varepsilon_t) hovering at 0.5 mean the weak learner class cannot exploit the reweighted distributions; errors above 0.5 usually point to a sign or weighting bug; and edges that collapse after only a few rounds suggest the remaining hard examples may be noisy or mislabeled. The monitoring utility below computes these diagnostics.
```python
import numpy as np

def monitor_adaboost_convergence(errors: list, n_samples: int) -> dict:
    """
    Monitor AdaBoost training convergence and compare to bound.

    Args:
        errors: List of weighted errors ε_t for each round
        n_samples: Number of training samples

    Returns:
        Dictionary with convergence diagnostics
    """
    errors = np.array(errors)
    T = len(errors)

    # Compute edges
    gammas = 0.5 - errors

    # Z_t values
    Z_t = 2 * np.sqrt(errors * (1 - errors))

    # Cumulative bounds
    product_bound = np.cumprod(Z_t)

    # Theoretical rounds to achieve 1/n training error
    mean_gamma = gammas.mean()
    if mean_gamma > 0:
        T_for_inv_n = np.log(n_samples) / (2 * mean_gamma**2)
    else:
        T_for_inv_n = np.inf

    # Detect problematic rounds (γ < 0.01)
    weak_rounds = np.where(gammas < 0.01)[0]

    # Convergence rate (how fast bound is decreasing)
    if T > 10:
        recent_Z = Z_t[-10:].mean()
        early_Z = Z_t[:10].mean()
        convergence_slowdown = recent_Z / early_Z
    else:
        convergence_slowdown = 1.0

    return {
        'num_rounds': T,
        'mean_edge': mean_gamma,
        'min_edge': gammas.min(),
        'mean_Z_t': Z_t.mean(),
        'final_bound': product_bound[-1],
        'T_for_inv_n': T_for_inv_n,
        'num_weak_rounds': len(weak_rounds),
        'convergence_slowdown': convergence_slowdown,
        'weak_rounds': weak_rounds.tolist() if len(weak_rounds) <= 10 else weak_rounds[:10].tolist() + ['...'],
    }

# Example usage
np.random.seed(42)
errors = 0.5 - (0.2 - 0.15 * np.arange(50) / 50)  # Declining edge
errors = np.clip(errors + 0.02 * np.random.randn(50), 0.01, 0.49)

diagnostics = monitor_adaboost_convergence(errors.tolist(), n_samples=1000)

print("AdaBoost Convergence Diagnostics:")
print(f"  Rounds: {diagnostics['num_rounds']}")
print(f"  Mean edge: {diagnostics['mean_edge']:.4f}")
print(f"  Min edge: {diagnostics['min_edge']:.4f}")
print(f"  Final bound: {diagnostics['final_bound']:.2e}")
print(f"  Rounds for bound < 1/n: {diagnostics['T_for_inv_n']:.0f}")
print(f"  Convergence slowdown: {diagnostics['convergence_slowdown']:.2f}")

if diagnostics['num_weak_rounds'] > 0:
    print(f"  WARNING: {diagnostics['num_weak_rounds']} rounds with γ < 0.01")
```

We've explored AdaBoost's training error bounds in comprehensive detail. Let's consolidate the essential knowledge:

- The training error of the combined classifier satisfies (\text{TrainErr}(H_T) \leq \prod_t Z_t), and (\prod_t Z_t) equals the average exponential loss on the training set.
- With a uniform edge (\gamma), this becomes (e^{-2\gamma^2 T}): training error decays exponentially in the number of rounds, the formal sense in which weak learning implies strong learning.
- Reaching training error (\delta) takes roughly (\frac{\ln(1/\delta)}{2\gamma^2}) rounds: quadratic in (1/\gamma), logarithmic in (1/\delta), and independent of (n).
- The bound is a worst-case guarantee and is usually loose; in practice, monitor actual training and validation error.
- Zero training error is not the end of the story: margin theory explains why generalization can keep improving afterward.
The final page explores the exponential loss connection—how AdaBoost can be understood as gradient descent in function space, minimizing exponential loss. This perspective unifies AdaBoost with the broader framework of functional optimization.
You now have a complete understanding of AdaBoost's training error bounds: the theorem statement, rigorous proof, tightness analysis, sample complexity implications, and connection to margin theory. This theoretical foundation enables you to analyze AdaBoost's convergence and diagnose potential issues.