Having trained $T$ weak classifiers $h_1, h_2, \ldots, h_T$, AdaBoost faces a critical question: how should their predictions be combined?
The simplest approach—unweighted majority voting—treats all classifiers equally. But this ignores crucial information: some classifiers are more accurate than others. A classifier with 5% error rate should clearly contribute more than one with 45% error rate.
AdaBoost's answer is weighted voting, where each classifier's vote is scaled by:
$$\alpha_t = \frac{1}{2}\ln\left(\frac{1-\varepsilon_t}{\varepsilon_t}\right)$$
The final prediction is:
$$H(x) = \text{sign}\left(\sum_{t=1}^{T} \alpha_t \cdot h_t(x)\right)$$
This page explores why this specific weighting scheme is optimal, analyzes its mathematical properties, compares it to alternatives, and reveals its deep connection to probabilistic reasoning and information theory.
By completing this page, you will: (1) Master the mathematical justification for α-weighted voting; (2) Understand the log-odds interpretation of classifier weights; (3) Analyze how weighted voting amplifies ensemble accuracy; (4) Compare AdaBoost voting to alternatives (uniform, confidence-based); (5) Connect weighted voting to optimal Bayesian committee decisions.
Let's carefully analyze AdaBoost's voting mechanism.
The Final Classifier:
$$H(x) = \text{sign}\left(\sum_{t=1}^{T} \alpha_t \cdot h_t(x)\right) = \text{sign}(F(x))$$
where $F(x) = \sum_{t=1}^{T} \alpha_t \cdot h_t(x)$ is the weighted vote or margin function.
Components:
The Sign Function:
$$\text{sign}(z) = \begin{cases} +1 & \text{if } z > 0 \\ -1 & \text{if } z < 0 \\ 0 & \text{if } z = 0 \end{cases}$$
In practice, ties ($F(x) = 0$) are rare; they can be broken by predicting +1 or by examining individual classifier confidences.
Interpreting F(x):
The magnitude $|F(x)|$ indicates prediction confidence: values near zero mean the weighted votes nearly cancel (a close call), while large values mean the weighted classifiers agree strongly.
This is why $F(x)$ is sometimes called the "margin"—it measures distance from the decision boundary.
| h₁(x) | h₂(x) | h₃(x) | α₁h₁(x), α₁=1.0 | α₂h₂(x), α₂=0.5 | α₃h₃(x), α₃=0.8 | F(x) | H(x) |
|---|---|---|---|---|---|---|---|
| +1 | +1 | +1 | +1.0 | +0.5 | +0.8 | +2.3 | +1 |
| +1 | +1 | -1 | +1.0 | +0.5 | -0.8 | +0.7 | +1 |
| +1 | -1 | -1 | +1.0 | -0.5 | -0.8 | -0.3 | -1 |
| -1 | -1 | -1 | -1.0 | -0.5 | -0.8 | -2.3 | -1 |
| -1 | +1 | +1 | -1.0 | +0.5 | +0.8 | +0.3 | +1 |
Notice row 5: unweighted voting (2 vs 1) would predict +1, which matches weighted voting. But consider if α₁ = 2.0: then F(x) = -2.0 + 0.5 + 0.8 = -0.7, predicting -1. Weighting allows a single highly accurate classifier to override multiple weaker ones.
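A minimal sketch of the weighted vote in code, reproducing row 5 of the table and the hypothetical α₁ = 2.0 variant (the predictions and weights are the illustrative numbers from above, not the output of a trained model):

```python
import numpy as np

def weighted_vote(h, alphas):
    """Combine ±1 predictions h with weights alphas; returns (F(x), H(x))."""
    F = float(np.dot(alphas, h))
    return F, (1 if F >= 0 else -1)   # break the rare F == 0 tie toward +1

# Row 5 of the table: two weak classifiers outvote a stronger one...
F, H = weighted_vote(np.array([-1, +1, +1]), np.array([1.0, 0.5, 0.8]))
print(F, H)   # +0.3, +1  -> same result as an unweighted majority here

# ...but if the first classifier is much more accurate (alpha1 = 2.0), it wins alone
F, H = weighted_vote(np.array([-1, +1, +1]), np.array([2.0, 0.5, 0.8]))
print(F, H)   # -0.7, -1
```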
The weight formula $\alpha_t = \frac{1}{2}\ln\left(\frac{1-\varepsilon_t}{\varepsilon_t}\right)$ has a beautiful probabilistic interpretation through log-odds.
Odds and Log-Odds:
The odds of an event with probability $p$ are: $$\text{odds}(p) = \frac{p}{1-p}$$
The log-odds (or logit) are: $$\text{logit}(p) = \ln\left(\frac{p}{1-p}\right)$$
Rewriting α_t:
For a classifier with error rate $\varepsilon_t$, the accuracy is $1 - \varepsilon_t$. The log-odds of being correct are:
$$\text{logit}(1-\varepsilon_t) = \ln\left(\frac{1-\varepsilon_t}{\varepsilon_t}\right) = 2\alpha_t$$
Therefore: $$\alpha_t = \frac{1}{2} \cdot \text{logit}(\text{accuracy}_t)$$
Interpretation:
The voting weight $\alpha_t$ is half the log-odds of being correct. This means a classifier at chance level ($\varepsilon_t = 0.5$) gets zero weight, a 90%-accurate classifier gets $\alpha_t = \tfrac{1}{2}\ln 9 \approx 1.10$, and the weight grows without bound, but only logarithmically, as the error approaches zero.
Why Half the Log-Odds?
The factor of $\frac{1}{2}$ arises from the bipolar encoding $\{-1, +1\}$. If we used $\{0, 1\}$ encoding, the formula would be:
$$\alpha_t = \ln\left(\frac{1-\varepsilon_t}{\varepsilon_t}\right)$$
which is exactly the log-odds. The half factor is an artifact of how predictions combine in $F(x)$.
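A quick numerical check of this relationship (the error rates below are arbitrary illustrative values):

```python
import numpy as np

def alpha(eps):
    """AdaBoost voting weight for a weak classifier with weighted error eps."""
    return 0.5 * np.log((1 - eps) / eps)

def logit(p):
    """Log-odds of a probability p."""
    return np.log(p / (1 - p))

for eps in [0.01, 0.10, 0.30, 0.49]:
    a = alpha(eps)
    # 2 * alpha equals the log-odds of the classifier's accuracy
    assert np.isclose(2 * a, logit(1 - eps))
    print(f"eps={eps:.2f}  alpha={a:.2f}  logit(accuracy)={logit(1 - eps):.2f}")
```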
Connection to Naive Bayes:
Log-odds voting is equivalent to Naive Bayes classification under the assumption that classifiers are independent. Each $\alpha_t$ contributes to a log-likelihood ratio:
$$\ln\frac{P(y=+1 | h_1, \ldots, h_T)}{P(y=-1 | h_1, \ldots, h_T)} = \sum_{t=1}^{T} \ln\frac{P(h_t | y=+1)}{P(h_t | y=-1)}$$
Under the independence assumption and with known error rates, this yields weighted voting with log-odds weights. AdaBoost's voting is Bayes-optimal for independent classifiers!
The Independence Caveat:
In reality, boosted classifiers are not independent—they're constructed sequentially with dependencies through weight updates. Despite this, empirical studies show AdaBoost's voting works remarkably well, suggesting the independence assumption is a reasonable approximation or that the log-odds weighting is robust to violations.
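This equivalence is easy to verify numerically. The sketch below simulates genuinely independent classifiers with known (made-up) error rates and a uniform label prior, then checks that the log-odds-weighted vote matches the exact Bayes decision computed from the likelihoods (a sanity check of the identity, not an AdaBoost training run):

```python
import numpy as np

rng = np.random.default_rng(0)
eps = np.array([0.10, 0.20, 0.30, 0.35, 0.40])   # assumed per-classifier error rates
alphas = 0.5 * np.log((1 - eps) / eps)           # AdaBoost voting weights

n = 10_000
y = rng.choice([-1, +1], size=n)                 # uniform prior over the true label
# Each classifier independently predicts the true label with probability 1 - eps_t
flip = rng.random((n, len(eps))) < eps
h = np.where(flip, -y[:, None], y[:, None])      # shape (n, T), entries in {-1, +1}

# Weighted-vote decision
H_vote = np.sign(h @ alphas)

# Exact Bayes decision: sum the per-classifier log-likelihood ratios,
# ln P(h_t | y=+1)/P(h_t | y=-1) = h_t * ln((1-eps_t)/eps_t) = 2 * alpha_t * h_t
H_bayes = np.sign(h @ (2 * alphas))

print("agreement with Bayes decision:", np.mean(H_vote == H_bayes))  # 1.0
print("ensemble accuracy:", np.mean(H_vote == y))
```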
Log-odds can also be interpreted as information content: a classifier with α = 1.10 contributes about 1.1 nats of weighted evidence per vote (its full log-likelihood ratio is 2α ≈ 2.2 nats), while one with α = 0.02 contributes almost nothing. The weighted sum F(x) aggregates this evidence for each class.
We've seen that $\alpha_t$ arises from exponential loss minimization during training. But is this weighting also optimal for combining classifiers? The answer is yes, from multiple perspectives.
Perspective 1: Exponential Loss Minimization
Recall that AdaBoost minimizes the exponential loss: $$L(F) = \sum_{i=1}^{n} \exp(-y_i \cdot F(x_i))$$
For additive models $F(x) = \sum_t \alpha_t h_t(x)$, the optimal $\alpha_t$ at each stage is exactly our formula. Thus, the voting weights are stage-wise optimal for exponential loss.
Perspective 2: Margin Maximization
The margin for example $i$ is: $$\rho_i = y_i \cdot F(x_i) = y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)$$
AdaBoost's weighting tends to maximize margins, especially for hard examples. The $\alpha_t$ weights ensure that classifiers contribute to margins proportionally to their informativeness.
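A tiny illustration of margins on made-up data (labels, predictions, and weights are all arbitrary):

```python
import numpy as np

alphas = np.array([1.0, 0.5, 0.8])
# Rows are examples, columns are the predictions h_t(x_i) in {-1, +1}
h = np.array([[+1, +1, -1],
              [-1, +1, +1],
              [-1, -1, -1]])
y = np.array([+1, -1, -1])   # true labels

F = h @ alphas               # weighted votes F(x_i)
margins = y * F              # rho_i = y_i * F(x_i)
print(F)         # [ 0.7  0.3 -2.3]
print(margins)   # [ 0.7 -0.3  2.3]  -> a negative margin marks a misclassified example
```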
Perspective 3: Consistent with Bayes-Optimal Combination
Under the (approximate) assumption of classifier independence, log-odds weighting is Bayes-optimal. Even when this assumption is violated, log-odds provides a principled, well-calibrated combination rule.
Perspective 4: Training Error Bound
The product $\prod_t Z_t$ bounds training error, where: $$Z_t = 2\sqrt{\varepsilon_t(1-\varepsilon_t)}$$
This bound is achieved when using the specific $\alpha_t$ formula. Other weightings would loosen the bound, potentially slowing convergence.
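Concretely, under the usual derivation the normalization factor for a candidate weight $\alpha$ is $Z_t(\alpha) = (1-\varepsilon_t)e^{-\alpha} + \varepsilon_t e^{\alpha}$, and AdaBoost's $\alpha_t$ is its minimizer, with minimum value $2\sqrt{\varepsilon_t(1-\varepsilon_t)}$. A short numerical check (the error rate is arbitrary):

```python
import numpy as np

eps = 0.2                                    # arbitrary weighted error rate
alpha_grid = np.linspace(0.01, 3.0, 10_000)

# Normalization factor Z as a function of the chosen weight alpha
Z = (1 - eps) * np.exp(-alpha_grid) + eps * np.exp(alpha_grid)

alpha_star = 0.5 * np.log((1 - eps) / eps)   # AdaBoost's choice
print("grid minimizer    :", alpha_grid[np.argmin(Z)])          # ~0.693
print("AdaBoost alpha    :", alpha_star)                        # 0.693...
print("minimum Z         :", Z.min())                           # ~0.8
print("2*sqrt(eps(1-eps)):", 2 * np.sqrt(eps * (1 - eps)))      # 0.8
```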
The same α_t formula that optimizes exponential loss during training also turns out to be the optimal voting weight for combining classifiers. This isn't coincidence—it reflects the deep mathematical coherence of AdaBoost's design.
How does AdaBoost's weighted voting compare to simpler alternatives? Let's analyze several schemes.
Scheme 1: Uniform Voting (Simple Majority)
$$H_{\text{uniform}}(x) = \text{sign}\left(\sum_{t=1}^{T} h_t(x)\right)$$
Every classifier gets equal weight = 1.
Problems: it ignores accuracy differences entirely, so a classifier with 45% error counts as much as one with 5% error, and many mediocre classifiers can outvote a few excellent ones.
Scheme 2: Accuracy-Proportional Weighting
$$\alpha_t = 1 - \varepsilon_t$$
Weight equals accuracy (e.g., 90% accurate → weight 0.9).
Problems: the weights are too flat. A coin-flip classifier (50% accuracy) still receives weight 0.5 instead of 0, and the gap between a 60%-accurate and a 99%-accurate classifier is too small to reflect the difference in reliability.
Scheme 3: Odds Weighting (Without Log)
$$\alpha_t = \frac{1 - \varepsilon_t}{\varepsilon_t}$$
Weight equals odds of being correct.
Problems: the weight explodes as the error approaches zero (see the table below), so a single near-perfect classifier can drown out all others, and small estimation errors in $\varepsilon_t$ cause large swings in the weight.
| Error (ε) | Uniform | Accuracy | Odds | Log-Odds (AdaBoost) |
|---|---|---|---|---|
| 0.01 | 1.0 | 0.99 | 99.0 | 2.30 |
| 0.10 | 1.0 | 0.90 | 9.0 | 1.10 |
| 0.20 | 1.0 | 0.80 | 4.0 | 0.69 |
| 0.30 | 1.0 | 0.70 | 2.33 | 0.42 |
| 0.40 | 1.0 | 0.60 | 1.50 | 0.20 |
| 0.49 | 1.0 | 0.51 | 1.04 | 0.02 |
Analysis:
Log-odds (AdaBoost) weighting has several advantages:
Bounded sensitivity: Unlike odds weighting, log-odds grows moderately as error decreases. This prevents a single near-perfect classifier from completely dominating.
Symmetric behavior: The weight for 90% accuracy (+1.10) is the negative of the weight for 10% accuracy (-1.10). This symmetry is mathematically natural.
Additive aggregation: Log-odds sum directly, which is compatible with log-likelihood aggregation in probabilistic models.
Robustness: Small estimation errors in ε_t cause proportionally small changes in α_t, unlike odds weighting.
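A short sketch that recomputes the comparison table above under all four schemes (any error rates could be substituted):

```python
import numpy as np

def weights(eps):
    """Voting weight of a classifier with weighted error eps under four schemes."""
    return {
        'uniform':  1.0,
        'accuracy': 1.0 - eps,
        'odds':     (1.0 - eps) / eps,
        'log_odds': 0.5 * np.log((1.0 - eps) / eps),   # AdaBoost
    }

for eps in [0.01, 0.10, 0.20, 0.30, 0.40, 0.49]:
    w = weights(eps)
    print(f"eps={eps:.2f}  uniform={w['uniform']:.1f}  accuracy={w['accuracy']:.2f}  "
          f"odds={w['odds']:.2f}  log-odds={w['log_odds']:.2f}")
```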
When Alternatives Might Be Preferred:
Stick with AdaBoost's default log-odds weighting unless you have strong reasons otherwise. It's theoretically justified, empirically validated, and robust. The main cases for deviation are: (1) when weak learners are extremely unreliable, making uniform voting more stable, or (2) when you have validation data for explicit weight tuning.
Understanding how voting power distributes across boosting rounds provides insight into AdaBoost's behavior.
Cumulative Voting Power:
The total voting power up to round $T$ is: $$A_T = \sum_{t=1}^{T} \alpha_t$$
Individual classifier $t$ contributes fraction: $$\frac{\alpha_t}{A_T}$$
Typical Distribution:
In most AdaBoost runs, early classifiers have higher $\alpha_t$ values because they face the original, near-uniform weight distribution and can achieve low errors. As training progresses, the example weights concentrate on hard cases, the weighted errors creep toward 0.5, and the corresponding $\alpha_t$ values shrink.
Power Law Distribution:
Voting power often follows a power-law-like distribution: a small fraction of the classifiers, typically from the early rounds, holds a large share of the total voting power, with a long tail of small contributions.
This is not a problem—it reflects that early classifiers capture "easy" patterns with high accuracy, while later classifiers address increasingly subtle distinctions.
```python
import numpy as np
import matplotlib.pyplot as plt


def analyze_voting_power(alphas: list) -> dict:
    """
    Analyze distribution of voting power across classifiers.

    Args:
        alphas: List of classifier weights from AdaBoost

    Returns:
        Dictionary with analysis results
    """
    alphas = np.array(alphas)
    T = len(alphas)
    total_power = alphas.sum()

    # Fractional power per classifier
    fractions = alphas / total_power

    # Power concentration metrics
    sorted_fractions = np.sort(fractions)[::-1]
    sorted_cumulative = np.cumsum(sorted_fractions)

    # How many classifiers hold 50% of the power? How many for 80%?
    classifiers_for_50 = np.searchsorted(sorted_cumulative, 0.5) + 1
    classifiers_for_80 = np.searchsorted(sorted_cumulative, 0.8) + 1

    # Gini coefficient of the weight distribution (trapezoidal Lorenz-curve formula)
    n = len(fractions)
    cumsum = np.cumsum(np.sort(fractions))
    gini = 1 - (2 * cumsum.sum() - 1) / n

    return {
        'total_power': total_power,
        'mean_alpha': alphas.mean(),
        'std_alpha': alphas.std(),
        'max_alpha': alphas.max(),
        'min_alpha': alphas.min(),
        'classifiers_for_50pct': classifiers_for_50,
        'classifiers_for_80pct': classifiers_for_80,
        'top10_share': sorted_cumulative[int(0.1 * T) - 1] if T >= 10 else 1.0,
        'gini_coefficient': gini,
    }


def visualize_voting_power(alphas: list):
    """Create visualization of voting power distribution."""
    alphas = np.array(alphas)
    T = len(alphas)

    fig, axes = plt.subplots(2, 2, figsize=(12, 10))

    # 1. Alpha values over rounds
    ax1 = axes[0, 0]
    ax1.bar(range(1, T + 1), alphas, color='steelblue', alpha=0.7)
    ax1.set_xlabel('Boosting Round')
    ax1.set_ylabel('α (Voting Weight)')
    ax1.set_title('Classifier Weights Across Rounds')
    ax1.axhline(y=alphas.mean(), color='red', linestyle='--',
                label=f'Mean: {alphas.mean():.2f}')
    ax1.legend()

    # 2. Cumulative voting power
    ax2 = axes[0, 1]
    cumulative = np.cumsum(alphas) / alphas.sum() * 100
    ax2.plot(range(1, T + 1), cumulative, 'b-', linewidth=2)
    ax2.axhline(y=50, color='red', linestyle='--', alpha=0.5)
    ax2.axhline(y=80, color='orange', linestyle='--', alpha=0.5)
    ax2.set_xlabel('Number of Classifiers')
    ax2.set_ylabel('Cumulative Voting Power (%)')
    ax2.set_title('Cumulative Voting Power')
    ax2.fill_between(range(1, T + 1), 0, cumulative, alpha=0.3)

    # 3. Lorenz curve
    ax3 = axes[1, 0]
    sorted_alphas = np.sort(alphas)
    lorenz = np.cumsum(sorted_alphas) / alphas.sum()
    ax3.plot([0] + list(np.arange(1, T + 1) / T), [0] + list(lorenz),
             'b-', linewidth=2, label='Voting power distribution')
    ax3.plot([0, 1], [0, 1], 'r--', label='Perfect equality')
    ax3.set_xlabel('Cumulative Share of Classifiers')
    ax3.set_ylabel('Cumulative Share of Voting Power')
    ax3.set_title('Lorenz Curve of Voting Power')
    ax3.legend()
    ax3.set_xlim(0, 1)
    ax3.set_ylim(0, 1)

    # 4. Error vs Weight relationship
    ax4 = axes[1, 1]
    # Reconstruct error rates from alpha values:
    # alpha = 0.5 * ln((1-eps)/eps)  =>  eps = 1 / (1 + exp(2*alpha))
    errors = 1 / (1 + np.exp(2 * alphas))
    ax4.scatter(errors * 100, alphas, c=range(T), cmap='viridis', alpha=0.7)
    ax4.set_xlabel('Weighted Error Rate (%)')
    ax4.set_ylabel('α (Voting Weight)')
    ax4.set_title('Error Rate vs Voting Weight')
    # Add theoretical curve
    eps_range = np.linspace(0.01, 0.49, 100)
    alpha_range = 0.5 * np.log((1 - eps_range) / eps_range)
    ax4.plot(eps_range * 100, alpha_range, 'r-', linewidth=2, label='Theoretical')
    ax4.legend()

    plt.tight_layout()
    plt.savefig('voting_power_analysis.png', dpi=150)
    plt.close()


# Simulate typical AdaBoost behavior
np.random.seed(42)
T = 50

# Error rates typically increase as training progresses
base_errors = 0.15 + 0.25 * np.arange(T) / T   # 15% to 40%
errors = base_errors + 0.05 * np.random.randn(T)
errors = np.clip(errors, 0.01, 0.49)

# Compute alphas
alphas = 0.5 * np.log((1 - errors) / errors)

# Analyze
results = analyze_voting_power(alphas)
print("Voting Power Analysis:")
print(f"  Total voting power: {results['total_power']:.2f}")
print(f"  Mean α: {results['mean_alpha']:.3f}")
print(f"  Std α: {results['std_alpha']:.3f}")
print(f"  Top 10% classifiers hold: {results['top10_share']:.1%} of power")
print(f"  Classifiers needed for 50% power: {results['classifiers_for_50pct']}")
print(f"  Classifiers needed for 80% power: {results['classifiers_for_80pct']}")

visualize_voting_power(alphas)
```

High concentration (few classifiers holding most power) is normal and often healthy—it means early classifiers captured the main signal. Very uniform distribution might indicate the weak learner is too weak or the problem is very hard, requiring many classifiers to make progress.
The weighted vote $F(x)$ provides more than just a class prediction—it can be transformed into probability estimates.
The Raw Score:
$$F(x) = \sum_{t=1}^{T} \alpha_t \cdot h_t(x)$$
Converting to Probabilities:
AdaBoost's exponential loss suggests a natural probability transformation:
$$P(y = +1 | x) = \frac{1}{1 + e^{-2F(x)}}$$
This is the logistic (sigmoid) function applied to (2F(x)). The factor of 2 arises from the bipolar encoding.
Derivation from Odds:
If we interpret $2F(x)$ as the log-odds that the label is $+1$:
$$\text{log-odds}(y=+1) = 2F(x)$$
Then: $$\frac{P(y=+1)}{P(y=-1)} = e^{2F(x)}$$
Since $P(y=+1) + P(y=-1) = 1$: $$P(y=+1) = \frac{e^{2F(x)}}{1 + e^{2F(x)}} = \frac{1}{1 + e^{-2F(x)}}$$
| F(x) | 2F(x) | P(y=+1) | Interpretation |
|---|---|---|---|
| -3.0 | -6.0 | 0.25% | Very confident -1 |
| -2.0 | -4.0 | 1.8% | Confident -1 |
| -1.0 | -2.0 | 11.9% | Moderate -1 |
| -0.5 | -1.0 | 26.9% | Slight -1 |
| 0.0 | 0.0 | 50.0% | Uncertain |
| +0.5 | +1.0 | 73.1% | Slight +1 |
| +1.0 | +2.0 | 88.1% | Moderate +1 |
| +2.0 | +4.0 | 98.2% | Confident +1 |
| +3.0 | +6.0 | 99.75% | Very confident +1 |
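The mapping in the table is just the logistic function applied to $2F(x)$; a short sketch:

```python
import numpy as np

def prob_positive(F):
    """Convert an AdaBoost score F(x) into an (uncalibrated) P(y = +1 | x)."""
    return 1.0 / (1.0 + np.exp(-2.0 * F))

F = np.array([-3.0, -1.0, 0.0, 0.5, 2.0])
print(np.round(prob_positive(F), 4))   # [0.0025 0.1192 0.5    0.7311 0.982 ]
```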
Calibration Issues:
While this transformation provides probability estimates, they may not be well-calibrated. Well-calibrated means: when the model predicts 70% probability, roughly 70% of such predictions should be correct.
Common Issues: AdaBoost optimizes classification margins rather than likelihood, so the sigmoid-transformed scores are often systematically distorted (frequently too extreme as the number of rounds grows), and the independence assumption behind the log-odds transform rarely holds for sequentially trained classifiers.
Calibration Methods: the two standard remedies are Platt scaling (fit a logistic function $\sigma(aF + b)$ to scores on a held-out set) and isotonic regression (fit a monotone, nonparametric mapping from scores to probabilities). Both are implemented in the code below.
Practical Recommendation:
For tasks requiring calibrated probabilities (cost-sensitive classification, ranking), apply Platt scaling or isotonic regression on a held-out calibration set:
```python
import numpy as np
from scipy.optimize import minimize
from sklearn.isotonic import IsotonicRegression


def uncalibrated_probability(F: np.ndarray) -> np.ndarray:
    """Convert AdaBoost scores to uncalibrated probabilities."""
    return 1 / (1 + np.exp(-2 * F))


def platt_scaling(F_train: np.ndarray, y_train: np.ndarray,
                  F_test: np.ndarray) -> np.ndarray:
    """
    Apply Platt scaling for probability calibration.

    Args:
        F_train: AdaBoost scores on calibration set
        y_train: True labels (0 or 1) on calibration set
        F_test: AdaBoost scores to calibrate

    Returns:
        Calibrated probabilities for test set
    """
    # Convert labels to 0/1 if needed
    y_train = (y_train + 1) / 2 if y_train.min() < 0 else y_train

    def neg_log_likelihood(params):
        a, b = params
        p = 1 / (1 + np.exp(-(a * F_train + b)))
        p = np.clip(p, 1e-10, 1 - 1e-10)
        return -np.mean(y_train * np.log(p) + (1 - y_train) * np.log(1 - p))

    result = minimize(neg_log_likelihood, [1.0, 0.0], method='L-BFGS-B')
    a_opt, b_opt = result.x
    return 1 / (1 + np.exp(-(a_opt * F_test + b_opt)))


def isotonic_calibration(F_train: np.ndarray, y_train: np.ndarray,
                         F_test: np.ndarray) -> np.ndarray:
    """Apply isotonic regression for probability calibration."""
    y_train = (y_train + 1) / 2 if y_train.min() < 0 else y_train
    iso_reg = IsotonicRegression(out_of_bounds='clip')
    iso_reg.fit(F_train, y_train)
    return iso_reg.predict(F_test)


def evaluate_calibration(y_true: np.ndarray, y_prob: np.ndarray,
                         n_bins: int = 10) -> dict:
    """Compute calibration metrics (ECE, MCE, Brier score)."""
    y_true = (y_true + 1) / 2 if y_true.min() < 0 else y_true
    bin_edges = np.linspace(0, 1, n_bins + 1)
    bin_indices = np.digitize(y_prob, bin_edges[1:-1])

    ece = 0  # Expected Calibration Error
    mce = 0  # Maximum Calibration Error
    for bin_idx in range(n_bins):
        mask = bin_indices == bin_idx
        if mask.sum() == 0:
            continue
        bin_accuracy = y_true[mask].mean()
        bin_confidence = y_prob[mask].mean()
        bin_size = mask.sum() / len(y_true)
        gap = abs(bin_accuracy - bin_confidence)
        ece += bin_size * gap
        mce = max(mce, gap)

    # Brier score
    brier = np.mean((y_prob - y_true) ** 2)

    return {
        'expected_calibration_error': ece,
        'maximum_calibration_error': mce,
        'brier_score': brier,
    }
```

Binary weighted voting extends naturally to multi-class problems through several strategies.
Strategy 1: One-vs-All (OvA)
Train $K$ binary AdaBoost classifiers, one per class, each separating class $k$ from all remaining classes.
Prediction: $$H(x) = \arg\max_{k} F_k(x)$$
where $F_k(x) = \sum_t \alpha_t^{(k)} h_t^{(k)}(x)$ is the score for class $k$.
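A minimal One-vs-All sketch built from scikit-learn components (assuming a reasonably recent scikit-learn; the dataset and n_estimators=100 are just for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)    # 3 classes

# One binary AdaBoost per class; each produces its own score F_k(x)
ova = OneVsRestClassifier(AdaBoostClassifier(n_estimators=100))
ova.fit(X, y)

scores = ova.decision_function(X)    # shape (n_samples, K): F_k(x) per class
pred = scores.argmax(axis=1)         # H(x) = argmax_k F_k(x)
print("training accuracy:", (pred == y).mean())
```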
Strategy 2: One-vs-One (OvO)
Train $\binom{K}{2}$ binary classifiers, one for each class pair.
Prediction by voting: $$H(x) = \arg\max_{k} \sum_{j \neq k} \mathbb{1}[F_{kj}(x) > 0]$$ where $F_{kj}(x)$ is the score of the binary classifier trained on class $k$ versus class $j$ (positive means class $k$).
Strategy 3: AdaBoost.MH (Multiclass-Hamming)
Extends AdaBoost directly to the multiclass (and multilabel) setting by reducing the problem to binary questions over (example, label) pairs ("does example $x$ belong to class $k$?") and minimizing a Hamming-style loss across all such pairs.
Strategy 4: SAMME (Stagewise Additive Modeling using Multiclass Exponential Loss)
The multiclass extension of AdaBoost, with a modified classifier weight:
$$\alpha_t = \ln\left(\frac{1 - \varepsilon_t}{\varepsilon_t}\right) + \ln(K-1)$$
where $K$ is the number of classes. The extra $\ln(K-1)$ term means a weak learner only needs accuracy better than random guessing ($1/K$) to receive a positive weight.
Prediction: $$H(x) = \arg\max_{k} \sum_{t=1}^{T} \alpha_t \cdot \mathbb{1}[h_t(x) = k]$$
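A small sketch of the SAMME weight formula, illustrating that a positive weight only requires accuracy better than random guessing, i.e. $1 - \varepsilon_t > 1/K$ (the error rates are illustrative):

```python
import numpy as np

def samme_alpha(eps, K):
    """SAMME classifier weight for weighted error eps with K classes."""
    return np.log((1 - eps) / eps) + np.log(K - 1)

K = 10
for eps in [0.50, 0.85, 0.89, 0.91]:
    a = samme_alpha(eps, K)
    # alpha > 0  <=>  accuracy (1 - eps) > 1/K, i.e. better than random guessing
    print(f"eps={eps:.2f}  accuracy={1 - eps:.2f}  alpha={a:+.2f}")
```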
| Strategy | Classifiers | Pros | Cons |
|---|---|---|---|
| One-vs-All | K | Simple, parallel training | Class imbalance issues |
| One-vs-One | K(K-1)/2 | Balanced training | Many classifiers for large K |
| AdaBoost.MH | 1 (multioutput) | Direct optimization | Weak learner must be multioutput |
| SAMME | 1 | Principled extension | Requires multiclass weak learner |
For most applications, SAMME or One-vs-All is recommended. SAMME is the multiclass algorithm implemented by scikit-learn's AdaBoostClassifier (older versions defaulted to the SAMME.R variant, with plain SAMME selectable via algorithm='SAMME'). One-vs-All is simpler and parallelizable; SAMME is theoretically principled. For very large K (hundreds of classes), consider hierarchical approaches.
We've explored AdaBoost's weighted voting mechanism in comprehensive detail. The essentials: the final classifier is $H(x) = \text{sign}\left(\sum_t \alpha_t h_t(x)\right)$; each weight $\alpha_t = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}$ is half the log-odds of that classifier being correct; the same weighting is stage-wise optimal for exponential loss, tightens the training error bound, and is Bayes-optimal for independent classifiers; the score $F(x)$ can be mapped to (approximately calibrated) probabilities via $1/(1 + e^{-2F(x)})$; and the scheme extends to multiclass problems via One-vs-All, One-vs-One, AdaBoost.MH, or SAMME.
The next page explores training error bounds—the remarkable theoretical guarantee that AdaBoost's training error decreases exponentially with the number of boosting rounds. We'll derive the bound and understand its practical implications.
You now understand AdaBoost's weighted voting mechanism in depth: its mathematical formulation, log-odds interpretation, optimality properties, and connections to probability estimation. This knowledge enables you to interpret AdaBoost predictions, tune ensemble behavior, and extend to multi-class settings.