Adaptive Boosting (AdaBoost) is one of the most influential ensemble learning algorithms in machine learning history. Introduced by Freund and Schapire in 1996, it revolutionized the field by demonstrating how a collection of "weak learners" — classifiers that perform only slightly better than random guessing — can be combined into a highly accurate "strong learner."
Unlike bagging algorithms (such as Random Forests) that train learners independently, boosting algorithms train learners sequentially, with each subsequent learner focusing on the mistakes made by its predecessors. AdaBoost achieves this through an elegant mechanism of adaptive sample weighting: after each round, the weights of misclassified samples are increased, so the next learner concentrates on the examples the ensemble currently gets wrong.
In this implementation, we use decision stumps as weak learners. A decision stump makes predictions based on a single feature threshold:
$$h(x) = \begin{cases} +1 & \text{if } p \cdot x_j > p \cdot \theta \\ -1 & \text{otherwise} \end{cases}$$
Where $x_j$ is the value of feature $j$, $\theta$ is the decision threshold, and $p \in \{+1, -1\}$ is the polarity, which flips the direction of the inequality.
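The stump rule above can be sketched directly in NumPy. This is a minimal illustration (the helper name `stump_predict` is ours; the parameter names mirror the output format used in the examples below):

```python
import numpy as np

def stump_predict(X, feature_index, threshold, polarity):
    """Predict +1/-1 using a single-feature threshold rule.

    Implements h(x) = +1 if p * x_j > p * theta, else -1.
    """
    predictions = np.ones(X.shape[0])
    feature_values = X[:, feature_index]
    # Polarity +1: predict -1 where x_j <= theta.
    # Polarity -1: the comparison is flipped, so predict -1 where x_j >= theta.
    if polarity == 1:
        predictions[feature_values <= threshold] = -1
    else:
        predictions[feature_values >= threshold] = -1
    return predictions
```

With polarity -1, samples whose feature value lies *below* the threshold are predicted as +1, which is the configuration that appears in all three examples below.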
After training each weak classifier, sample weights are updated using:
$$w_i^{(t+1)} = w_i^{(t)} \cdot \exp(-\alpha_t \cdot y_i \cdot h_t(x_i))$$
Where $\alpha_t$ is the classifier weight computed as:
$$\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$$
Here, $\epsilon_t$ is the weighted error rate of the t-th weak learner. Note that when $\epsilon_t$ is very small (near-perfect classification), $\alpha_t$ becomes large, giving that classifier more voting power in the final ensemble.
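The two update formulas can be worked through on a small concrete case. In this sketch, one of four equally weighted samples is misclassified, so $\epsilon_t = 0.25$:

```python
import numpy as np

# Suppose the weak learner misclassifies 1 of 4 equally weighted samples.
w = np.full(4, 0.25)
y = np.array([1, 1, -1, -1])
h = np.array([1, 1, -1, 1])            # last sample is misclassified

eps = np.sum(w[h != y])                # weighted error: 0.25
alpha = 0.5 * np.log((1 - eps) / eps)  # 0.5 * ln(3) ≈ 0.5493

# Weight update: y*h is -1 for misclassified samples, so they gain weight.
w = w * np.exp(-alpha * y * h)
w = w / np.sum(w)                      # renormalize to sum to 1
```

After renormalization the single misclassified sample carries half of the total weight, so the next weak learner is strongly pulled toward classifying it correctly.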
Implement the adaboost_train function that trains an AdaBoost ensemble classifier. Your function should:
Return a list of dictionaries containing the polarity, threshold, feature_index, and alpha for each weak learner.

Important Implementation Details:
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0], [4.0, 5.0]])
y = np.array([1, 1, -1, -1])
n_clf = 3

Output:
[
{"polarity": -1, "threshold": 2.5, "feature_index": 0, "alpha": 11.5129},
{"polarity": -1, "threshold": 2.5, "feature_index": 0, "alpha": 11.5129},
{"polarity": -1, "threshold": 2.5, "feature_index": 0, "alpha": 11.5129}
]

The dataset has 4 samples with 2 features each. Samples at positions 0 and 1 belong to class +1 (feature 0 values: 1.0 and 2.0), while samples at positions 2 and 3 belong to class -1 (feature 0 values: 3.0 and 4.0).
Optimal Decision Stump: The best threshold is 2.5 on feature 0 with polarity -1. This means samples with feature 0 below 2.5 are predicted as +1, and samples at or above 2.5 are predicted as -1.
This stump perfectly separates the two classes, resulting in zero classification error. When the error approaches zero, alpha becomes very large: 11.5129 ≈ 0.5 × ln((1 − ε)/ε) with ε floored at a small constant such as 1e-10 to avoid division by zero.
Since this stump achieves perfect classification, all three weak learners converge to the same optimal configuration.
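One way the full training loop could look is sketched below. This is a non-authoritative reference implementation under two assumptions consistent with the examples: candidate thresholds are midpoints between consecutive sorted feature values (which yields 2.5, 6.5, and 5.0 in the three examples), and the weighted error is floored at 1e-10, which reproduces alpha ≈ 11.5129 on a perfect split:

```python
import numpy as np

def adaboost_train(X, y, n_clf):
    """Train n_clf decision stumps with AdaBoost; return their parameters."""
    n_samples, n_features = X.shape
    w = np.full(n_samples, 1.0 / n_samples)  # uniform initial weights
    classifiers = []
    for _ in range(n_clf):
        best_error, best, best_preds = np.inf, None, None
        for feature_index in range(n_features):
            values = np.unique(X[:, feature_index])
            # Candidate thresholds: midpoints between consecutive values
            # (an assumption matching the 2.5 / 6.5 / 5.0 thresholds above).
            thresholds = (values[:-1] + values[1:]) / 2.0
            for threshold in thresholds:
                for polarity in (1, -1):
                    # h(x) = +1 if p * x_j > p * theta, else -1
                    preds = np.where(
                        polarity * X[:, feature_index] > polarity * threshold,
                        1.0, -1.0)
                    error = np.sum(w[preds != y])
                    if error < best_error:
                        best_error = error
                        best = {"polarity": polarity,
                                "threshold": float(threshold),
                                "feature_index": feature_index}
                        best_preds = preds
        eps = max(best_error, 1e-10)  # floor keeps alpha finite on perfect splits
        alpha = 0.5 * np.log((1.0 - eps) / eps)
        best["alpha"] = alpha
        # Reweight: misclassified samples (y * h = -1) gain weight.
        w = w * np.exp(-alpha * y * best_preds)
        w = w / np.sum(w)
        classifiers.append(best)
    return classifiers
```

On the first example, every round finds the same zero-error stump (threshold 2.5, feature 0, polarity -1), so all three returned dictionaries coincide, as shown in the expected output.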
X = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [10.0, 10.0], [11.0, 11.0], [12.0, 12.0]])
y = np.array([1, 1, 1, -1, -1, -1])
n_clf = 2

Output:
[
{"polarity": -1, "threshold": 6.5, "feature_index": 0, "alpha": 11.5129},
{"polarity": -1, "threshold": 6.5, "feature_index": 0, "alpha": 11.5129}
]

This dataset has two clearly separated clusters: class +1 samples have low feature values (1-3), while class -1 samples have high values (10-12).
Optimal Threshold Selection: The midpoint between the closest samples of different classes is (3 + 10) / 2 = 6.5. With polarity -1 and threshold 6.5 on feature 0, samples below 6.5 are predicted +1 and samples above 6.5 are predicted -1, classifying every sample correctly.
Both weak learners identify this same perfect split, achieving zero error and maximum alpha values.
X = np.array([[1.0], [2.0], [3.0], [7.0], [8.0], [9.0]])
y = np.array([1, 1, 1, -1, -1, -1])
n_clf = 2

Output:
[
{"polarity": -1, "threshold": 5.0, "feature_index": 0, "alpha": 11.5129},
{"polarity": -1, "threshold": 5.0, "feature_index": 0, "alpha": 11.5129}
]

With a single feature and clear class separation, the optimal threshold lies at the midpoint of the gap: (3 + 7) / 2 = 5.0.
Using polarity -1 (the threshold comparison is flipped), values below 5.0 are predicted +1 and values above 5.0 are predicted -1.
The single-feature case demonstrates how even simple threshold-based rules can achieve perfect separation when data is linearly separable.
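Once trained, the ensemble classifies by a weighted vote, $\text{sign}\left(\sum_t \alpha_t h_t(x)\right)$. A minimal sketch of that prediction step (the helper name `adaboost_predict` is ours; the dictionary keys mirror the output format above):

```python
import numpy as np

def adaboost_predict(X, classifiers):
    """Weighted vote of the trained stumps: sign(sum_t alpha_t * h_t(x))."""
    scores = np.zeros(X.shape[0])
    for clf in classifiers:
        p, theta, j = clf["polarity"], clf["threshold"], clf["feature_index"]
        # Each stump votes +1/-1, scaled by its alpha.
        preds = np.where(p * X[:, j] > p * theta, 1.0, -1.0)
        scores += clf["alpha"] * preds
    return np.sign(scores)
```

Because both stumps in this example are identical and have large alpha, the vote simply reproduces the single stump's perfect labeling of the six samples.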
Constraints