Bagging is not a universal solution—it excels in specific circumstances and provides minimal benefit in others. The ability to recognize when bagging will help (and when it won't) is a mark of ensemble expertise.
This page synthesizes everything we've learned into practical decision frameworks. You'll learn to diagnose whether your problem will benefit from bagging, anticipate the magnitude of improvement, and understand the conditions that make bagging most effective.
By the end of this page, you will understand: (1) The key conditions that make bagging effective, (2) How to diagnose high-variance situations, (3) When bagging provides minimal benefit, (4) Practical decision frameworks, and (5) Real-world case studies demonstrating bagging's impact.
Bagging works by averaging diverse predictions to reduce variance. For this to be effective, three conditions must be met:
Condition 1: High Base Model Variance
The base model must exhibit significant variance—its predictions should change noticeably with different training samples. Models with high variance include deep or unpruned decision trees, k-nearest neighbors with small k, and other flexible learners whose fitted structure depends heavily on exactly which training examples they see.
Models with low variance (linear regression, Naive Bayes, heavily regularized models) show little benefit from bagging—they produce similar predictions regardless of which training samples are used.
Condition 2: Low Correlation Between Models
Bagging's variance reduction is bounded by the correlation between model predictions:
$$\text{Var}_{ensemble} = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$$
When $\rho$ is high (models agree), bagging helps little. The bootstrap sampling must create sufficient diversity.
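To see how strongly $\rho$ caps the benefit, here is a minimal sketch that plugs illustrative values of $\rho$, $\sigma^2$, and $B$ (all hypothetical) into the formula above:

```python
# Minimal sketch: how member correlation caps bagging's variance reduction.
# The values of rho, sigma2, and B below are hypothetical illustrations.

def ensemble_variance(rho, sigma2, B):
    """Var_ensemble = rho * sigma^2 + (1 - rho) / B * sigma^2"""
    return rho * sigma2 + (1 - rho) / B * sigma2

sigma2 = 1.0   # variance of a single model (arbitrary units)
B = 50         # number of bagged models

for rho in [0.1, 0.5, 0.9]:
    var = ensemble_variance(rho, sigma2, B)
    print(f"rho={rho:.1f}: ensemble variance = {var:.3f} "
          f"({var / sigma2:.0%} of a single model)")

# rho=0.1 -> 0.118 (variance cut by ~88%)
# rho=0.9 -> 0.902 (variance barely reduced)
```

Even with 50 models, an ensemble of highly correlated learners retains roughly 90% of a single model's variance, which is why diversity matters as much as ensemble size.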
Factors that increase model correlation (bad for bagging) include a handful of dominant features that every model relies on, very stable base learners that barely react to resampling, and very large datasets whose bootstrap samples are nearly interchangeable.
Condition 3: Low Bias (or Acceptable Bias)
Bagging does not reduce bias. If your base model systematically underfits, averaging more underfitting models won't help. You need a base model flexible enough to fit the training data well (low, or at least acceptable, bias), so that averaging removes variance instead of reinforcing the same systematic error.
| Condition | Diagnostic | If Not Met |
|---|---|---|
| High variance | Predictions change with training sample | Use simpler aggregation or boost instead |
| Low correlation | Models disagree on substantial fraction of samples | Add feature randomization (→ Random Forest) |
| Low bias | Individual models fit training data well | Use more complex base model or switch to boosting |
Train 5-10 models with different random seeds on bootstrapped samples. If their predictions largely agree (>90% agreement), bagging won't help much. If they disagree significantly (30-50% on some examples), bagging has high potential.
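One way to run that quick diagnostic is sketched below; the helper name `bootstrap_agreement`, the synthetic dataset, and the choice of decision trees are illustrative assumptions rather than a prescribed recipe:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def bootstrap_agreement(model_factory, X_train, y_train, X_test, n_models=10, seed=0):
    """Train models on bootstrap samples and measure mean pairwise prediction agreement."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X_train), size=len(X_train))  # bootstrap sample
        model = model_factory().fit(X_train[idx], y_train[idx])
        preds.append(model.predict(X_test))
    preds = np.array(preds)

    # Average agreement over all pairs of models
    agreements = [
        (preds[i] == preds[j]).mean()
        for i in range(n_models) for j in range(i + 1, n_models)
    ]
    return np.mean(agreements)

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

agreement = bootstrap_agreement(lambda: DecisionTreeClassifier(), X_tr, y_tr, X_te)
print(f"Mean pairwise agreement: {agreement:.1%}")
# >90% agreement: bagging unlikely to help much; substantial disagreement: high potential
```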
High variance is the primary driver of bagging's benefit. Here's how to identify and quantify it:
Method 1: Repeated Cross-Validation Variance
Train your model with different random seeds or data splits. High variance in test performance indicates bagging opportunity.
```
Run 1: 85.2% accuracy
Run 2: 82.1% accuracy
Run 3: 87.4% accuracy
Run 4: 83.0% accuracy
Run 5: 86.8% accuracy
Standard deviation: 2.2% → High variance, bagging likely helps
```
Compare to a low-variance model:
```
Run 1: 78.5% accuracy
Run 2: 78.7% accuracy
Run 3: 78.4% accuracy
Run 4: 78.6% accuracy
Run 5: 78.5% accuracy
Standard deviation: 0.1% → Low variance, bagging won't help
```
Method 2: Prediction Stability Analysis
For the same input, measure how predictions change across different model instances:
$$\text{Prediction Stability} = \frac{1}{N}\sum_{i=1}^{N} \frac{\text{std}(\hat{y}_i^{(1)}, ..., \hat{y}_i^{(B)})}{\text{mean}(\hat{y}_i)}$$
A high value of this coefficient means predictions swing noticeably from one model instance to the next: high variance, and therefore a strong bagging opportunity.
Method 3: Learning Curve Shape
Plot validation error vs. training set size. High-variance models show a persistent gap between training and validation error that narrows only as the training set grows. Low-variance models show training and validation curves that converge quickly and then stay close together. The sketch below shows one way to generate these curves.
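A minimal sketch of generating such curves with scikit-learn's `learning_curve`, using a deep decision tree as one plausible high-variance example (swap in your own model and data):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Deep trees are a typical high-variance learner; substitute your own estimator here.
sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5, n_jobs=-1
)

plt.plot(sizes, train_scores.mean(axis=1), label="training score")
plt.plot(sizes, val_scores.mean(axis=1), label="validation score")
plt.xlabel("Training set size")
plt.ylabel("Accuracy")
plt.legend()
plt.title("A persistent train-validation gap suggests high variance")
plt.show()
```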
```python
import numpy as np
from sklearn.model_selection import cross_val_score, learning_curve
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt


def variance_diagnostic(model_factory, X, y, n_iterations=20, cv=5):
    """
    Diagnose model variance through repeated cross-validation.

    Returns:
    --------
    dict with variance metrics and recommendation
    """
    scores = []
    for i in range(n_iterations):
        model = model_factory()
        cv_scores = cross_val_score(model, X, y, cv=cv)
        scores.extend(cv_scores)

    mean_score = np.mean(scores)
    std_score = np.std(scores)
    cv_coefficient = std_score / mean_score  # Coefficient of variation

    # Interpretation
    if std_score > 0.05:
        variance_level = "HIGH"
        recommendation = "Bagging will likely provide significant benefit"
    elif std_score > 0.02:
        variance_level = "MODERATE"
        recommendation = "Bagging may help, worth testing"
    else:
        variance_level = "LOW"
        recommendation = "Bagging unlikely to help much, consider boosting"

    return {
        'mean_score': mean_score,
        'std_score': std_score,
        'cv_coefficient': cv_coefficient,
        'variance_level': variance_level,
        'recommendation': recommendation,
        'all_scores': scores,
    }


def prediction_stability_analysis(models, X):
    """
    Analyze how predictions vary across model instances.

    Parameters:
    -----------
    models : list of fitted models
    X : test data

    Returns:
    --------
    stability metrics per sample
    """
    # Collect predictions
    if hasattr(models[0], 'predict_proba'):
        all_probs = np.array([m.predict_proba(X)[:, 1] for m in models])
        predictions = all_probs
    else:
        predictions = np.array([m.predict(X) for m in models])

    # Per-sample statistics
    sample_means = predictions.mean(axis=0)
    sample_stds = predictions.std(axis=0)

    # Identify high-variance samples
    high_var_mask = sample_stds > np.median(sample_stds) * 2

    return {
        'mean_predictions': sample_means,
        'std_predictions': sample_stds,
        'mean_std': sample_stds.mean(),
        'max_std': sample_stds.max(),
        'high_variance_samples': np.where(high_var_mask)[0],
        'high_variance_fraction': high_var_mask.mean(),
    }


def bagging_benefit_estimator(base_model_factory, X_train, y_train,
                              X_test, y_test, max_estimators=50):
    """
    Estimate potential bagging benefit before full training.

    Trains progressively larger ensembles and measures improvement.
    """
    from sklearn.ensemble import BaggingClassifier

    estimator_counts = [1, 3, 5, 10, 20, 30, 50]
    estimator_counts = [e for e in estimator_counts if e <= max_estimators]

    scores = []
    for n_est in estimator_counts:
        if n_est == 1:
            # Single model baseline
            model = base_model_factory()
            model.fit(X_train, y_train)
            score = model.score(X_test, y_test)
        else:
            ensemble = BaggingClassifier(
                estimator=base_model_factory(),
                n_estimators=n_est,
                random_state=42,
                n_jobs=-1
            )
            ensemble.fit(X_train, y_train)
            score = ensemble.score(X_test, y_test)
        scores.append(score)
        print(f"n_estimators={n_est}: {score:.4f}")

    # Compute improvement
    single_score = scores[0]
    max_score = max(scores)
    improvement = max_score - single_score

    # Estimate saturation point
    improvements = [scores[i+1] - scores[i] for i in range(len(scores)-1)]
    if improvements and max(improvements) > 0:
        marginal_utility = improvements[-1] / improvements[0] if improvements[0] > 0 else 0
    else:
        marginal_utility = 0

    return {
        'estimator_counts': estimator_counts,
        'scores': scores,
        'single_model_score': single_score,
        'best_ensemble_score': max_score,
        'improvement': improvement,
        'marginal_utility': marginal_utility,
        'recommendation': (
            "Strong bagging benefit" if improvement > 0.03
            else "Moderate bagging benefit" if improvement > 0.01
            else "Limited bagging benefit"
        )
    }


def compare_variance_across_models(X, y):
    """
    Compare variance characteristics of different model types.
    """
    models = [
        ("Deep Tree (max_depth=None)", lambda: DecisionTreeClassifier(max_depth=None)),
        ("Shallow Tree (max_depth=3)", lambda: DecisionTreeClassifier(max_depth=3)),
        ("Logistic Regression", lambda: LogisticRegression(max_iter=1000)),
    ]

    results = {}
    for name, factory in models:
        result = variance_diagnostic(factory, X, y, n_iterations=20)
        results[name] = result
        print(f"{name}:")
        print(f"  Mean: {result['mean_score']:.4f}")
        print(f"  Std: {result['std_score']:.4f}")
        print(f"  Level: {result['variance_level']}")
        print(f"  {result['recommendation']}")
        print()

    return results
```

If repeated cross-validation standard deviation exceeds 2-3% of the mean score, bagging is likely to help. If ensemble improvement plateaus before 20-30 trees, you've captured most of the benefit. If improvement continues growing with more trees, your base model has high variance and bagging is working.
Recognizing when bagging won't help is as important as knowing when it will. The key scenarios where bagging provides minimal benefit are:

- Low-variance base models (linear regression, logistic regression, Naive Bayes, heavily regularized models)
- High-bias base models that underfit the training data, since averaging cannot fix systematic error
- Highly correlated ensemble members, for example when a few dominant features drive every model
- Very large training sets, where a single model's variance is already low
Quantitative Thresholds:
| Diagnostic | Bagging Unlikely to Help If |
|---|---|
| CV standard deviation | < 1% of mean |
| Improvement from 1→10 trees | < 0.5% |
| Model correlation $\rho$ | > 0.8 |
| Training-test gap | < 2% (underfit) |
| Feature importance concentration | Top feature > 70% |
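The correlation threshold in this table can be checked empirically: fit a small bagged ensemble and measure how correlated its members' predictions are. A minimal sketch, assuming the default `max_features=1.0` so every member sees all features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

bag = BaggingClassifier(estimator=DecisionTreeClassifier(),
                        n_estimators=10, random_state=0).fit(X_tr, y_tr)

# Probability predictions of each fitted member (shape: n_estimators x n_test)
member_probs = np.array([est.predict_proba(X_te)[:, 1] for est in bag.estimators_])

# Average pairwise Pearson correlation between members
corr = np.corrcoef(member_probs)
rho = corr[np.triu_indices_from(corr, k=1)].mean()
print(f"Estimated member correlation rho ≈ {rho:.2f}")
# rho > 0.8 suggests limited room for variance reduction
```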
Case Study: Why Bagging a Linear Model Fails
Consider bagging logistic regression for a linearly separable problem: every bootstrap sample yields nearly the same decision boundary, because logistic regression is a stable, low-variance learner. The bootstrap replicas are highly correlated ($\rho \approx 1$), so the variance term barely shrinks and the ensemble behaves almost exactly like a single model.
In this case, invest computation in better features or more data, not bagging.
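A minimal sketch of this failure mode on a synthetic, well-separated dataset (the `class_sep=2.0` setting is an illustrative choice):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           class_sep=2.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

single = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
bagged = BaggingClassifier(estimator=LogisticRegression(max_iter=1000),
                           n_estimators=50, random_state=0).fit(X_tr, y_tr)

print(f"Single logistic regression: {single.score(X_te, y_te):.4f}")
print(f"Bagged logistic regression: {bagged.score(X_te, y_te):.4f}")
# Expect nearly identical scores: the base learner is stable, so the
# bootstrap replicas are highly correlated and averaging changes little.
```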
Watch for early saturation. If performance plateaus at 10-20 trees, adding more is wasted computation. But if you're still seeing improvement at 100 trees, your base model has very high variance—consider simplifying it slightly for efficiency.
Different problem types present different opportunities for bagging. Here's what to expect based on your domain:
| Problem Type | Typical Benefit | Key Considerations |
|---|---|---|
| Tabular classification (small) | High (3-8%) | Trees have high variance; bagging works well |
| Tabular classification (large) | Moderate (1-3%) | Less variance due to data quantity; still helps |
| Tabular regression | High (5-15%) | Regression targets amplify variance; bagging very effective |
| Text classification | Moderate (2-5%) | Feature sparsity increases diversity |
| Image classification (CNN) | Moderate (1-3%) | Initialization variance enables ensemble benefit |
| Time series forecasting | Moderate (3-8%) | Care needed: temporal structure in bootstrap |
| Ranking problems | Moderate (2-4%) | Ensemble rankings more stable than individual |
| Anomaly detection | High (5-10%) | Boundary estimation highly variable |
Domain-Specific Considerations:
Medical Diagnosis: even small accuracy gains (2-3%) can be clinically meaningful, and the spread of ensemble predictions provides the per-case uncertainty estimates that high-stakes decisions demand (see Case Study 1 below).
Financial Prediction: interpretability and regulatory constraints often favor simple, low-variance models such as logistic regression, for which bagging adds little; tree-based pipelines benefit far more (see Case Study 2 below).
Computer Vision: bagging deep networks multiplies an already large training cost, and much of the usable diversity comes from random initialization and data ordering rather than from bootstrap resampling alone.
```python
import numpy as np
from sklearn.datasets import (
    make_classification, make_regression, fetch_20newsgroups_vectorized
)
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.model_selection import cross_val_score


def estimate_bagging_benefit_for_problem_type():
    """
    Demonstrate bagging benefit varies by problem type.
    """
    results = {}

    # Binary classification - balanced
    X, y = make_classification(
        n_samples=1000, n_features=20, n_informative=10, random_state=42
    )
    single_scores = cross_val_score(
        DecisionTreeClassifier(random_state=42), X, y, cv=5
    )
    ensemble_scores = cross_val_score(
        RandomForestClassifier(n_estimators=50, random_state=42), X, y, cv=5
    )
    results['binary_classification'] = {
        'single': single_scores.mean(),
        'ensemble': ensemble_scores.mean(),
        'improvement': ensemble_scores.mean() - single_scores.mean()
    }

    # Regression
    X, y = make_regression(
        n_samples=1000, n_features=20, n_informative=10, noise=10, random_state=42
    )
    single_scores = cross_val_score(
        DecisionTreeRegressor(random_state=42), X, y, cv=5,
        scoring='neg_mean_squared_error'
    )
    ensemble_scores = cross_val_score(
        RandomForestRegressor(n_estimators=50, random_state=42), X, y, cv=5,
        scoring='neg_mean_squared_error'
    )
    # Convert to RMSE improvement
    single_rmse = np.sqrt(-single_scores.mean())
    ensemble_rmse = np.sqrt(-ensemble_scores.mean())
    results['regression'] = {
        'single_rmse': single_rmse,
        'ensemble_rmse': ensemble_rmse,
        'improvement_percent': (single_rmse - ensemble_rmse) / single_rmse * 100
    }

    # Imbalanced classification
    X, y = make_classification(
        n_samples=1000, n_features=20, n_informative=10,
        weights=[0.9, 0.1], random_state=42  # 90-10 imbalance
    )
    single_scores = cross_val_score(
        DecisionTreeClassifier(random_state=42), X, y, cv=5, scoring='f1'
    )
    ensemble_scores = cross_val_score(
        RandomForestClassifier(n_estimators=50, random_state=42), X, y, cv=5,
        scoring='f1'
    )
    results['imbalanced'] = {
        'single_f1': single_scores.mean(),
        'ensemble_f1': ensemble_scores.mean(),
        'improvement': ensemble_scores.mean() - single_scores.mean()
    }

    # High-dimensional sparse (text-like)
    X, y = make_classification(
        n_samples=1000, n_features=1000, n_informative=100, random_state=42
    )
    single_scores = cross_val_score(
        DecisionTreeClassifier(random_state=42), X, y, cv=5
    )
    ensemble_scores = cross_val_score(
        RandomForestClassifier(n_estimators=50, random_state=42), X, y, cv=5
    )
    results['high_dimensional'] = {
        'single': single_scores.mean(),
        'ensemble': ensemble_scores.mean(),
        'improvement': ensemble_scores.mean() - single_scores.mean()
    }

    # Print summary
    print("Bagging Benefit by Problem Type:")
    print("=" * 50)
    for problem, metrics in results.items():
        print(f"\n{problem}:")
        for k, v in metrics.items():
            if isinstance(v, float):
                print(f"  {k}: {v:.4f}")
            else:
                print(f"  {k}: {v}")

    return results


def time_series_bagging_considerations():
    """
    Special considerations for time series bagging.

    Standard bagging violates temporal structure.
    Use block bootstrap or moving block bootstrap instead.
    """
    def block_bootstrap(data, block_size, n_samples):
        """
        Block bootstrap for time series.

        Rather than sampling individual points, sample contiguous
        blocks to preserve temporal structure.
        """
        n = len(data)
        n_blocks = n_samples // block_size
        bootstrapped = []
        for _ in range(n_blocks):
            # Random start position for block
            start = np.random.randint(0, n - block_size + 1)
            block = data[start:start + block_size]
            bootstrapped.extend(block)
        return np.array(bootstrapped[:n_samples])

    # Example usage
    np.random.seed(42)
    time_series = np.sin(np.linspace(0, 4*np.pi, 100)) + np.random.randn(100) * 0.1

    # Standard bootstrap (breaks temporal structure)
    standard_boot = np.random.choice(time_series, size=100, replace=True)

    # Block bootstrap (preserves local structure)
    block_boot = block_bootstrap(time_series, block_size=10, n_samples=100)

    print("Standard bootstrap loses temporal structure")
    print("Block bootstrap maintains local patterns")

    return {
        'original_autocorr': np.corrcoef(time_series[:-1], time_series[1:])[0, 1],
        'block_boot_autocorr': np.corrcoef(block_boot[:-1], block_boot[1:])[0, 1],
    }
```

Use this decision framework to determine if bagging is right for your problem:
Step 1: Characterize Your Base Model
Ask: Does my base model have high variance? Look for a noticeable train-test gap, test scores that fluctuate across random seeds or data splits, and a flexible learner (deep trees, k-NN with small k, unregularized models). If variance is low, stop here; bagging will change little.
Step 2: Check for Bias Issues
Ask: Is my base model underfitting? If training accuracy is itself low, you have a bias problem that averaging cannot fix; switch to a more complex base model or to boosting before considering bagging.
Step 3: Estimate Diversity Potential
Ask: Will bootstrap samples produce diverse models? If a few dominant features drive every model, the ensemble members will be highly correlated; add feature randomization (Random Forest) rather than plain bagging. A quick check is sketched below.
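One quick, hedged check of diversity potential uses feature-importance concentration (the threshold mirrors the decision-tree summary further down; it applies only to models exposing `feature_importances_`):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Fraction of total importance captured by the top 3 features
top3 = np.sort(tree.feature_importances_)[-3:].sum()
print(f"Top-3 feature importance share: {top3:.0%}")

if top3 > 0.8:
    print("Dominant features -> bootstrap models will be correlated; "
          "prefer Random Forest-style feature subsampling.")
else:
    print("Importance is spread out -> standard bagging should create diversity.")
```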
Step 4: Run Quick Experiment
```python
# 5-minute bagging test
from sklearn.ensemble import BaggingClassifier

baseline = base_model.fit(X_train, y_train).score(X_test, y_test)

bagged = BaggingClassifier(estimator=base_model, n_estimators=20).fit(X_train, y_train)
bagged_score = bagged.score(X_test, y_test)

improvement = bagged_score - baseline
print(f"Improvement: {improvement:.3f}")

# > 0.02: Significant benefit, proceed with full bagging
# 0.01-0.02: Moderate benefit, consider if compute allows
# < 0.01: Limited benefit, likely not worth it
```
Before committing to full ensemble training, run a quick test with 20 trees. If you see > 2% improvement, proceed. If not, investigate why (low variance? high bias? correlated models?) before investing more compute.
Decision Tree Summary:
```
Start
│
├── Is training accuracy high (>95%)?
│     ├── No  → Model underfits. Use boosting or a more complex model.
│     └── Yes → Continue
│
├── Is there a large train-test gap (>5%)?
│     ├── Yes → High variance! Bagging will likely help.
│     └── No  → May be optimal. Continue.
│
├── 5 random seeds → prediction agreement?
│     ├── < 85% → High variance. Bagging will help!
│     └── > 95% → Low variance. Bagging minimal benefit.
│
├── Top-3 feature importance > 80%?
│     ├── Yes → Models will be correlated. Use Random Forest (feature sampling).
│     └── No  → Standard bagging sufficient.
│
└── Quick 20-tree experiment → improvement?
      ├── > 2%    → Proceed with full bagging.
      ├── 0.5-2%  → Marginal benefit. Consider if compute allows.
      └── < 0.5%  → Not worth it. Try other approaches.
```
Let's examine real scenarios where bagging's impact was measured:
Case Study 1: Medical Image Classification (Skin Cancer Detection)
Setup: 25,000 dermoscopy images, 7-class classification, ResNet-50 base model
Results: the bagged ensemble improved classification accuracy by roughly 3.2% over the single ResNet-50 (see the summary table below) and provided per-image uncertainty estimates from the spread of member predictions.
Analysis: deep networks trained on the same data still vary with initialization and data ordering, so averaging reduced that variance; just as importantly, ensemble disagreement flagged ambiguous cases where a single model would have returned an overconfident prediction.
Lesson: In high-stakes applications, even 2-3% improvement matters, and uncertainty quantification is invaluable.
Case Study 2: Credit Scoring (Financial Services)
Setup: 100,000 loan applications, 50 features, binary default prediction
Results: bagging the logistic regression baseline changed performance by less than 0.1%, while bagging decision trees improved default prediction by about 9.1% (see the summary table below).
Analysis: with 100,000 examples, logistic regression is an extremely stable, low-variance learner, so its bootstrap replicas were nearly identical; the deep trees, in contrast, carried substantial variance that averaging removed.
Lesson: Bagging helps high-variance models; low-variance models need different approaches.
Case Study 3: E-commerce Product Categorization
Setup: 2 million products, 500 categories, text + numeric features
Results: ensembling the gradient-boosting base model yielded only about a 0.8% accuracy gain over a single well-tuned model (see the summary table below), while multiplying training and serving costs across 2 million products.
Analysis: at this data scale the base model's variance was already low, leaving little for bagging to remove; the compute budget was better spent on features, data quality, and tuning.
Lesson: At large scale, the compute cost of ensembles must be weighed against accuracy gains.
| Domain | Base Model | Improvement from Bagging | Verdict |
|---|---|---|---|
| Medical imaging | Deep CNN | +3.2% | Essential for deployment |
| Credit scoring (LogReg) | Logistic Regression | <0.1% | Not worth it |
| Credit scoring (Trees) | Decision Tree | +9.1% | Highly effective |
| E-commerce (large scale) | Gradient Boosting | +0.8% | Marginal, high compute cost |
| Fraud detection | Random Forest | +4.2% | Effective for rare events |
| Time series forecasting | ARIMA ensemble | +12% | Very effective (block bootstrap) |
Bagging's benefit is highly problem-dependent. The same technique can provide essential improvements in one domain and negligible gains in another. Always validate on your specific data rather than assuming universal benefit.
Knowing when bagging will help—and when it won't—separates ensemble practitioners from ensemble experts. The key insights: bagging reduces variance, not bias; it needs a high-variance, low-bias base model and sufficiently decorrelated ensemble members; variance can be diagnosed with repeated cross-validation, prediction-stability analysis, and learning curves; a quick 20-tree experiment tells you whether full-scale bagging is worth the compute; and the size of the payoff is problem-dependent, so always validate on your own data.
Module Complete!
With this page, you've completed Module 6: Bagging for Other Models. You've learned how bagging extends beyond decision trees to other model families, which conditions make it effective, and how to diagnose whether a given problem will benefit.
You're now equipped to apply bagging intelligently across model families, diagnose when it will help, and implement effective ensemble strategies for real-world problems.
Congratulations! You've mastered bagging for diverse model types. You understand not just how to implement bagging, but when and why it works—the mark of true ensemble expertise. Next, explore boosting techniques in Chapter 16 to complete your ensemble methods toolkit.