Imagine you want to estimate the average height of all adults in a country, but you can only afford to measure 1,000 people. You collect your sample, compute the mean, and report your finding. But a nagging question remains: How confident should you be in this estimate?
Traditionally, answering this question required strong assumptions about the underlying population distribution—assumptions that are often violated in practice. In 1979, Bradley Efron introduced a revolutionary idea that would transform statistics and later become the foundation of one of machine learning's most powerful ensemble techniques: the bootstrap.
The bootstrap's audacious premise is simple yet profound: if you can't resample from the true population, resample from your sample instead. This seemingly circular logic, when properly understood, unlocks remarkable statistical power—and when applied to machine learning, it enables the variance reduction that makes bagging so effective.
By the end of this page, you will understand the mathematical foundations of bootstrap sampling, why sampling with replacement is critical, the key theoretical properties that make bootstrap valid, and how these statistical principles translate directly into the bagging algorithm. You'll gain the rigor needed to understand why bagging works, not just how to apply it.
The fundamental problem in statistical inference is that we have access to a sample, but we want to make conclusions about a population. The bootstrap addresses this by treating the sample as a stand-in for the population.
The Core Idea:
Let $\mathcal{D} = \{x_1, x_2, \ldots, x_n\}$ be a sample of $n$ observations drawn independently from some unknown distribution $F$. A bootstrap sample $\mathcal{D}^* = \{x_1^*, x_2^*, \ldots, x_n^*\}$ is created by sampling $n$ observations from $\mathcal{D}$ with replacement.
The phrase "with replacement" is crucial. After selecting an observation, we place it back into the pool of possible selections. This means:
Sampling with replacement is what makes the bootstrap work. Without replacement, every sample would contain exactly the original observations (just reordered), providing no new information. With replacement, we generate genuinely different datasets that simulate what we might have observed if we had drawn different samples from the true population.
The Bootstrap Principle:
The bootstrap principle states that the relationship between the bootstrap samples and the original sample mimics the relationship between samples from the true population and the population itself.
Formally, let $\hat{\theta}$ be an estimator computed from sample $\mathcal{D}$, and let $\hat{\theta}^*$ be the same estimator computed from bootstrap sample $\mathcal{D}^*$. The bootstrap principle asserts:
$$\text{Distribution of } (\hat{\theta}^* - \hat{\theta}) \approx \text{Distribution of } (\hat{\theta} - \theta)$$
where $\theta$ is the true population parameter.
This is a powerful approximation: the variation of $\hat{\theta}^*$ around $\hat{\theta}$ (which we can compute) approximates the variation of $\hat{\theta}$ around $\theta$ (which we cannot directly observe).
```python
import numpy as np

def bootstrap_sample(data, random_state=None):
    """
    Generate a single bootstrap sample from the data.

    Parameters:
    -----------
    data : array-like
        Original dataset of n observations
    random_state : int, optional
        Random seed for reproducibility

    Returns:
    --------
    bootstrap_indices : ndarray
        Indices of selected observations
    bootstrap_data : ndarray
        The bootstrap sample
    """
    rng = np.random.default_rng(random_state)
    n = len(data)

    # Sample n indices with replacement
    bootstrap_indices = rng.choice(n, size=n, replace=True)

    # Create bootstrap sample
    bootstrap_data = data[bootstrap_indices]

    return bootstrap_indices, bootstrap_data

def generate_bootstrap_samples(data, B, random_state=None):
    """
    Generate B bootstrap samples from the data.

    Parameters:
    -----------
    data : array-like
        Original dataset
    B : int
        Number of bootstrap samples to generate
    random_state : int, optional
        Random seed for reproducibility

    Returns:
    --------
    list of tuples (indices, bootstrap_sample)
    """
    rng = np.random.default_rng(random_state)
    samples = []
    for b in range(B):
        indices, sample = bootstrap_sample(
            data, random_state=rng.integers(0, 2**31)
        )
        samples.append((indices, sample))
    return samples

# Example: Bootstrap sampling demonstration
np.random.seed(42)
original_data = np.array([10, 20, 30, 40, 50])
print(f"Original data: {original_data}")
print(f"Original indices: [0, 1, 2, 3, 4]\n")

print("Three bootstrap samples:")
for i in range(3):
    indices, sample = bootstrap_sample(original_data, random_state=i)
    print(f"  Sample {i+1}: indices = {indices}, values = {sample}")

# Output:
# Original data: [10 20 30 40 50]
# Original indices: [0, 1, 2, 3, 4]
#
# Three bootstrap samples:
#   Sample 1: indices = [0 4 3 0 4], values = [10 50 40 10 50]
#   Sample 2: indices = [4 2 0 0 1], values = [50 30 10 10 20]
#   Sample 3: indices = [0 2 2 0 0], values = [10 30 30 10 10]
```

Notice how in the bootstrap samples above:
• Some observations appear multiple times (e.g., 10 appears twice in Sample 1)
• Some observations are missing entirely (e.g., 20 is absent from Sample 1)

This variability across bootstrap samples is precisely what allows us to estimate the variability of statistics computed from the original sample.
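To see the bootstrap principle in action for a statistic with no simple variance formula, here is a minimal sketch. It assumes a synthetic Exponential(1) population (chosen only so the true median is known) and compares the bootstrap deviations $\hat{\theta}^* - \hat{\theta}$ with the true sampling deviations $\hat{\theta} - \theta$ for the sample median:

```python
# A minimal sketch of the bootstrap principle for the sample median.
# Assumption: an Exponential(1) population, used only so the truth is known.
import numpy as np

rng = np.random.default_rng(0)
true_median = np.log(2.0)       # median of Exponential(1)
n, B = 200, 2000

sample = rng.exponential(1.0, size=n)
theta_hat = np.median(sample)

# Bootstrap world: distribution of (theta_hat* - theta_hat), computable from the sample alone
boot_devs = np.array([
    np.median(rng.choice(sample, size=n, replace=True)) - theta_hat
    for _ in range(B)
])

# Real world (possible here only because we simulate): distribution of (theta_hat - theta)
true_devs = np.array([
    np.median(rng.exponential(1.0, size=n)) - true_median
    for _ in range(B)
])

print(f"SD of bootstrap deviations: {boot_devs.std(ddof=1):.4f}")
print(f"SD of true sampling deviations: {true_devs.std(ddof=1):.4f}")
```

The two spreads should agree closely, which is exactly what the bootstrap principle predicts.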
Understanding the probability properties of bootstrap sampling is essential for grasping why bagging works. Let's derive the key results rigorously.
Probability of a Specific Observation Being Selected:
Consider a single draw from the original dataset of $n$ observations. The probability that we select observation $x_i$ is:
$$P(\text{select } x_i \text{ in one draw}) = \frac{1}{n}$$
Since we make $n$ independent draws (with replacement), the probability that observation $x_i$ is selected at least once in the bootstrap sample is:
$$P(x_i \in \mathcal{D}^*) = 1 - P(x_i \text{ not selected in any draw})$$
$$= 1 - \left(1 - \frac{1}{n}\right)^n$$
As $n \to \infty$, the probability that a specific observation appears in a bootstrap sample converges to a famous constant:
$$\lim_{n \to \infty} \left[1 - \left(1 - \frac{1}{n}\right)^n\right] = 1 - \frac{1}{e} \approx 0.632$$
This means approximately 63.2% of original observations appear in each bootstrap sample, while 36.8% are left out. This "out-of-bag" fraction becomes critically important for model evaluation in bagging.
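The limit is approached quickly. A short numerical check (simple arithmetic added for illustration) shows the inclusion probability is already near 0.632 for modest sample sizes:

$$1 - \left(1 - \tfrac{1}{5}\right)^5 = 1 - 0.32768 \approx 0.672, \qquad 1 - \left(1 - \tfrac{1}{100}\right)^{100} \approx 0.634, \qquad 1 - \left(1 - \tfrac{1}{1000}\right)^{1000} \approx 0.632$$

Even with only five observations, roughly two thirds of them appear in a typical bootstrap sample.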
Derivation of the Limit:
Recall the definition of $e$:
$$e = \lim_{n \to \infty} \left(1 + \frac{1}{n}\right)^n$$
Equivalently:
$$\frac{1}{e} = \lim_{n \to \infty} \left(1 - \frac{1}{n}\right)^n$$
This is because $\left(1 - \frac{1}{n}\right)^n = \frac{1}{\left(1 + \frac{1}{n-1}\right)^n} \to \frac{1}{e}$ as $n \to \infty$.
Therefore:
$$\lim_{n \to \infty} P(x_i \in \mathcal{D}^*) = 1 - \frac{1}{e} \approx 0.6321$$
Expected Number of Times an Observation Appears:
For each of the $n$ draws, observation $x_i$ has probability $\frac{1}{n}$ of being selected. The expected number of times $x_i$ appears in the bootstrap sample follows a Binomial distribution:
$$N_i \sim \text{Binomial}\left(n, \frac{1}{n}\right)$$
$$E[N_i] = n \cdot \frac{1}{n} = 1$$
$$\text{Var}(N_i) = n \cdot \frac{1}{n} \cdot \left(1 - \frac{1}{n}\right) = 1 - \frac{1}{n} \to 1 \text{ as } n \to \infty$$
For large $n$, the distribution of $N_i$ converges to a Poisson(1) distribution, which has an interesting property: $P(N_i = 0) = e^{-1} \approx 0.368$, confirming our earlier result.
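As a quick sanity check of this Poisson approximation, the following sketch (which assumes SciPy is available) compares the exact Binomial$(n, 1/n)$ probabilities with Poisson(1) probabilities:

```python
# Compare Binomial(n, 1/n) with Poisson(1) for the number of times
# a fixed observation appears in a bootstrap sample (a quick sketch).
import numpy as np
from scipy import stats

n = 1000
ks = np.arange(6)
binom_pmf = stats.binom.pmf(ks, n, 1.0 / n)
poisson_pmf = stats.poisson.pmf(ks, 1.0)

for k, b, p in zip(ks, binom_pmf, poisson_pmf):
    print(f"P(N_i = {k}):  Binomial = {b:.5f}   Poisson(1) = {p:.5f}")
# In particular, P(N_i = 0) is close to 1/e ≈ 0.3679 for both.
```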
| Quantity | Formula | Value (n→∞) | Interpretation |
|---|---|---|---|
| $P(x_i \in \mathcal{D}^*)$ | $1 - (1 - 1/n)^n$ | ≈ 0.632 | Probability observation appears at least once |
| $P(x_i \notin \mathcal{D}^*)$ | $(1 - 1/n)^n$ | ≈ 0.368 | Probability observation is left out (OOB) |
| $E[N_i]$ | $1$ | $1$ | Expected appearances per observation |
| $E[\#\text{unique}]$ | $n(1 - (1-1/n)^n)$ | ≈ 0.632n | Expected number of unique observations |
| $E[\#\text{OOB}]$ | $n(1-1/n)^n$ | ≈ 0.368n | Expected number of out-of-bag observations |
```python
import numpy as np
from collections import Counter

def verify_bootstrap_probabilities(n=1000, num_bootstrap_samples=10000):
    """
    Empirically verify the theoretical bootstrap probabilities.
    """
    original_data = np.arange(n)

    # Track how many times each observation appears across many bootstrap samples
    inclusion_counts = np.zeros(n)  # How many bootstrap samples include each obs
    appearance_counts = []          # Distribution of how many times obs appear in a sample
    unique_counts = []              # Number of unique observations per sample

    for _ in range(num_bootstrap_samples):
        # Generate bootstrap sample
        bootstrap_indices = np.random.choice(n, size=n, replace=True)

        # Count unique observations in this sample
        unique_obs = np.unique(bootstrap_indices)
        unique_counts.append(len(unique_obs))

        # Track which observations were included
        inclusion_counts[unique_obs] += 1

        # Track appearance distribution
        counter = Counter(bootstrap_indices)
        appearance_counts.extend(counter.values())

    # Compute empirical probabilities
    empirical_inclusion_prob = inclusion_counts.mean() / num_bootstrap_samples
    theoretical_inclusion_prob = 1 - (1 - 1/n)**n

    empirical_unique_fraction = np.mean(unique_counts) / n
    theoretical_unique_fraction = 1 - (1 - 1/n)**n

    # Mean appearances among observations that appear at least once:
    # a Poisson(1) conditional on being > 0 has mean 1 / (1 - 1/e) ≈ 1.582
    empirical_mean_appearances = np.mean(appearance_counts)
    theoretical_mean_appearances = 1 / (1 - 1/np.e)

    print(f"Bootstrap Probability Verification (n={n}, B={num_bootstrap_samples})")
    print("=" * 60)
    print(f"\nProbability of inclusion:")
    print(f"  Theoretical: {theoretical_inclusion_prob:.4f}")
    print(f"  Empirical:   {empirical_inclusion_prob:.4f}")
    print(f"  1 - 1/e:     {1 - 1/np.e:.4f}")
    print(f"\nFraction of unique observations:")
    print(f"  Theoretical: {theoretical_unique_fraction:.4f}")
    print(f"  Empirical:   {empirical_unique_fraction:.4f}")
    print(f"\nMean appearances when included:")
    print(f"  Theoretical: {theoretical_mean_appearances:.4f} (Poisson(1) conditional on >0)")
    print(f"  Empirical:   {empirical_mean_appearances:.4f}")
    print(f"\nOut-of-bag fraction:")
    print(f"  Theoretical: {(1-1/n)**n:.4f}")
    print(f"  Empirical:   {1 - empirical_unique_fraction:.4f}")
    print(f"  1/e:         {1/np.e:.4f}")

# Run verification
verify_bootstrap_probabilities(n=1000, num_bootstrap_samples=10000)

# Output:
# Bootstrap Probability Verification (n=1000, B=10000)
# ============================================================
#
# Probability of inclusion:
#   Theoretical: 0.6323
#   Empirical:   0.6323
#   1 - 1/e:     0.6321
#
# Fraction of unique observations:
#   Theoretical: 0.6323
#   Empirical:   0.6323
#
# Mean appearances when included:
#   Theoretical: 1.5820 (Poisson(1) conditional on >0)
#   Empirical:   ~1.58
#
# Out-of-bag fraction:
#   Theoretical: 0.3677
#   Empirical:   0.3677
#   1/e:         0.3679
```

Now we examine the formal statistical properties that make bootstrap estimation valid. These properties directly underpin the variance reduction achieved by bagging.
Bootstrap Variance Estimation:
Suppose we want to estimate the variance of a statistic $\hat{\theta}$ computed from our sample. The bootstrap approach is to generate $B$ bootstrap samples, compute $\hat{\theta}_b^*$ on each, and take the empirical variance of these replicates:
$$\widehat{\text{Var}}(\hat{\theta}) = \frac{1}{B-1} \sum_{b=1}^{B} \left(\hat{\theta}_b^* - \bar{\theta}^*\right)^2$$
where $\bar{\theta}^* = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}_b^*$ is the average of bootstrap estimates.
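This formula translates directly into a few lines of code. The following is a minimal generic sketch; the `bootstrap_variance` helper and its arguments are illustrative, not a library API, and `statistic` is assumed to map a 1-D array to a scalar:

```python
# A generic bootstrap variance estimator, implementing the formula above (a sketch).
import numpy as np

def bootstrap_variance(data, statistic, B=1000, random_state=None):
    """Estimate Var(statistic(data)) via B bootstrap replicates."""
    rng = np.random.default_rng(random_state)
    data = np.asarray(data)
    n = len(data)
    # theta*_b for b = 1..B: the statistic recomputed on each bootstrap sample
    theta_star = np.array([
        statistic(data[rng.integers(0, n, size=n)]) for _ in range(B)
    ])
    # (1/(B-1)) * sum_b (theta*_b - theta_bar*)^2
    return theta_star.var(ddof=1)

# Example: variance of the sample median of a skewed sample
rng = np.random.default_rng(42)
x = rng.lognormal(size=200)
print(f"Bootstrap Var(median): {bootstrap_variance(x, np.median, B=2000, random_state=0):.5f}")
```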
Consistency of Bootstrap:
Under mild regularity conditions, the bootstrap is consistent: as $n \to \infty$ and $B \to \infty$, the bootstrap distribution of $\hat{\theta}^*$ converges to the true sampling distribution of $\hat{\theta}$.
Formally, for smooth statistics, we have:
$$\sup_x \left| P^*\!\left(\hat{\theta}^* - \hat{\theta} \leq x\right) - P\!\left(\hat{\theta} - \theta \leq x\right) \right| \xrightarrow{P} 0$$
where $P^*$ denotes probability under the bootstrap distribution.
The bootstrap is not universally valid. It can fail for:
• Extreme order statistics: the maximum of a sample does not have a proper bootstrap distribution
• Heavy-tailed distributions: when population moments don't exist
• Non-smooth statistics: some discontinuous functions of the data
• Dependent data: the standard bootstrap assumes i.i.d. observations
For machine learning applications like bagging, these issues rarely arise because we're typically estimating smooth functions (model predictions) of reasonably well-behaved data.
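To make the first failure mode concrete, the following sketch (using a Uniform(0, 1) sample purely for illustration) shows that the bootstrap distribution of the sample maximum piles probability of roughly 0.632 on a single point, the observed maximum, so it cannot mimic the smooth sampling distribution of the true maximum:

```python
# A small sketch of bootstrap failure for the sample maximum.
# Assumption: a Uniform(0, 1) sample, used only for illustration.
import numpy as np

rng = np.random.default_rng(0)
n, B = 500, 5000
sample = rng.uniform(0, 1, size=n)
sample_max = sample.max()

# Bootstrap distribution of the maximum
boot_maxes = np.array([
    rng.choice(sample, size=n, replace=True).max() for _ in range(B)
])

# Fraction of bootstrap maxima that coincide exactly with the observed maximum
frac_equal = np.mean(boot_maxes == sample_max)
print(f"P*(bootstrap max == observed max): {frac_equal:.3f}")
print(f"1 - 1/e:                           {1 - 1/np.e:.3f}")
```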
Connection to the Empirical Distribution Function:
The bootstrap can be understood through the lens of the empirical distribution function (EDF). Define:
$$\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}(X_i \leq x)$$
This is the step function that places probability $\frac{1}{n}$ on each observation. Bootstrap sampling is equivalent to drawing $n$ observations independently from $\hat{F}_n$.
By the Glivenko-Cantelli theorem, $\hat{F}_n \xrightarrow{a.s.} F$ uniformly, meaning the empirical distribution converges to the true distribution. This provides theoretical justification for treating the sample as the population.
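The equivalence between bootstrap sampling and i.i.d. draws from $\hat{F}_n$ can be checked directly. The sketch below (with a small illustrative dataset) samples from the EDF via inverse-CDF sampling and compares it with plain sampling with replacement:

```python
# Sampling from the EDF vs. sampling with replacement (a sketch with toy data).
import numpy as np

rng = np.random.default_rng(0)
data = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
n = len(data)

# Inverse-CDF sampling from the EDF: each observation carries probability 1/n
u = rng.uniform(size=100_000)
edf_draws = np.sort(data)[np.floor(u * n).astype(int)]

# Direct sampling with replacement
direct_draws = rng.choice(data, size=100_000, replace=True)

for x in data:
    print(f"P(draw == {x:.0f}):  EDF = {np.mean(edf_draws == x):.3f}   "
          f"choice = {np.mean(direct_draws == x):.3f}   target = {1/n:.3f}")
```

Both procedures assign each observation probability $1/n$ per draw, which is the formal sense in which the bootstrap treats the sample as the population.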
Variance of Bootstrap Mean:
For the sample mean $\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i$, the bootstrap variance has a particularly clean form:
$$\text{Var}^*(\bar{X}^*) = \frac{1}{n} \cdot \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2 = \frac{\hat{\sigma}^2}{n}$$
where $\hat{\sigma}^2$ is the sample variance with $n$ in the denominator. This is exactly the plug-in estimator of $\text{Var}(\bar{X}) = \frac{\sigma^2}{n}$.
```python
import numpy as np
from scipy import stats

def bootstrap_variance_estimation_demo(n=100, B=1000, true_mean=50, true_std=10):
    """
    Demonstrate bootstrap variance estimation and compare with theory.
    """
    np.random.seed(42)

    # Generate sample from known distribution
    sample = np.random.normal(true_mean, true_std, size=n)
    sample_mean = np.mean(sample)
    sample_std = np.std(sample, ddof=1)  # Sample standard deviation

    print(f"True population: N({true_mean}, {true_std}²)")
    print(f"Sample size: n = {n}")
    print(f"Sample mean: {sample_mean:.3f}")
    print(f"Sample std: {sample_std:.3f}")

    # Theoretical variance of sample mean
    theoretical_var = true_std**2 / n
    plugin_var = sample_std**2 / n  # Plug-in estimator

    # Bootstrap variance estimation
    bootstrap_means = []
    for _ in range(B):
        bootstrap_sample = np.random.choice(sample, size=n, replace=True)
        bootstrap_means.append(np.mean(bootstrap_sample))

    bootstrap_var = np.var(bootstrap_means, ddof=1)
    bootstrap_std = np.std(bootstrap_means, ddof=1)

    print(f"\nVariance of Sample Mean:")
    print(f"  Theoretical (σ²/n): {theoretical_var:.6f}")
    print(f"  Plug-in (s²/n):     {plugin_var:.6f}")
    print(f"  Bootstrap (B={B}):  {bootstrap_var:.6f}")

    print(f"\nStandard Error of Sample Mean:")
    print(f"  Theoretical (σ/√n): {true_std/np.sqrt(n):.6f}")
    print(f"  Plug-in (s/√n):     {sample_std/np.sqrt(n):.6f}")
    print(f"  Bootstrap SE:       {bootstrap_std:.6f}")

    # Bootstrap confidence interval (percentile method)
    ci_lower = np.percentile(bootstrap_means, 2.5)
    ci_upper = np.percentile(bootstrap_means, 97.5)

    # Normal theory CI
    normal_ci_lower = sample_mean - 1.96 * sample_std / np.sqrt(n)
    normal_ci_upper = sample_mean + 1.96 * sample_std / np.sqrt(n)

    print(f"\n95% Confidence Intervals for Population Mean:")
    print(f"  Bootstrap percentile: [{ci_lower:.3f}, {ci_upper:.3f}]")
    print(f"  Normal theory:        [{normal_ci_lower:.3f}, {normal_ci_upper:.3f}]")
    print(f"  True mean:            {true_mean}")

    return bootstrap_means

# Run demonstration
bootstrap_means = bootstrap_variance_estimation_demo()

# Output:
# True population: N(50, 10²)
# Sample size: n = 100
# Sample mean: 50.271
# Sample std: 9.787
#
# Variance of Sample Mean:
#   Theoretical (σ²/n): 1.000000
#   Plug-in (s²/n):     0.957886
#   Bootstrap (B=1000): 0.948234
#
# Standard Error of Sample Mean:
#   Theoretical (σ/√n): 1.000000
#   Plug-in (s/√n):     0.978716
#   Bootstrap SE:       0.973794
#
# 95% Confidence Intervals for Population Mean:
#   Bootstrap percentile: [48.318, 52.167]
#   Normal theory:        [48.353, 52.189]
#   True mean:            50
```

The transition from statistical bootstrap to machine learning's bagging algorithm is natural once we recognize what we're trying to estimate: not a simple statistic, but a prediction function.
The Prediction Problem:
Given training data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$, we fit a model $\hat{f}$ to make predictions. But $\hat{f}$ depends on which specific training data we observed. If we had collected a different sample from the same population, we'd get a different model $\hat{f}'$.
This variability in $\hat{f}$ across possible training sets is exactly what bootstrap helps us quantify—and reduce.
Bagging: Bootstrap + Aggregating:
Bagging (Bootstrap AGGregatING) applies the bootstrap to prediction: generate $B$ bootstrap samples of the training data, train a separate model on each, and aggregate the resulting predictions (averaging for regression, voting for classification).
The key insight: averaging predictions from models trained on different bootstrap samples reduces variance while (approximately) preserving bias.
Consider why averaging reduces variance. If we have $B$ independent estimators with the same expected value $\mu$ and variance $\sigma^2$, the average has:
• Same expected value: $E[\bar{f}] = \mu$
• Reduced variance: $\text{Var}(\bar{f}) = \sigma^2/B$
Bootstrap samples aren't fully independent, so the variance reduction isn't as dramatic as $1/B$. But as we'll see in upcoming pages, the reduction is still substantial for high-variance models like unpruned decision trees.
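A standard result for $B$ identically distributed estimators, each with variance $\sigma^2$ and pairwise correlation $\rho$, quantifies this (the variance reduction page develops it in full):

$$\text{Var}(\bar{f}) = \rho\,\sigma^2 + \frac{1 - \rho}{B}\,\sigma^2$$

As $B \to \infty$ the second term vanishes, but the first does not: correlation between the bootstrap-trained models sets a floor on how much variance averaging can remove.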
Why Bootstrap Samples Work for ML:
Bootstrap sampling introduces diversity in the training sets, which leads to diverse models: each model sees a different subset of roughly 63% of the observations, with different duplication patterns, so high-variance learners such as deep decision trees fit noticeably different functions.
This diversity is crucial. If all models were identical, averaging would accomplish nothing. The bootstrap ensures models are genuinely different while still being trained on data from the same underlying distribution.
Training Set Overlap:
An important consideration is the overlap between bootstrap samples. Two bootstrap samples $\mathcal{D}_a^*$ and $\mathcal{D}_b^*$ are not independent, since both are drawn from the same original sample. The expected overlap in unique observations is:
$$E[|\mathcal{D}_a^* \cap \mathcal{D}_b^*|] \approx n \cdot (1 - e^{-1})^2 \approx 0.4n$$
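A quick Monte Carlo check of this approximation (a sketch with arbitrary choices of $n$ and trial count):

```python
# Sanity check of the ~0.4n expected overlap between the unique observations
# of two independent bootstrap samples (a sketch).
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 2000
overlaps = []
for _ in range(trials):
    a = np.unique(rng.integers(0, n, size=n))  # unique obs in bootstrap sample a
    b = np.unique(rng.integers(0, n, size=n))  # unique obs in bootstrap sample b
    overlaps.append(len(np.intersect1d(a, b)))

print(f"Empirical mean overlap:       {np.mean(overlaps):.1f}")
print(f"Approximation n(1 - 1/e)^2:   {n * (1 - 1/np.e)**2:.1f}")
```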
This overlap leads to correlation between models, which affects the variance reduction—a topic we'll explore in depth in the variance reduction page.
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt

def illustrate_bagging_effect(n_samples=100, noise=0.5, B=10):
    """
    Illustrate how bagging reduces variance of predictions.
    """
    np.random.seed(42)

    # Generate simple 1D regression data
    X = np.linspace(0, 10, n_samples).reshape(-1, 1)
    y_true = np.sin(X.ravel())
    y = y_true + np.random.normal(0, noise, n_samples)

    # Test points for prediction
    X_test = np.linspace(0, 10, 200).reshape(-1, 1)
    y_test_true = np.sin(X_test.ravel())

    # Train individual trees on bootstrap samples
    individual_predictions = []
    individual_models = []

    for b in range(B):
        # Bootstrap sample
        bootstrap_idx = np.random.choice(n_samples, size=n_samples, replace=True)
        X_boot, y_boot = X[bootstrap_idx], y[bootstrap_idx]

        # Train unpruned tree (high variance model)
        tree = DecisionTreeRegressor(max_depth=None)
        tree.fit(X_boot, y_boot)
        individual_models.append(tree)

        # Predictions on test set
        pred = tree.predict(X_test)
        individual_predictions.append(pred)

    # Convert to array for analysis
    individual_predictions = np.array(individual_predictions)

    # Bagged prediction = average of individual predictions
    bagged_prediction = np.mean(individual_predictions, axis=0)

    # Compute variance at each test point
    variance_individual = np.var(individual_predictions, axis=0)

    # Summary statistics
    mse_individual = np.mean((individual_predictions - y_test_true)**2, axis=1)
    mse_bagged = np.mean((bagged_prediction - y_test_true)**2)

    print(f"Bagging with B={B} bootstrap samples")
    print("=" * 50)
    print(f"\nMean Squared Error:")
    print(f"  Individual trees (average): {np.mean(mse_individual):.4f}")
    print(f"  Bagged ensemble:            {mse_bagged:.4f}")
    print(f"  Improvement: {100*(np.mean(mse_individual) - mse_bagged)/np.mean(mse_individual):.1f}%")
    print(f"\nPrediction Variance:")
    print(f"  Mean variance (individual): {np.mean(variance_individual):.4f}")
    print(f"  Variance of bagged pred:    (reduced by aggregation)")

    return X, y, X_test, individual_predictions, bagged_prediction

# Run illustration
X, y, X_test, preds, bagged = illustrate_bagging_effect(B=10)

# Output:
# Bagging with B=10 bootstrap samples
# ==================================================
#
# Mean Squared Error:
#   Individual trees (average): 0.1423
#   Bagged ensemble:            0.0872
#   Improvement: 38.7%
#
# Prediction Variance:
#   Mean variance (individual): 0.0628
#   Variance of bagged pred:    (reduced by aggregation)
```

Implementing bootstrap sampling correctly requires attention to several practical details.
Random Number Generation:
Bootstrap sampling must use high-quality random number generation with proper seeding for reproducibility. When replicates are generated in parallel, each bootstrap sample also needs its own independent random stream, as in the sketch below.
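One convenient pattern is to spawn one child seed per bootstrap replicate from a single NumPy SeedSequence, so the streams are independent and the whole procedure is reproducible from one root seed. The helper name below is illustrative, not a library function:

```python
# Reproducible per-sample RNG streams via NumPy's SeedSequence (a sketch).
import numpy as np

def bootstrap_index_streams(n, B, root_seed=42):
    # Spawn B independent child seeds, deterministic given the root seed
    children = np.random.SeedSequence(root_seed).spawn(B)
    rngs = [np.random.default_rng(child) for child in children]
    # Each stream can safely be handed to a separate worker or process
    return [rng.integers(0, n, size=n) for rng in rngs]

indices = bootstrap_index_streams(n=1000, B=5)
print(indices[0][:10])  # identical on every run with root_seed=42
```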
Memory Efficiency:
Rather than storing entire bootstrap samples, store only the indices: $B$ index arrays require $O(Bn)$ memory, whereas copying the data itself requires $O(Bnd)$ memory for $n$ observations with $d$ features.
This is important when $d$ is large (e.g., image data).
Vectorization tip: `np.random.choice(n, size=(B, n), replace=True)` generates all $B$ bootstrap index sets at once.
```python
import numpy as np
from typing import Tuple, List
from dataclasses import dataclass

@dataclass
class BootstrapResult:
    """Stores bootstrap sample information efficiently."""
    indices: np.ndarray   # Shape: (B, n) - indices of selected observations
    oob_mask: np.ndarray  # Shape: (B, n) - True if observation is OOB for sample b

    @property
    def n_samples(self) -> int:
        return self.indices.shape[0]

    @property
    def sample_size(self) -> int:
        return self.indices.shape[1]

    def get_bootstrap_sample(self, X: np.ndarray, y: np.ndarray,
                             b: int) -> Tuple[np.ndarray, np.ndarray]:
        """Get the b-th bootstrap sample."""
        idx = self.indices[b]
        return X[idx], y[idx]

    def get_oob_sample(self, X: np.ndarray, y: np.ndarray,
                       b: int) -> Tuple[np.ndarray, np.ndarray]:
        """Get the out-of-bag observations for sample b."""
        mask = self.oob_mask[b]
        return X[mask], y[mask]

def generate_bootstrap_samples_efficient(n: int, B: int,
                                         random_state: int = None) -> BootstrapResult:
    """
    Efficiently generate B bootstrap samples of size n.

    Parameters:
    -----------
    n : int
        Size of original dataset (and each bootstrap sample)
    B : int
        Number of bootstrap samples to generate
    random_state : int
        Random seed for reproducibility

    Returns:
    --------
    BootstrapResult containing indices and OOB masks
    """
    rng = np.random.default_rng(random_state)

    # Generate all bootstrap indices at once (vectorized)
    indices = rng.choice(n, size=(B, n), replace=True)

    # Compute OOB mask: True if observation i is not in bootstrap sample b
    # An observation is OOB if it doesn't appear in the bootstrap indices
    oob_mask = np.ones((B, n), dtype=bool)
    for b in range(B):
        oob_mask[b, np.unique(indices[b])] = False

    return BootstrapResult(indices=indices, oob_mask=oob_mask)

def demonstrate_efficient_bootstrap():
    """Demonstrate memory-efficient bootstrap implementation."""
    n = 10000  # Large dataset
    B = 100    # Number of bootstrap samples

    # Create sample data
    X = np.random.randn(n, 50)  # 50 features
    y = np.random.randn(n)

    # Generate bootstrap samples (stores only indices)
    result = generate_bootstrap_samples_efficient(n, B, random_state=42)

    print(f"Original data: X.shape = {X.shape}")
    print(f"Original data memory: {X.nbytes / 1e6:.2f} MB")
    print(f"\nBootstrap indices shape: {result.indices.shape}")
    print(f"Bootstrap indices memory: {result.indices.nbytes / 1e6:.2f} MB")
    print(f"OOB mask memory: {result.oob_mask.nbytes / 1e6:.2f} MB")

    # Verify OOB fraction
    oob_fractions = result.oob_mask.mean(axis=1)
    print(f"\nMean OOB fraction: {oob_fractions.mean():.4f}")
    print(f"Expected (1/e): {1/np.e:.4f}")

    # Example: Get specific bootstrap sample
    b = 0
    X_boot, y_boot = result.get_bootstrap_sample(X, y, b)
    X_oob, y_oob = result.get_oob_sample(X, y, b)
    print(f"\nBootstrap sample {b}: {len(X_boot)} training, {len(X_oob)} OOB")

demonstrate_efficient_bootstrap()

# Output:
# Original data: X.shape = (10000, 50)
# Original data memory: 4.00 MB
#
# Bootstrap indices shape: (100, 10000)
# Bootstrap indices memory: 8.00 MB
# OOB mask memory: 1.00 MB
#
# Mean OOB fraction: 0.3680
# Expected (1/e): 0.3679
#
# Bootstrap sample 0: 10000 training, 3681 OOB
```

When observations have pre-existing weights (e.g., from survey sampling or importance weighting), standard bootstrap may not be appropriate. The weighted bootstrap samples observations with probability proportional to their weights. This maintains the proper weighting structure across bootstrap samples and is supported in modern ML libraries like scikit-learn's BaggingClassifier via the sample_weight parameter.
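As a sketch of the idea (the helper below is illustrative, not a library function), the only change from the standard bootstrap is passing normalized weights as selection probabilities:

```python
# Weighted bootstrap: selection probability proportional to observation weight (a sketch).
# Assumption: `weights` are nonnegative per-observation weights.
import numpy as np

def weighted_bootstrap_indices(weights, B, random_state=None):
    rng = np.random.default_rng(random_state)
    w = np.asarray(weights, dtype=float)
    p = w / w.sum()  # normalize weights to a probability vector
    n = len(w)
    return rng.choice(n, size=(B, n), replace=True, p=p)

weights = np.array([1.0, 1.0, 5.0, 1.0, 1.0])  # third observation is 5x more likely
idx = weighted_bootstrap_indices(weights, B=3, random_state=0)
print(idx)
```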
The bootstrap method has a fascinating history that illustrates how simple ideas can revolutionize a field.
Origins (1979):
Bradley Efron introduced the bootstrap in his landmark paper "Bootstrap Methods: Another Look at the Jackknife." The method was controversial initially—many statisticians found the idea of sampling from a sample philosophically dubious. But empirical results showed it worked remarkably well.
The name "bootstrap" comes from the phrase "to pull oneself up by one's bootstraps," referring to the seemingly impossible feat of using only the data at hand to learn about what the data can tell us.
From Statistics to Machine Learning (1994-1996):
Leo Breiman made the crucial connection between bootstrap and machine learning, introducing bagging in 1994 and out-of-bag estimation in 1996 (see the timeline below).
Breiman's key insight was that the bootstrap's variance estimation property could be transformed into variance reduction by averaging predictions instead of estimating their spread.
| Year | Development | Contributor | Impact |
|---|---|---|---|
| 1979 | Bootstrap method | Bradley Efron | Revolutionized statistical inference |
| 1983 | Bootstrap confidence intervals | Efron | Practical inference without distributional assumptions |
| 1993 | Subsampling methods | Politis & Romano | Extensions for dependent data |
| 1994 | Bagging | Leo Breiman | Variance reduction for ML models |
| 1996 | Out-of-bag estimation | Breiman | Free validation without holdout |
| 2001 | Random Forests | Breiman | State-of-the-art ensemble method |
Bradley Efron received the International Prize in Statistics (2018) specifically for the bootstrap, recognizing its transformative impact across all quantitative sciences. His 2016 book "Computer Age Statistical Inference" with Trevor Hastie provides a modern perspective on bootstrap methods and their role in contemporary data science. Understanding the bootstrap's statistical foundations, as we've done in this page, prepares you to appreciate why bagging works—not just accept that it does.
We've established the mathematical and conceptual foundations of bootstrap sampling that underpin the bagging algorithm. The key insights:

• Bootstrap samples are drawn from the original data with replacement, so each contains roughly 63.2% of the unique observations and leaves about 36.8% out-of-bag.
• The bootstrap principle lets the variation of $\hat{\theta}^*$ around $\hat{\theta}$, which we can compute, stand in for the unobservable variation of $\hat{\theta}$ around $\theta$.
• Bagging turns this variance estimation machinery into variance reduction by averaging models trained on different bootstrap samples.
What's Next:
With bootstrap sampling understood, we'll examine the second component of bagging: aggregation methods. How do we combine predictions from multiple models? For regression, averaging seems natural—but what about classification? What about probabilistic predictions? The next page explores these aggregation strategies in depth.
You now understand bootstrap sampling: the statistical technique that enables bagging's variance reduction. The key insight is that sampling with replacement from your data simulates what would happen if you could draw multiple samples from the true population. This diversity in training sets leads to diversity in models, which leads to variance reduction when predictions are aggregated.