In 1979, Bradley Efron introduced a deceptively simple idea that would transform statistical inference: what if we could estimate the sampling distribution of any statistic by resampling from our observed data? This idea—the bootstrap—eliminated the need for complex analytical derivations and opened the door to statistical inference for virtually any quantity of interest.
The term 'bootstrap' comes from the phrase 'pulling oneself up by one's bootstraps,' a reference to something seemingly impossible. Indeed, the bootstrap appears to create information from nothing: we have only one dataset, yet we can estimate the variability of statistics as if we had collected many datasets. This apparent paradox dissolves when we understand the bootstrap's theoretical foundation.
By the end of this page, you will understand: (1) The theoretical foundations of bootstrap resampling, including the plug-in principle and empirical distribution approximation; (2) The mechanics of generating bootstrap samples and computing bootstrap statistics; (3) The asymptotic properties that justify bootstrap inference; (4) Common pitfalls and when bootstrap methods fail; (5) Practical implementation for machine learning model evaluation.
Statistical inference centers on a fundamental challenge: we observe one sample from a population but must make statements about population parameters. The inherent uncertainty in our estimates stems from the fact that a different sample would yield different estimates—this variability is captured by the sampling distribution.
The Classical Approach:
Traditionally, statisticians derived sampling distributions analytically. For the sample mean $\bar{X}$ from a normal population with known variance $\sigma^2$, we know:
$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$
This elegant result allows us to construct confidence intervals and conduct hypothesis tests. However, the analytical approach has severe limitations: it relies on distributional assumptions, it requires that the mathematics be tractable, and for many statistics of practical interest (medians, ratios, correlation coefficients, model performance metrics) no closed-form sampling distribution exists.
The bootstrap takes an empirical approach: rather than deriving the sampling distribution mathematically, we simulate it by treating our sample as if it were the population. By repeatedly sampling from our observed data, we can approximate the sampling distribution of any statistic, however complex.
The Key Insight:
Let $F$ be the true (unknown) population distribution and $\hat{F}_n$ be the empirical distribution function—the discrete distribution placing probability $1/n$ on each observed data point. The bootstrap substitutes $\hat{F}_n$ for $F$ whenever the sampling distribution depends on the unknown population.
If a statistic has a sampling distribution determined by the unknown $F$, the bootstrap approximates that distribution by recomputing the statistic on samples simulated from $\hat{F}_n$. This is the plug-in principle: replace unknown quantities with their empirical estimates and simulate.
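To make this concrete, here is a minimal sketch (with made-up numbers) of the empirical distribution: $\hat{F}_n$ is the step function that jumps by $1/n$ at each observation, and drawing from it is exactly sampling the data with replacement.

```python
import numpy as np

# Toy data: five observations (hypothetical values for illustration)
x = np.array([2.1, 3.5, 0.7, 4.2, 1.9])
n = len(x)

def ecdf(t: float) -> float:
    """Empirical distribution function: F_hat(t) = (# of x_i <= t) / n."""
    return np.mean(x <= t)

print(ecdf(2.0))   # 0.4 -> two of the five observations are <= 2.0

# Sampling from F_hat is exactly sampling with replacement from the data:
rng = np.random.default_rng(0)
bootstrap_sample = rng.choice(x, size=n, replace=True)
print(bootstrap_sample)
```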
The bootstrap algorithm is remarkably simple to implement, belying its deep theoretical foundations. Let $X = (X_1, X_2, \ldots, X_n)$ be our observed sample and $\hat{\theta} = s(X)$ be a statistic of interest computed from this sample.
The Non-Parametric Bootstrap Algorithm:
Generate a bootstrap sample: Draw $n$ observations with replacement from the original sample $X$, creating $X^* = (X_1^*, X_2^*, \ldots, X_n^*)$
Compute the bootstrap statistic: Calculate $\hat{\theta}^* = s(X^*)$ on the bootstrap sample
Repeat: Perform steps 1-2 a total of $B$ times, generating bootstrap statistics $\hat{\theta}^*_1, \hat{\theta}^*_2, \ldots, \hat{\theta}^*_B$
Analyze the bootstrap distribution: Use the collection of $\hat{\theta}^*_b$ values to estimate properties of the sampling distribution
```python
import numpy as np
from typing import Callable, Tuple


def bootstrap_resample(data: np.ndarray,
                       statistic_fn: Callable[[np.ndarray], float],
                       n_bootstrap: int = 10000,
                       random_state: int = None) -> np.ndarray:
    """
    Perform non-parametric bootstrap resampling.

    Parameters
    ----------
    data : np.ndarray
        Original sample data of shape (n_samples,) or (n_samples, n_features)
    statistic_fn : Callable
        Function that computes the statistic of interest from a sample
    n_bootstrap : int
        Number of bootstrap iterations (B)
    random_state : int, optional
        Random seed for reproducibility

    Returns
    -------
    np.ndarray
        Array of B bootstrap statistics

    The algorithm:
    1. For each bootstrap iteration b = 1, ..., B:
       a. Draw n samples WITH REPLACEMENT from data
       b. Compute statistic on resampled data
    2. Return all B statistics
    """
    rng = np.random.RandomState(random_state)
    n_samples = len(data)
    bootstrap_statistics = np.zeros(n_bootstrap)

    for b in range(n_bootstrap):
        # Step 1: Sample WITH REPLACEMENT
        # Each observation has probability 1/n of being selected each time
        indices = rng.choice(n_samples, size=n_samples, replace=True)
        bootstrap_sample = data[indices]

        # Step 2: Compute statistic on bootstrap sample
        bootstrap_statistics[b] = statistic_fn(bootstrap_sample)

    return bootstrap_statistics


def compute_bootstrap_standard_error(data: np.ndarray,
                                     statistic_fn: Callable[[np.ndarray], float],
                                     n_bootstrap: int = 10000) -> Tuple[float, float]:
    """
    Compute bootstrap estimate of standard error.

    This is the most basic bootstrap application: estimating the
    standard deviation of the sampling distribution.

    Returns
    -------
    Tuple[float, float]
        (original_statistic, bootstrap_standard_error)
    """
    # Original statistic
    theta_hat = statistic_fn(data)

    # Bootstrap distribution
    theta_stars = bootstrap_resample(data, statistic_fn, n_bootstrap)

    # Bootstrap SE is just the standard deviation of bootstrap statistics
    se_boot = np.std(theta_stars, ddof=1)

    return theta_hat, se_boot


# Example: Bootstrap SE of the sample mean
if __name__ == "__main__":
    # Generate sample data
    np.random.seed(42)
    sample = np.random.exponential(scale=2.0, size=50)

    # Statistic: sample mean
    theta_hat, se_boot = compute_bootstrap_standard_error(
        sample, statistic_fn=np.mean, n_bootstrap=10000
    )

    # Compare with analytical SE (known for the mean)
    se_analytical = np.std(sample, ddof=1) / np.sqrt(len(sample))

    print(f"Sample mean: {theta_hat:.4f}")
    print(f"Bootstrap SE: {se_boot:.4f}")
    print(f"Analytical SE: {se_analytical:.4f}")
    print(f"Ratio (Boot/Analytical): {se_boot/se_analytical:.4f}")
```

The 'with replacement' aspect is crucial. Since we're treating the empirical distribution as the population, sampling with replacement is equivalent to drawing from this discrete distribution. Without replacement, we'd just get permutations of our sample—every bootstrap sample would have exactly the same composition, just in different order.
A critical property of bootstrap sampling concerns which observations appear in each bootstrap sample. When we draw $n$ samples with replacement from $n$ observations, not all original observations will appear. Understanding this probability structure is essential for methods like the out-of-bootstrap estimator.
Probability of Being Included:
Consider a specific observation $X_i$. The probability that $X_i$ is not selected in a single draw is $(n-1)/n$. Over $n$ independent draws:
$$P(X_i \text{ not in bootstrap sample}) = \left(\frac{n-1}{n}\right)^n = \left(1 - \frac{1}{n}\right)^n$$
As $n \to \infty$, this converges to a famous limit:
$$\lim_{n \to \infty} \left(1 - \frac{1}{n}\right)^n = e^{-1} \approx 0.368$$
This means approximately 36.8% of observations are excluded from each bootstrap sample.
| Sample Size (n) | Exact P(excluded) | Percent Excluded | Observations Excluded (Expected) |
|---|---|---|---|
| 10 | 0.3487 | 34.87% | ~3.5 observations |
| 25 | 0.3604 | 36.04% | ~9 observations |
| 50 | 0.3642 | 36.42% | ~18 observations |
| 100 | 0.3660 | 36.60% | ~37 observations |
| 500 | 0.3675 | 36.75% | ~184 observations |
| 1000 | 0.3677 | 36.77% | ~368 observations |
| ∞ | 0.3679 (1/e) | 36.79% | ~36.79% of n |
Expected Number of Unique Observations:
Conversely, the expected proportion of unique observations in a bootstrap sample is:
$$E\left[\frac{\text{unique obs}}{n}\right] = 1 - \left(1 - \frac{1}{n}\right)^n \approx 1 - e^{-1} \approx 0.632$$
This 0.632 figure is not merely a theoretical curiosity: it directly motivates the .632 bootstrap estimator we'll study later. The included observations contribute to training, while excluded observations serve as natural test sets.
```python
import numpy as np


def analyze_bootstrap_inclusion(n_samples: int,
                                n_bootstrap: int = 10000,
                                random_state: int = 42) -> dict:
    """
    Empirically verify bootstrap inclusion probabilities.

    Parameters
    ----------
    n_samples : int
        Size of the dataset
    n_bootstrap : int
        Number of bootstrap samples to generate
    random_state : int
        Random seed

    Returns
    -------
    dict
        Statistics about inclusion patterns
    """
    rng = np.random.RandomState(random_state)

    # Track inclusion across bootstrap samples
    inclusion_counts = np.zeros((n_bootstrap, n_samples))
    unique_counts = np.zeros(n_bootstrap)

    for b in range(n_bootstrap):
        # Generate bootstrap indices
        indices = rng.choice(n_samples, size=n_samples, replace=True)

        # Count unique observations
        unique_counts[b] = len(np.unique(indices))

        # Track which observations were included
        for idx in indices:
            inclusion_counts[b, idx] += 1

    # Theoretical values
    theoretical_exclusion = (1 - 1/n_samples)**n_samples
    theoretical_inclusion = 1 - theoretical_exclusion

    # Empirical values
    empirical_unique_mean = np.mean(unique_counts)
    empirical_unique_std = np.std(unique_counts)

    # For each bootstrap sample, count how many observations were excluded
    never_included_per_bootstrap = np.sum(inclusion_counts == 0, axis=1)
    empirical_exclusion_rate = np.mean(never_included_per_bootstrap) / n_samples

    return {
        'n_samples': n_samples,
        'theoretical_exclusion': theoretical_exclusion,
        'theoretical_inclusion': theoretical_inclusion,
        'empirical_exclusion_rate': empirical_exclusion_rate,
        'mean_unique_obs': empirical_unique_mean,
        'std_unique_obs': empirical_unique_std,
        'theoretical_unique': n_samples * theoretical_inclusion,
    }


# Demonstrate the 1/e phenomenon
if __name__ == "__main__":
    import math

    print("Bootstrap Inclusion Probability Analysis")
    print("=" * 60)
    print(f"Theoretical limit (n→∞): P(excluded) = 1/e ≈ {1/math.e:.6f}")
    print()

    for n in [10, 50, 100, 500, 1000]:
        results = analyze_bootstrap_inclusion(n, n_bootstrap=5000)
        print(f"n = {n:4d}: Theoretical = {results['theoretical_exclusion']:.4f}, "
              f"Empirical = {results['empirical_exclusion_rate']:.4f}, "
              f"Unique obs = {results['mean_unique_obs']:.1f} ± {results['std_unique_obs']:.1f}")
```

The included observations don't appear exactly once—some appear multiple times. On average, each included observation appears 1/(1-e^{-1}) ≈ 1.58 times. This means bootstrap samples are not just subsets but weighted versions of the original data, which affects how we interpret bootstrap-based estimates.
The bootstrap's validity rests on deep results from mathematical statistics. Understanding these foundations helps us recognize when bootstrap methods will succeed and when they may fail.
The Glivenko-Cantelli Theorem:
The empirical distribution function $\hat{F}_n(x)$ converges uniformly to the true distribution $F(x)$ almost surely:
$$\sup_x |\hat{F}_n(x) - F(x)| \xrightarrow{a.s.} 0 \text{ as } n \to \infty$$
This theorem establishes that the empirical distribution is a consistent estimator of the true distribution in the strongest sense. As our sample size grows, our plug-in approximation becomes arbitrarily accurate.
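A quick numerical illustration (a sketch, taking a standard Normal as the known $F$) shows this uniform distance shrinking as $n$ grows:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Measure sup_x |F_n(x) - F(x)| for samples from a standard Normal
for n in [50, 500, 5000, 50000]:
    sample = rng.standard_normal(n)
    grid = np.sort(sample)
    # Empirical CDF evaluated at the sorted sample points
    F_n = np.arange(1, n + 1) / n
    # True CDF at the same points
    F = stats.norm.cdf(grid)
    # The supremum is attained at (or just before) one of the jump points
    sup_diff = np.max(np.maximum(np.abs(F_n - F), np.abs(F_n - 1 / n - F)))
    print(f"n = {n:6d}: sup |F_n - F| = {sup_diff:.4f}")
```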
Bootstrap Consistency:
For the bootstrap to be valid, we need the bootstrap distribution to converge to the true sampling distribution. Formally, let $\hat{\theta}_n$ be our statistic and $\hat{\theta}_n^*$ be the bootstrap version. The bootstrap is consistent if:
$$\sup_t \left| P^*\!\left(\sqrt{n}(\hat{\theta}_n^* - \hat{\theta}_n) \leq t\right) - P\!\left(\sqrt{n}(\hat{\theta}_n - \theta) \leq t\right) \right| \xrightarrow{P} 0$$
where $P^*$ denotes probability under the bootstrap distribution (conditional on the observed data).
The bootstrap is NOT universally valid. It can fail catastrophically for: (1) Extreme value statistics like the sample maximum—the bootstrap cannot generate values beyond the observed range; (2) Infinite variance distributions where the CLT doesn't apply; (3) Non-i.i.d. data without appropriate modifications (block bootstrap for time series); (4) Statistics whose limiting distribution depends on local properties of F that the empirical distribution cannot capture.
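Failure mode (1) is easy to demonstrate. In the sketch below (hypothetical Uniform data), the bootstrap distribution of the sample maximum places probability roughly $1 - 1/e \approx 0.632$ on the single observed maximum, so it cannot mimic the continuous limiting distribution of the true maximum:

```python
import numpy as np

# Sketch: bootstrap failure for the sample maximum.
# A bootstrap sample can never exceed the observed maximum, and the observed
# maximum itself appears in a bootstrap sample with probability 1 - (1 - 1/n)^n.
rng = np.random.default_rng(0)
n, B = 100, 10000
sample = rng.uniform(0, 1, size=n)
observed_max = sample.max()

boot_maxes = np.array([
    rng.choice(sample, size=n, replace=True).max() for _ in range(B)
])

print(f"Observed maximum:                  {observed_max:.4f}")
print(f"P*(bootstrap max == observed max): {np.mean(boot_maxes == observed_max):.3f}")
print(f"Theoretical limit 1 - 1/e:         {1 - np.exp(-1):.3f}")
```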
The Plug-in Principle in Detail:
The bootstrap fundamentally relies on the plug-in principle: if a parameter $\theta = \theta(F)$ is a functional of the distribution, we estimate it by $\hat{\theta} = \theta(\hat{F}_n)$. Similarly, if the standard error of $\hat{\theta}$ depends on $F$, we estimate it by plugging in $\hat{F}_n$.
For machine learning, this principle extends naturally. If we want to estimate the variance of a test error estimate, we plug in the empirical distribution: resample the evaluation data, recompute the error metric on each resample, and take the variance of the resulting values.
The beauty of this approach is its generality—it works for any statistic, any model, any evaluation metric, without requiring analytical derivations.
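For instance, a minimal sketch (with hypothetical numbers standing in for a fixed test set and a fitted model's predictions) applies the plug-in principle to a test-set MSE by resampling the per-example errors:

```python
import numpy as np

# Sketch: bootstrap the uncertainty of a test-error estimate.
# y_true / y_pred stand in for held-out labels and a fitted model's predictions.
rng = np.random.default_rng(7)
y_true = rng.normal(size=200)
y_pred = y_true + rng.normal(scale=0.5, size=200)

squared_errors = (y_true - y_pred) ** 2
test_mse = squared_errors.mean()

# Plug-in step: resample the per-example errors and recompute the metric
B = 5000
n = len(squared_errors)
boot_mse = np.array([
    rng.choice(squared_errors, size=n, replace=True).mean() for _ in range(B)
])

print(f"Test MSE:          {test_mse:.4f}")
print(f"Bootstrap SE:      {boot_mse.std(ddof=1):.4f}")
print(f"95% percentile CI: [{np.percentile(boot_mse, 2.5):.4f}, "
      f"{np.percentile(boot_mse, 97.5):.4f}]")
```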
In machine learning, we face a specific form of the statistical inference problem: estimating the generalization performance of a model trained on finite data. The bootstrap provides a powerful framework for this task, with several variants suited to different scenarios.
The Basic Bootstrap Error Estimate:
A naive application of the bootstrap to model evaluation proceeds as follows: draw a bootstrap sample, train the model on it, evaluate the trained model on the original observations, and average the resulting errors over $B$ iterations.
However, this estimates training error, not generalization error—the same observations used for training are used for evaluation. The result is severely optimistically biased.
Evaluating on the same data used for training gives optimistically biased error estimates. For flexible models that can interpolate the training data, this 'resubstitution error' can approach zero even when the true generalization error is substantial. The bootstrap's value for ML comes from leveraging the OUT-OF-BOOTSTRAP samples for evaluation.
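The optimism is easy to reproduce. The sketch below (hypothetical data, an unpruned decision tree) shows the resubstitution error collapsing toward zero while the held-out error remains large:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression

# Sketch: resubstitution error vs. held-out error for a flexible model
X, y = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

resub_mse = np.mean((y_train - model.predict(X_train)) ** 2)
holdout_mse = np.mean((y_test - model.predict(X_test)) ** 2)

print(f"Resubstitution MSE (evaluate on training data): {resub_mse:.2f}")   # ~0
print(f"Held-out MSE (honest estimate):                 {holdout_mse:.2f}")  # much larger
```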
The Leave-One-Out Bootstrap (LOOB) / Out-of-Bootstrap Error:
The key insight is that the ~36.8% of observations NOT in a bootstrap sample can serve as a test set. For each bootstrap iteration, train the model on the bootstrap sample and evaluate it only on the observations left out of that sample; the final estimate averages these out-of-bag errors across iterations.
This gives an approximately unbiased estimate of generalization error, but it has its own issues—the training set size is effectively reduced (to ~63.2% of n on average), causing pessimistic bias.
```python
import numpy as np
from sklearn.base import clone
from typing import Callable


def out_of_bootstrap_error(X: np.ndarray,
                           y: np.ndarray,
                           model,
                           n_bootstrap: int = 200,
                           scoring_fn: Callable = None,
                           random_state: int = 42) -> dict:
    """
    Compute the Out-of-Bootstrap (OOB) error estimate.

    For each bootstrap iteration:
    1. Create bootstrap sample (with replacement)
    2. Train model on bootstrap sample
    3. Evaluate on observations NOT in bootstrap sample

    Parameters
    ----------
    X : np.ndarray
        Feature matrix of shape (n_samples, n_features)
    y : np.ndarray
        Target vector of shape (n_samples,)
    model : sklearn estimator
        Model with fit() and predict() methods
    n_bootstrap : int
        Number of bootstrap iterations
    scoring_fn : Callable, optional
        Custom scoring function(y_true, y_pred) -> float
        Default: mean squared error for regression
    random_state : int
        Random seed

    Returns
    -------
    dict
        Dictionary containing OOB error estimate and diagnostics
    """
    if scoring_fn is None:
        # Default: MSE for regression
        scoring_fn = lambda y_true, y_pred: np.mean((y_true - y_pred)**2)

    rng = np.random.RandomState(random_state)
    n_samples = len(y)

    # Track predictions for each observation when it's OOB
    oob_predictions = {i: [] for i in range(n_samples)}
    bootstrap_errors = []
    oob_sizes = []

    for b in range(n_bootstrap):
        # Generate bootstrap indices
        bootstrap_indices = rng.choice(n_samples, size=n_samples, replace=True)

        # Identify in-bag and out-of-bag observations
        in_bag = set(bootstrap_indices)
        out_of_bag = [i for i in range(n_samples) if i not in in_bag]

        if len(out_of_bag) == 0:
            continue  # Rare case: all observations selected

        oob_sizes.append(len(out_of_bag))

        # Train on bootstrap sample
        X_train = X[bootstrap_indices]
        y_train = y[bootstrap_indices]
        model_clone = clone(model)
        model_clone.fit(X_train, y_train)

        # Predict on OOB observations
        X_oob = X[out_of_bag]
        y_oob_true = y[out_of_bag]
        y_oob_pred = model_clone.predict(X_oob)

        # Store predictions for aggregation
        for i, obs_idx in enumerate(out_of_bag):
            oob_predictions[obs_idx].append(y_oob_pred[i])

        # Compute error for this bootstrap iteration
        bootstrap_errors.append(scoring_fn(y_oob_true, y_oob_pred))

    # Aggregate OOB predictions per observation
    aggregated_predictions = []
    aggregated_true = []
    prediction_counts = []

    for i in range(n_samples):
        if len(oob_predictions[i]) > 0:
            aggregated_predictions.append(np.mean(oob_predictions[i]))
            aggregated_true.append(y[i])
            prediction_counts.append(len(oob_predictions[i]))

    # Final OOB error estimate (using aggregated predictions)
    oob_error = scoring_fn(
        np.array(aggregated_true),
        np.array(aggregated_predictions)
    )

    return {
        'oob_error': oob_error,
        'oob_error_std': np.std(bootstrap_errors),
        'mean_bootstrap_error': np.mean(bootstrap_errors),
        'mean_oob_size': np.mean(oob_sizes),
        'mean_oob_fraction': np.mean(oob_sizes) / n_samples,
        'mean_predictions_per_obs': np.mean(prediction_counts),
        'observations_with_predictions': len(aggregated_true),
    }


# Example usage
if __name__ == "__main__":
    from sklearn.linear_model import Ridge
    from sklearn.datasets import make_regression

    # Generate sample data
    X, y = make_regression(n_samples=200, n_features=10, noise=10, random_state=42)

    # Compute OOB error
    model = Ridge(alpha=1.0)
    results = out_of_bootstrap_error(X, y, model, n_bootstrap=500)

    print("Out-of-Bootstrap Error Estimation")
    print("=" * 50)
    print(f"OOB Error (MSE): {results['oob_error']:.4f}")
    print(f"OOB Error Std: {results['oob_error_std']:.4f}")
    print(f"Mean OOB fraction: {results['mean_oob_fraction']:.4f}")
    print(f"Mean predictions per observation: {results['mean_predictions_per_obs']:.1f}")
    print(f"Observations covered: {results['observations_with_predictions']}/{len(y)}")
```

A practical question: how many bootstrap samples $B$ should we generate? The answer depends on what quantity we're estimating and the precision required.
For Standard Error Estimation:
The bootstrap estimate of standard error is $\hat{SE}_B = \sqrt{\frac{1}{B-1}\sum_{b=1}^{B}(\hat{\theta}^*_b - \bar{\theta}^*)^2}$, where $\bar{\theta}^* = \frac{1}{B}\sum_{b=1}^{B}\hat{\theta}^*_b$.
The Monte Carlo error in this estimate (error due to using finite $B$) has approximate coefficient of variation:
$$CV(\hat{SE}_B) \approx \frac{1}{\sqrt{2B}}$$
For $B = 200$, the CV is about 5%—the bootstrap SE is estimated with ~5% relative error. For most practical purposes, $B = 200$–$1000$ is sufficient for standard errors.
| Application | Recommended B | Rationale |
|---|---|---|
| Standard Error Estimation | 200–500 | CV ≈ 3–5% is adequate for most uses |
| Confidence Intervals (Percentile) | 1,000–2,000 | Need accurate percentile estimation at tails |
| Bias Estimation | 500–1,000 | Bias requires stable mean estimation |
| BCa Confidence Intervals | 2,000–5,000 | Acceleration parameter estimation is sensitive |
| Machine Learning Model Comparison | 500–2,000 | Need stable ranking of models |
| Publication-Quality Results | 10,000+ | Minimize Monte Carlo variability |
For exploratory analysis and model development, B = 200–500 is typically sufficient. For final reported results, consider B = 2000 or more. When in doubt, run with increasing B values and check for stability: if results change substantially when doubling B, you need more iterations.
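A minimal sketch of that stability check (illustrative data, bootstrapping the SE of a sample mean) might look like this:

```python
import numpy as np

# Sketch: check Monte Carlo stability of the bootstrap SE by doubling B
rng = np.random.default_rng(3)
data = rng.exponential(scale=2.0, size=50)
n = len(data)

def bootstrap_se(B: int) -> float:
    boot_means = np.array([
        rng.choice(data, size=n, replace=True).mean() for _ in range(B)
    ])
    return boot_means.std(ddof=1)

for B in [200, 400, 800, 1600, 3200]:
    print(f"B = {B:5d}: bootstrap SE of the mean = {bootstrap_se(B):.4f}")
# If successive values agree to the precision you care about, B is large enough.
```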
Computational Considerations:
The total computational cost scales as $O(B \times \text{cost of fitting and evaluating one model})$. For expensive models (deep neural networks, large gradient boosting ensembles), this cost can be prohibitive.
Strategies for reducing the computational burden include:
- Using a modest $B$ (e.g., 200) during development and a larger $B$ only for final reported results.
- Parallelizing across iterations: bootstrap replicates are independent, so they parallelize trivially.
- Vectorizing the resampling for cheap statistics (see the sketch below).
- Moving expensive, data-independent preprocessing outside the bootstrap loop.
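As an example of the vectorization strategy, the following sketch replaces the Python-level loop with a single $(B, n)$ index array; this only pays off for statistics that are cheap to compute row-wise:

```python
import numpy as np

# Sketch: vectorized bootstrap for cheap statistics (here, the mean)
rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=500)
n, B = len(data), 10_000

indices = rng.integers(0, n, size=(B, n))   # B bootstrap index vectors at once
boot_means = data[indices].mean(axis=1)     # statistic computed row-wise

print(f"Bootstrap SE of the mean: {boot_means.std(ddof=1):.4f}")
# Note: data[indices] materializes a (B, n) array; for very large n or B,
# process the iterations in chunks instead.
```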
The standard non-parametric bootstrap assumes i.i.d. observations. When this assumption is violated, we need modified approaches.
Paired Data Bootstrap:
When observations are paired (e.g., before-after measurements, matched case-control), we must resample pairs together to preserve the dependence structure.
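A minimal sketch (hypothetical before/after data) of the paired bootstrap, resampling whole rows so each pair stays intact:

```python
import numpy as np

# Sketch: paired bootstrap for before/after measurements.
# Rows (pairs) are resampled together, preserving the within-pair dependence.
rng = np.random.default_rng(1)
before = rng.normal(loc=10.0, scale=2.0, size=40)
after = before + rng.normal(loc=-1.0, scale=1.0, size=40)   # correlated with 'before'
pairs = np.column_stack([before, after])                    # shape (n_pairs, 2)

n_pairs, B = len(pairs), 5000
boot_mean_diff = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n_pairs, size=n_pairs)   # resample whole rows
    resampled = pairs[idx]
    boot_mean_diff[b] = (resampled[:, 1] - resampled[:, 0]).mean()

print(f"Mean change (after - before): {(after - before).mean():.3f}")
print(f"Bootstrap SE of the change:   {boot_mean_diff.std(ddof=1):.3f}")
```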
Clustered Data Bootstrap:
With grouped or clustered data (patients within hospitals, students within schools), resample at the cluster level—select clusters with replacement, then include all observations within selected clusters.
Stratified Bootstrap:
When samples are stratified (e.g., different demographic groups), resample within strata separately to preserve the stratification structure.
```python
import numpy as np


def stratified_bootstrap(data: np.ndarray,
                         strata: np.ndarray,
                         statistic_fn,
                         n_bootstrap: int = 1000,
                         random_state: int = 42) -> np.ndarray:
    """
    Perform stratified bootstrap resampling.

    Resamples WITHIN each stratum separately, preserving the
    stratification structure.

    Parameters
    ----------
    data : np.ndarray
        Data array of shape (n_samples,) or (n_samples, n_features)
    strata : np.ndarray
        Stratum labels for each observation, shape (n_samples,)
    statistic_fn : Callable
        Function computing statistic from resampled data
    n_bootstrap : int
        Number of bootstrap iterations
    random_state : int
        Random seed

    Returns
    -------
    np.ndarray
        Array of B bootstrap statistics
    """
    rng = np.random.RandomState(random_state)
    unique_strata = np.unique(strata)

    # Indices for each stratum
    stratum_indices = {s: np.where(strata == s)[0] for s in unique_strata}

    bootstrap_stats = np.zeros(n_bootstrap)

    for b in range(n_bootstrap):
        resampled_indices = []

        # Resample within each stratum
        for s in unique_strata:
            indices = stratum_indices[s]
            n_stratum = len(indices)
            # Sample with replacement within stratum
            selected = rng.choice(indices, size=n_stratum, replace=True)
            resampled_indices.extend(selected)

        # Extract resampled data
        resampled_data = data[resampled_indices]

        # Compute statistic
        bootstrap_stats[b] = statistic_fn(resampled_data)

    return bootstrap_stats


def cluster_bootstrap(data: np.ndarray,
                      clusters: np.ndarray,
                      statistic_fn,
                      n_bootstrap: int = 1000,
                      random_state: int = 42) -> np.ndarray:
    """
    Perform cluster bootstrap resampling.

    Resamples CLUSTERS with replacement, then includes all
    observations within selected clusters.

    Parameters
    ----------
    data : np.ndarray
        Data array of shape (n_samples,) or (n_samples, n_features)
    clusters : np.ndarray
        Cluster labels for each observation, shape (n_samples,)
    statistic_fn : Callable
        Function computing statistic from resampled data
    n_bootstrap : int
        Number of bootstrap iterations
    random_state : int
        Random seed

    Returns
    -------
    np.ndarray
        Array of B bootstrap statistics
    """
    rng = np.random.RandomState(random_state)
    unique_clusters = np.unique(clusters)
    n_clusters = len(unique_clusters)

    # Indices for each cluster
    cluster_indices = {c: np.where(clusters == c)[0] for c in unique_clusters}

    bootstrap_stats = np.zeros(n_bootstrap)

    for b in range(n_bootstrap):
        # Sample clusters with replacement
        selected_clusters = rng.choice(unique_clusters, size=n_clusters, replace=True)

        # Collect all observations from selected clusters
        resampled_indices = []
        for c in selected_clusters:
            resampled_indices.extend(cluster_indices[c])

        # Extract resampled data
        resampled_data = data[resampled_indices]

        # Compute statistic
        bootstrap_stats[b] = statistic_fn(resampled_data)

    return bootstrap_stats


# Example: Time series block bootstrap
def block_bootstrap(data: np.ndarray,
                    block_size: int,
                    statistic_fn,
                    n_bootstrap: int = 1000,
                    random_state: int = 42) -> np.ndarray:
    """
    Perform block bootstrap for time series data.

    Resamples contiguous blocks to preserve temporal dependence.
    Uses the moving block bootstrap (MBB) approach.

    Parameters
    ----------
    data : np.ndarray
        Time series of shape (n_timepoints,)
    block_size : int
        Size of each block (should reflect autocorrelation range)
    statistic_fn : Callable
        Function computing statistic from resampled series
    n_bootstrap : int
        Number of bootstrap iterations
    random_state : int
        Random seed

    Returns
    -------
    np.ndarray
        Array of B bootstrap statistics
    """
    rng = np.random.RandomState(random_state)
    n = len(data)

    # Number of blocks needed to cover at least n observations
    n_blocks_needed = int(np.ceil(n / block_size))

    # Possible starting positions for blocks
    max_start = n - block_size

    bootstrap_stats = np.zeros(n_bootstrap)

    for b in range(n_bootstrap):
        # Sample block starting positions
        starts = rng.randint(0, max_start + 1, size=n_blocks_needed)

        # Construct resampled series from blocks
        resampled = []
        for start in starts:
            resampled.extend(data[start:start + block_size])

        # Trim to exactly n observations
        resampled = np.array(resampled[:n])

        # Compute statistic
        bootstrap_stats[b] = statistic_fn(resampled)

    return bootstrap_stats
```

For time series block bootstrap, block size selection is critical. Blocks that are too small break temporal dependencies; blocks that are too large yield too few distinct blocks for adequate resampling. A common approach is to choose a block length that comfortably exceeds the range of serial dependence (for instance, several times the integrated autocorrelation time τ) while remaining small relative to n so that enough distinct blocks are available.
We've focused on the non-parametric bootstrap, which makes no assumptions about the data distribution. An alternative is the parametric bootstrap, which assumes a specific distributional form.
Non-Parametric Bootstrap: resamples directly from the empirical distribution $\hat{F}_n$ (i.e., draws with replacement from the observed data), making no assumptions about the form of the underlying distribution.
Parametric Bootstrap: fits a parametric model to the data (for example, a Normal distribution with estimated mean and variance) and generates bootstrap samples from the fitted model rather than from the raw data.
```python
import numpy as np


def parametric_bootstrap_normal(data: np.ndarray,
                                statistic_fn,
                                n_bootstrap: int = 1000,
                                random_state: int = 42) -> np.ndarray:
    """
    Parametric bootstrap assuming a Normal distribution.

    1. Estimate μ and σ² from data
    2. Sample from N(μ̂, σ̂²)
    3. Compute statistic on parametric samples

    Parameters
    ----------
    data : np.ndarray
        Original sample
    statistic_fn : Callable
        Statistic to compute
    n_bootstrap : int
        Number of iterations
    random_state : int
        Random seed

    Returns
    -------
    np.ndarray
        Bootstrap statistics
    """
    rng = np.random.RandomState(random_state)
    n = len(data)

    # Fit Normal distribution
    mu_hat = np.mean(data)
    sigma_hat = np.std(data, ddof=1)  # Unbiased variance estimate

    bootstrap_stats = np.zeros(n_bootstrap)

    for b in range(n_bootstrap):
        # Generate parametric sample
        parametric_sample = rng.normal(mu_hat, sigma_hat, size=n)

        # Compute statistic
        bootstrap_stats[b] = statistic_fn(parametric_sample)

    return bootstrap_stats


def model_residual_bootstrap(X: np.ndarray,
                             y: np.ndarray,
                             model,
                             n_bootstrap: int = 500,
                             random_state: int = 42) -> dict:
    """
    Residual bootstrap for regression models.

    A semi-parametric approach:
    1. Fit model, compute residuals
    2. Resample residuals (non-parametric for residuals)
    3. Reconstruct responses with resampled residuals
    4. Refit model on reconstructed data

    This preserves the X distribution while allowing residual variation.

    Parameters
    ----------
    X : np.ndarray
        Feature matrix
    y : np.ndarray
        Response vector
    model : sklearn-like estimator
        Model with fit/predict methods
    n_bootstrap : int
        Number of bootstrap iterations
    random_state : int
        Random seed

    Returns
    -------
    dict
        Bootstrap coefficient distributions
    """
    from sklearn.base import clone

    rng = np.random.RandomState(random_state)
    n = len(y)

    # Fit original model
    model.fit(X, y)
    y_pred = model.predict(X)
    residuals = y - y_pred

    # Center residuals (to remove any bias)
    residuals_centered = residuals - np.mean(residuals)

    # Store bootstrap coefficients
    if hasattr(model, 'coef_'):
        n_coefs = len(model.coef_) if hasattr(model.coef_, '__len__') else 1
        bootstrap_coefs = np.zeros((n_bootstrap, n_coefs))
    else:
        bootstrap_coefs = None
    bootstrap_intercepts = np.zeros(n_bootstrap)

    for b in range(n_bootstrap):
        # Resample residuals
        resampled_indices = rng.choice(n, size=n, replace=True)
        resampled_residuals = residuals_centered[resampled_indices]

        # Construct new response
        y_star = y_pred + resampled_residuals

        # Refit model
        model_b = clone(model)
        model_b.fit(X, y_star)

        # Store coefficients
        if hasattr(model_b, 'coef_'):
            bootstrap_coefs[b] = model_b.coef_.flatten()
        if hasattr(model_b, 'intercept_'):
            bootstrap_intercepts[b] = model_b.intercept_

    return {
        'coefficients': bootstrap_coefs,
        'intercepts': bootstrap_intercepts,
        'coef_means': np.mean(bootstrap_coefs, axis=0) if bootstrap_coefs is not None else None,
        'coef_stds': np.std(bootstrap_coefs, axis=0) if bootstrap_coefs is not None else None,
    }
```

Use the non-parametric bootstrap when: (1) the distribution is unknown, (2) you want robustness over efficiency, (3) the sample size is moderate to large. Use the parametric bootstrap when: (1) the distributional model is well-justified, (2) the sample size is small, (3) you need to extrapolate beyond the observed range. Use the residual bootstrap for regression when you trust the model form but not the error distribution.
Bootstrap resampling represents a paradigm shift in statistical inference—from analytical derivation to computational simulation. By treating our sample as if it were the population and resampling from it, we can estimate the sampling distribution of any statistic without requiring distributional assumptions.
What's Next:
In the next pages, we'll explore specific bootstrap estimators designed for machine learning, including the out-of-bootstrap estimator introduced above and the .632 estimator (and its refinements) that correct its pessimistic bias.
These methods address the specific challenges of estimating generalization performance when the same data used for training informs our error estimates.
You now understand the fundamental mechanics and theory of bootstrap resampling. This foundation is essential for the more sophisticated estimators we'll develop in subsequent pages. The bootstrap's power lies in its generality—once you understand resampling from the empirical distribution, you can estimate the uncertainty of virtually any quantity.