In 1979, Bradley Efron introduced a deceptively simple idea that would transform statistical inference: what if we could estimate the sampling distribution of any statistic by resampling from our observed data? This idea—the bootstrap—eliminated the need for complex analytical derivations and opened the door to statistical inference for virtually any quantity of interest.
The term 'bootstrap' comes from the phrase 'pulling oneself up by one's bootstraps,' a reference to something seemingly impossible. Indeed, the bootstrap appears to create information from nothing: we have only one dataset, yet we can estimate the variability of statistics as if we had collected many datasets. This apparent paradox dissolves when we understand the bootstrap's theoretical foundation.
By the end of this page, you will understand: (1) The theoretical foundations of bootstrap resampling, including the plug-in principle and empirical distribution approximation; (2) The mechanics of generating bootstrap samples and computing bootstrap statistics; (3) The asymptotic properties that justify bootstrap inference; (4) Common pitfalls and when bootstrap methods fail; (5) Practical implementation for machine learning model evaluation.
Statistical inference centers on a fundamental challenge: we observe one sample from a population but must make statements about population parameters. The inherent uncertainty in our estimates stems from the fact that a different sample would yield different estimates—this variability is captured by the sampling distribution.
The Classical Approach:
Traditionally, statisticians derived sampling distributions analytically. For the sample mean $\bar{X}$ from a normal population with known variance $\sigma^2$, we know:
$$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$
This elegant result allows us to construct confidence intervals and conduct hypothesis tests. However, the analytical approach has severe limitations: it relies on distributional assumptions, it requires that the mathematics be tractable, and for many statistics of practical interest (medians, ratios, correlation coefficients, model performance metrics) no closed-form sampling distribution exists.
The bootstrap takes an empirical approach: rather than deriving the sampling distribution mathematically, we simulate it by treating our sample as if it were the population. By repeatedly sampling from our observed data, we can approximate the sampling distribution of any statistic, however complex.
The Key Insight:
Let $F$ be the true (unknown) population distribution and $\hat{F}_n$ be the empirical distribution function—the discrete distribution placing probability $1/n$ on each observed data point. The bootstrap substitutes $\hat{F}_n$ for $F$ whenever the sampling distribution depends on the unknown population.
If a statistic has a sampling distribution determined by the unknown $F$, the bootstrap approximates that distribution by recomputing the statistic on samples simulated from $\hat{F}_n$. This is the plug-in principle: replace unknown quantities with their empirical estimates and simulate.
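To make this concrete, here is a minimal sketch (with made-up numbers) of the empirical distribution: $\hat{F}_n$ is the step function that jumps by $1/n$ at each observation, and drawing from it is exactly sampling the data with replacement.

```python
import numpy as np

# Toy data: five observations (hypothetical values for illustration)
x = np.array([2.1, 3.5, 0.7, 4.2, 1.9])
n = len(x)

def ecdf(t: float) -> float:
    """Empirical distribution function: F_hat(t) = (# of x_i <= t) / n."""
    return np.mean(x <= t)

print(ecdf(2.0))   # 0.4 -> two of the five observations are <= 2.0

# Sampling from F_hat is exactly sampling with replacement from the data:
rng = np.random.default_rng(0)
bootstrap_sample = rng.choice(x, size=n, replace=True)
print(bootstrap_sample)
```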
The bootstrap algorithm is remarkably simple to implement, belying its deep theoretical foundations. Let $X = (X_1, X_2, \ldots, X_n)$ be our observed sample and $\hat{\theta} = s(X)$ be a statistic of interest computed from this sample.
The Non-Parametric Bootstrap Algorithm:
Generate a bootstrap sample: Draw $n$ observations with replacement from the original sample $X$, creating $X^* = (X_1^*, X_2^*, \ldots, X_n^*)$
Compute the bootstrap statistic: Calculate $\hat{\theta}^* = s(X^*)$ on the bootstrap sample
Repeat: Perform steps 1-2 a total of $B$ times, generating bootstrap statistics $\hat{\theta}^*_1, \hat{\theta}^*_2, \ldots, \hat{\theta}^*_B$
Analyze the bootstrap distribution: Use the collection of $\hat{\theta}^*_b$ values to estimate properties of the sampling distribution
```python
import numpy as np
from typing import Callable, Tuple


def bootstrap_resample(data: np.ndarray,
                       statistic_fn: Callable[[np.ndarray], float],
                       n_bootstrap: int = 10000,
                       random_state: int = None) -> np.ndarray:
    """
    Perform non-parametric bootstrap resampling.

    Parameters
    ----------
    data : np.ndarray
        Original sample data of shape (n_samples,) or (n_samples, n_features)
    statistic_fn : Callable
        Function that computes the statistic of interest from a sample
    n_bootstrap : int
        Number of bootstrap iterations (B)
    random_state : int, optional
        Random seed for reproducibility

    Returns
    -------
    np.ndarray
        Array of B bootstrap statistics

    The algorithm:
    1. For each bootstrap iteration b = 1, ..., B:
       a. Draw n samples WITH REPLACEMENT from data
       b. Compute statistic on resampled data
    2. Return all B statistics
    """
    rng = np.random.RandomState(random_state)
    n_samples = len(data)
    bootstrap_statistics = np.zeros(n_bootstrap)

    for b in range(n_bootstrap):
        # Step 1: Sample WITH REPLACEMENT
        # Each observation has probability 1/n of being selected each time
        indices = rng.choice(n_samples, size=n_samples, replace=True)
        bootstrap_sample = data[indices]

        # Step 2: Compute statistic on bootstrap sample
        bootstrap_statistics[b] = statistic_fn(bootstrap_sample)

    return bootstrap_statistics


def compute_bootstrap_standard_error(data: np.ndarray,
                                     statistic_fn: Callable[[np.ndarray], float],
                                     n_bootstrap: int = 10000) -> Tuple[float, float]:
    """
    Compute bootstrap estimate of standard error.

    This is the most basic bootstrap application: estimating the
    standard deviation of the sampling distribution.

    Returns
    -------
    Tuple[float, float]
        (original_statistic, bootstrap_standard_error)
    """
    # Original statistic
    theta_hat = statistic_fn(data)

    # Bootstrap distribution
    theta_stars = bootstrap_resample(data, statistic_fn, n_bootstrap)

    # Bootstrap SE is just the standard deviation of bootstrap statistics
    se_boot = np.std(theta_stars, ddof=1)

    return theta_hat, se_boot


# Example: Bootstrap SE of the sample mean
if __name__ == "__main__":
    # Generate sample data
    np.random.seed(42)
    sample = np.random.exponential(scale=2.0, size=50)

    # Statistic: sample mean
    theta_hat, se_boot = compute_bootstrap_standard_error(
        sample, statistic_fn=np.mean, n_bootstrap=10000
    )

    # Compare with analytical SE (known for the mean)
    se_analytical = np.std(sample, ddof=1) / np.sqrt(len(sample))

    print(f"Sample mean: {theta_hat:.4f}")
    print(f"Bootstrap SE: {se_boot:.4f}")
    print(f"Analytical SE: {se_analytical:.4f}")
    print(f"Ratio (Boot/Analytical): {se_boot/se_analytical:.4f}")
```

The 'with replacement' aspect is crucial. Since we're treating the empirical distribution as the population, sampling with replacement is equivalent to drawing from this discrete distribution. Without replacement, we'd just get permutations of our sample—every bootstrap sample would have exactly the same composition, just in different order.
A critical property of bootstrap sampling concerns which observations appear in each bootstrap sample. When we draw $n$ samples with replacement from $n$ observations, not all original observations will appear. Understanding this probability structure is essential for methods like the out-of-bootstrap estimator.
Probability of Being Included:
Consider a specific observation $X_i$. The probability that $X_i$ is not selected in a single draw is $(n-1)/n$. Over $n$ independent draws:
$$P(X_i \text{ not in bootstrap sample}) = \left(\frac{n-1}{n}\right)^n = \left(1 - \frac{1}{n}\right)^n$$
As $n \to \infty$, this converges to a famous limit:
$$\lim_{n \to \infty} \left(1 - \frac{1}{n}\right)^n = e^{-1} \approx 0.368$$
This means approximately 36.8% of observations are excluded from each bootstrap sample.
| Sample Size (n) | Exact P(excluded) | Percent Excluded | Observations Excluded (Expected) |
|---|---|---|---|
| 10 | 0.3487 | 34.87% | ~3.5 observations |
| 25 | 0.3604 | 36.04% | ~9 observations |
| 50 | 0.3642 | 36.42% | ~18 observations |
| 100 | 0.3660 | 36.60% | ~37 observations |
| 500 | 0.3675 | 36.75% | ~184 observations |
| 1000 | 0.3677 | 36.77% | ~368 observations |
| ∞ | 0.3679 (1/e) | 36.79% | ~36.79% of n |
Expected Number of Unique Observations:
Conversely, the expected proportion of unique observations in a bootstrap sample is:
$$E\left[\frac{\text{unique obs}}{n}\right] = 1 - \left(1 - \frac{1}{n}\right)^n \approx 1 - e^{-1} \approx 0.632$$
This 0.632 figure is not merely a theoretical curiosity: it directly motivates the .632 bootstrap estimator we'll study later. The included observations contribute to training, while excluded observations serve as natural test sets.
```python
import numpy as np


def analyze_bootstrap_inclusion(n_samples: int,
                                n_bootstrap: int = 10000,
                                random_state: int = 42) -> dict:
    """
    Empirically verify bootstrap inclusion probabilities.

    Parameters
    ----------
    n_samples : int
        Size of the dataset
    n_bootstrap : int
        Number of bootstrap samples to generate
    random_state : int
        Random seed

    Returns
    -------
    dict
        Statistics about inclusion patterns
    """
    rng = np.random.RandomState(random_state)

    # Track inclusion across bootstrap samples
    inclusion_counts = np.zeros((n_bootstrap, n_samples))
    unique_counts = np.zeros(n_bootstrap)

    for b in range(n_bootstrap):
        # Generate bootstrap indices
        indices = rng.choice(n_samples, size=n_samples, replace=True)

        # Count unique observations
        unique_counts[b] = len(np.unique(indices))

        # Track which observations were included
        for idx in indices:
            inclusion_counts[b, idx] += 1

    # Theoretical values
    theoretical_exclusion = (1 - 1/n_samples)**n_samples
    theoretical_inclusion = 1 - theoretical_exclusion

    # Empirical values
    empirical_unique_mean = np.mean(unique_counts)
    empirical_unique_std = np.std(unique_counts)

    # For each bootstrap sample, count how many observations were excluded
    never_included_per_bootstrap = np.sum(inclusion_counts == 0, axis=1)
    empirical_exclusion_rate = np.mean(never_included_per_bootstrap) / n_samples

    return {
        'n_samples': n_samples,
        'theoretical_exclusion': theoretical_exclusion,
        'theoretical_inclusion': theoretical_inclusion,
        'empirical_exclusion_rate': empirical_exclusion_rate,
        'mean_unique_obs': empirical_unique_mean,
        'std_unique_obs': empirical_unique_std,
        'theoretical_unique': n_samples * theoretical_inclusion,
    }


# Demonstrate the 1/e phenomenon
if __name__ == "__main__":
    import math

    print("Bootstrap Inclusion Probability Analysis")
    print("=" * 60)
    print(f"Theoretical limit (n→∞): P(excluded) = 1/e ≈ {1/math.e:.6f}")
    print()

    for n in [10, 50, 100, 500, 1000]:
        results = analyze_bootstrap_inclusion(n, n_bootstrap=5000)
        print(f"n = {n:4d}: Theoretical = {results['theoretical_exclusion']:.4f}, "
              f"Empirical = {results['empirical_exclusion_rate']:.4f}, "
              f"Unique obs = {results['mean_unique_obs']:.1f} ± {results['std_unique_obs']:.1f}")
```

The included observations don't appear exactly once—some appear multiple times. On average, each included observation appears 1/(1-e^{-1}) ≈ 1.58 times. This means bootstrap samples are not just subsets but weighted versions of the original data, which affects how we interpret bootstrap-based estimates.
The bootstrap's validity rests on deep results from mathematical statistics. Understanding these foundations helps us recognize when bootstrap methods will succeed and when they may fail.
The Glivenko-Cantelli Theorem:
The empirical distribution function $\hat{F}_n(x)$ converges uniformly to the true distribution $F(x)$ almost surely:
$$\sup_x |\hat{F}_n(x) - F(x)| \xrightarrow{a.s.} 0 \text{ as } n \to \infty$$
This theorem establishes that the empirical distribution is a consistent estimator of the true distribution in the strongest sense. As our sample size grows, our plug-in approximation becomes arbitrarily accurate.
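A quick numerical illustration (a sketch, taking a standard Normal as the known $F$) shows this uniform distance shrinking as $n$ grows:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Measure sup_x |F_n(x) - F(x)| for samples from a standard Normal
for n in [50, 500, 5000, 50000]:
    sample = rng.standard_normal(n)
    grid = np.sort(sample)
    # Empirical CDF evaluated at the sorted sample points
    F_n = np.arange(1, n + 1) / n
    # True CDF at the same points
    F = stats.norm.cdf(grid)
    # The supremum is attained at (or just before) one of the jump points
    sup_diff = np.max(np.maximum(np.abs(F_n - F), np.abs(F_n - 1 / n - F)))
    print(f"n = {n:6d}: sup |F_n - F| = {sup_diff:.4f}")
```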
Bootstrap Consistency:
For the bootstrap to be valid, we need the bootstrap distribution to converge to the true sampling distribution. Formally, let $\hat{\theta}_n$ be our statistic and $\hat{\theta}_n^*$ be the bootstrap version. The bootstrap is consistent if:
$$\sup_t \left| P^*\!\left(\sqrt{n}(\hat{\theta}_n^* - \hat{\theta}_n) \leq t\right) - P\!\left(\sqrt{n}(\hat{\theta}_n - \theta) \leq t\right) \right| \xrightarrow{P} 0$$
where $P^*$ denotes probability under the bootstrap distribution (conditional on the observed data).
The bootstrap is NOT universally valid. It can fail catastrophically for: (1) Extreme value statistics like the sample maximum—the bootstrap cannot generate values beyond the observed range; (2) Infinite variance distributions where the CLT doesn't apply; (3) Non-i.i.d. data without appropriate modifications (block bootstrap for time series); (4) Statistics whose limiting distribution depends on local properties of F that the empirical distribution cannot capture.
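Failure mode (1) is easy to demonstrate. In the sketch below (hypothetical Uniform data), the bootstrap distribution of the sample maximum places probability roughly $1 - 1/e \approx 0.632$ on the single observed maximum, so it cannot mimic the continuous limiting distribution of the true maximum:

```python
import numpy as np

# Sketch: bootstrap failure for the sample maximum.
# A bootstrap sample can never exceed the observed maximum, and the observed
# maximum itself appears in a bootstrap sample with probability 1 - (1 - 1/n)^n.
rng = np.random.default_rng(0)
n, B = 100, 10000
sample = rng.uniform(0, 1, size=n)
observed_max = sample.max()

boot_maxes = np.array([
    rng.choice(sample, size=n, replace=True).max() for _ in range(B)
])

print(f"Observed maximum:                  {observed_max:.4f}")
print(f"P*(bootstrap max == observed max): {np.mean(boot_maxes == observed_max):.3f}")
print(f"Theoretical limit 1 - 1/e:         {1 - np.exp(-1):.3f}")
```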
The Plug-in Principle in Detail:
The bootstrap fundamentally relies on the plug-in principle: if a parameter $\theta = \theta(F)$ is a functional of the distribution, we estimate it by $\hat{\theta} = \theta(\hat{F}_n)$. Similarly, if the standard error of $\hat{\theta}$ depends on $F$, we estimate it by plugging in $\hat{F}_n$.
For machine learning, this principle extends naturally. If we want to estimate the variance of a test error estimate, we plug in the empirical distribution: resample the evaluation data, recompute the error metric on each resample, and take the variance of the resulting values.
The beauty of this approach is its generality—it works for any statistic, any model, any evaluation metric, without requiring analytical derivations.
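For instance, a minimal sketch (with hypothetical numbers standing in for a fixed test set and a fitted model's predictions) applies the plug-in principle to a test-set MSE by resampling the per-example errors:

```python
import numpy as np

# Sketch: bootstrap the uncertainty of a test-error estimate.
# y_true / y_pred stand in for held-out labels and a fitted model's predictions.
rng = np.random.default_rng(7)
y_true = rng.normal(size=200)
y_pred = y_true + rng.normal(scale=0.5, size=200)

squared_errors = (y_true - y_pred) ** 2
test_mse = squared_errors.mean()

# Plug-in step: resample the per-example errors and recompute the metric
B = 5000
n = len(squared_errors)
boot_mse = np.array([
    rng.choice(squared_errors, size=n, replace=True).mean() for _ in range(B)
])

print(f"Test MSE:          {test_mse:.4f}")
print(f"Bootstrap SE:      {boot_mse.std(ddof=1):.4f}")
print(f"95% percentile CI: [{np.percentile(boot_mse, 2.5):.4f}, "
      f"{np.percentile(boot_mse, 97.5):.4f}]")
```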
In machine learning, we face a specific form of the statistical inference problem: estimating the generalization performance of a model trained on finite data. The bootstrap provides a powerful framework for this task, with several variants suited to different scenarios.
The Basic Bootstrap Error Estimate:
A naive application of the bootstrap to model evaluation proceeds as follows: draw a bootstrap sample, train the model on it, evaluate the trained model on the original observations, and average the resulting errors over $B$ iterations.
However, this estimates training error, not generalization error—the same observations used for training are used for evaluation. The result is severely optimistically biased.
Evaluating on the same data used for training gives optimistically biased error estimates. For flexible models that can interpolate the training data, this 'resubstitution error' can approach zero even when the true generalization error is substantial. The bootstrap's value for ML comes from leveraging the OUT-OF-BOOTSTRAP samples for evaluation.
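The optimism is easy to reproduce. The sketch below (hypothetical data, an unpruned decision tree) shows the resubstitution error collapsing toward zero while the held-out error remains large:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression

# Sketch: resubstitution error vs. held-out error for a flexible model
X, y = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

resub_mse = np.mean((y_train - model.predict(X_train)) ** 2)
holdout_mse = np.mean((y_test - model.predict(X_test)) ** 2)

print(f"Resubstitution MSE (evaluate on training data): {resub_mse:.2f}")   # ~0
print(f"Held-out MSE (honest estimate):                 {holdout_mse:.2f}")  # much larger
```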
The Leave-One-Out Bootstrap (LOOB) / Out-of-Bootstrap Error:
The key insight is that the ~36.8% of observations NOT in a bootstrap sample can serve as a test set. For each bootstrap iteration, train the model on the bootstrap sample and evaluate it only on the observations left out of that sample; the final estimate averages these out-of-bag errors across iterations.
This gives an approximately unbiased estimate of generalization error, but it has its own issues—the training set size is effectively reduced (to ~63.2% of n on average), causing pessimistic bias.
```python
import numpy as np
from sklearn.base import clone
from typing import Callable


def out_of_bootstrap_error(X: np.ndarray,
                           y: np.ndarray,
                           model,
                           n_bootstrap: int = 200,
                           scoring_fn: Callable = None,
                           random_state: int = 42) -> dict:
    """
    Compute the Out-of-Bootstrap (OOB) error estimate.

    For each bootstrap iteration:
    1. Create bootstrap sample (with replacement)
    2. Train model on bootstrap sample
    3. Evaluate on observations NOT in bootstrap sample

    Parameters
    ----------
    X : np.ndarray
        Feature matrix of shape (n_samples, n_features)
    y : np.ndarray
        Target vector of shape (n_samples,)
    model : sklearn estimator
        Model with fit() and predict() methods
    n_bootstrap : int
        Number of bootstrap iterations
    scoring_fn : Callable, optional
        Custom scoring function(y_true, y_pred) -> float
        Default: mean squared error for regression
    random_state : int
        Random seed

    Returns
    -------
    dict
        Dictionary containing OOB error estimate and diagnostics
    """
    if scoring_fn is None:
        # Default: MSE for regression
        scoring_fn = lambda y_true, y_pred: np.mean((y_true - y_pred)**2)

    rng = np.random.RandomState(random_state)
    n_samples = len(y)

    # Track predictions for each observation when it's OOB
    oob_predictions = {i: [] for i in range(n_samples)}
    bootstrap_errors = []
    oob_sizes = []

    for b in range(n_bootstrap):
        # Generate bootstrap indices
        bootstrap_indices = rng.choice(n_samples, size=n_samples, replace=True)

        # Identify in-bag and out-of-bag observations
        in_bag = set(bootstrap_indices)
        out_of_bag = [i for i in range(n_samples) if i not in in_bag]

        if len(out_of_bag) == 0:
            continue  # Rare case: all observations selected

        oob_sizes.append(len(out_of_bag))

        # Train on bootstrap sample
        X_train = X[bootstrap_indices]
        y_train = y[bootstrap_indices]
        model_clone = clone(model)
        model_clone.fit(X_train, y_train)

        # Predict on OOB observations
        X_oob = X[out_of_bag]
        y_oob_true = y[out_of_bag]
        y_oob_pred = model_clone.predict(X_oob)

        # Store predictions for aggregation
        for i, obs_idx in enumerate(out_of_bag):
            oob_predictions[obs_idx].append(y_oob_pred[i])

        # Compute error for this bootstrap iteration
        bootstrap_errors.append(scoring_fn(y_oob_true, y_oob_pred))

    # Aggregate OOB predictions per observation
    aggregated_predictions = []
    aggregated_true = []
    prediction_counts = []

    for i in range(n_samples):
        if len(oob_predictions[i]) > 0:
            aggregated_predictions.append(np.mean(oob_predictions[i]))
            aggregated_true.append(y[i])
            prediction_counts.append(len(oob_predictions[i]))

    # Final OOB error estimate (using aggregated predictions)
    oob_error = scoring_fn(
        np.array(aggregated_true),
        np.array(aggregated_predictions)
    )

    return {
        'oob_error': oob_error,
        'oob_error_std': np.std(bootstrap_errors),
        'mean_bootstrap_error': np.mean(bootstrap_errors),
        'mean_oob_size': np.mean(oob_sizes),
        'mean_oob_fraction': np.mean(oob_sizes) / n_samples,
        'mean_predictions_per_obs': np.mean(prediction_counts),
        'observations_with_predictions': len(aggregated_true),
    }


# Example usage
if __name__ == "__main__":
    from sklearn.linear_model import Ridge
    from sklearn.datasets import make_regression

    # Generate sample data
    X, y = make_regression(n_samples=200, n_features=10, noise=10, random_state=42)

    # Compute OOB error
    model = Ridge(alpha=1.0)
    results = out_of_bootstrap_error(X, y, model, n_bootstrap=500)

    print("Out-of-Bootstrap Error Estimation")
    print("=" * 50)
    print(f"OOB Error (MSE): {results['oob_error']:.4f}")
    print(f"OOB Error Std: {results['oob_error_std']:.4f}")
    print(f"Mean OOB fraction: {results['mean_oob_fraction']:.4f}")
    print(f"Mean predictions per observation: {results['mean_predictions_per_obs']:.1f}")
    print(f"Observations covered: {results['observations_with_predictions']}/{len(y)}")
```

A practical question: how many bootstrap samples $B$ should we generate? The answer depends on what quantity we're estimating and the precision required.
For Standard Error Estimation:
The bootstrap estimate of standard error is $\hat{SE}_B = \sqrt{\frac{1}{B-1}\sum_{b=1}^{B}(\hat{\theta}^*_b - \bar{\theta}^*)^2}$, where $\bar{\theta}^* = \frac{1}{B}\sum_{b=1}^{B}\hat{\theta}^*_b$.
The Monte Carlo error in this estimate (error due to using finite $B$) has approximate coefficient of variation:
$$CV(\hat{SE}_B) \approx \frac{1}{\sqrt{2B}}$$
For $B = 200$, the CV is about 5%—the bootstrap SE is estimated with ~5% relative error. For most practical purposes, $B = 200$–$1000$ is sufficient for standard errors.
| Application | Recommended B | Rationale |
|---|---|---|
| Standard Error Estimation | 200–500 | CV ≈ 3–5% is adequate for most uses |
| Confidence Intervals (Percentile) | 1,000–2,000 | Need accurate percentile estimation at tails |
| Bias Estimation | 500–1,000 | Bias requires stable mean estimation |
| BCa Confidence Intervals | 2,000–5,000 | Acceleration parameter estimation is sensitive |
| Machine Learning Model Comparison | 500–2,000 | Need stable ranking of models |
| Publication-Quality Results | 10,000+ | Minimize Monte Carlo variability |
For exploratory analysis and model development, B = 200–500 is typically sufficient. For final reported results, consider B = 2000 or more. When in doubt, run with increasing B values and check for stability: if results change substantially when doubling B, you need more iterations.
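A minimal sketch of that stability check (illustrative data, bootstrapping the SE of a sample mean) might look like this:

```python
import numpy as np

# Sketch: check Monte Carlo stability of the bootstrap SE by doubling B
rng = np.random.default_rng(3)
data = rng.exponential(scale=2.0, size=50)
n = len(data)

def bootstrap_se(B: int) -> float:
    boot_means = np.array([
        rng.choice(data, size=n, replace=True).mean() for _ in range(B)
    ])
    return boot_means.std(ddof=1)

for B in [200, 400, 800, 1600, 3200]:
    print(f"B = {B:5d}: bootstrap SE of the mean = {bootstrap_se(B):.4f}")
# If successive values agree to the precision you care about, B is large enough.
```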
Computational Considerations:
The total computational cost scales as $O(B \times \text{cost of fitting and evaluating one model})$. For expensive models (deep neural networks, large gradient boosting ensembles), this cost can be prohibitive.
Strategies for reducing the computational burden include:
- Using a modest $B$ (e.g., 200) during development and a larger $B$ only for final reported results.
- Parallelizing across iterations: bootstrap replicates are independent, so they parallelize trivially.
- Vectorizing the resampling for cheap statistics (see the sketch below).
- Moving expensive, data-independent preprocessing outside the bootstrap loop.
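As an example of the vectorization strategy, the following sketch replaces the Python-level loop with a single $(B, n)$ index array; this only pays off for statistics that are cheap to compute row-wise:

```python
import numpy as np

# Sketch: vectorized bootstrap for cheap statistics (here, the mean)
rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=500)
n, B = len(data), 10_000

indices = rng.integers(0, n, size=(B, n))   # B bootstrap index vectors at once
boot_means = data[indices].mean(axis=1)     # statistic computed row-wise

print(f"Bootstrap SE of the mean: {boot_means.std(ddof=1):.4f}")
# Note: data[indices] materializes a (B, n) array; for very large n or B,
# process the iterations in chunks instead.
```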
The standard non-parametric bootstrap assumes i.i.d. observations. When this assumption is violated, we need modified approaches.
Paired Data Bootstrap:
When observations are paired (e.g., before-after measurements, matched case-control), we must resample pairs together to preserve the dependence structure.
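A minimal sketch (hypothetical before/after data) of the paired bootstrap, resampling whole rows so each pair stays intact:

```python
import numpy as np

# Sketch: paired bootstrap for before/after measurements.
# Rows (pairs) are resampled together, preserving the within-pair dependence.
rng = np.random.default_rng(1)
before = rng.normal(loc=10.0, scale=2.0, size=40)
after = before + rng.normal(loc=-1.0, scale=1.0, size=40)   # correlated with 'before'
pairs = np.column_stack([before, after])                    # shape (n_pairs, 2)

n_pairs, B = len(pairs), 5000
boot_mean_diff = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n_pairs, size=n_pairs)   # resample whole rows
    resampled = pairs[idx]
    boot_mean_diff[b] = (resampled[:, 1] - resampled[:, 0]).mean()

print(f"Mean change (after - before): {(after - before).mean():.3f}")
print(f"Bootstrap SE of the change:   {boot_mean_diff.std(ddof=1):.3f}")
```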
Clustered Data Bootstrap:
With grouped or clustered data (patients within hospitals, students within schools), resample at the cluster level—select clusters with replacement, then include all observations within selected clusters.
Stratified Bootstrap:
When samples are stratified (e.g., different demographic groups), resample within strata separately to preserve the stratification structure.
```python
import numpy as np


def stratified_bootstrap(data: np.ndarray,
                         strata: np.ndarray,
                         statistic_fn,
                         n_bootstrap: int = 1000,
                         random_state: int = 42) -> np.ndarray:
    """
    Perform stratified bootstrap resampling.

    Resamples WITHIN each stratum separately, preserving the
    stratification structure.

    Parameters
    ----------
    data : np.ndarray
        Data array of shape (n_samples,) or (n_samples, n_features)
    strata : np.ndarray
        Stratum labels for each observation, shape (n_samples,)
    statistic_fn : Callable
        Function computing statistic from resampled data
    n_bootstrap : int
        Number of bootstrap iterations
    random_state : int
        Random seed

    Returns
    -------
    np.ndarray
        Array of B bootstrap statistics
    """
    rng = np.random.RandomState(random_state)
    unique_strata = np.unique(strata)

    # Indices for each stratum
    stratum_indices = {s: np.where(strata == s)[0] for s in unique_strata}

    bootstrap_stats = np.zeros(n_bootstrap)

    for b in range(n_bootstrap):
        resampled_indices = []

        # Resample within each stratum
        for s in unique_strata:
            indices = stratum_indices[s]
            n_stratum = len(indices)
            # Sample with replacement within stratum
            selected = rng.choice(indices, size=n_stratum, replace=True)
            resampled_indices.extend(selected)

        # Extract resampled data
        resampled_data = data[resampled_indices]

        # Compute statistic
        bootstrap_stats[b] = statistic_fn(resampled_data)

    return bootstrap_stats


def cluster_bootstrap(data: np.ndarray,
                      clusters: np.ndarray,
                      statistic_fn,
                      n_bootstrap: int = 1000,
                      random_state: int = 42) -> np.ndarray:
    """
    Perform cluster bootstrap resampling.

    Resamples CLUSTERS with replacement, then includes all
    observations within selected clusters.

    Parameters
    ----------
    data : np.ndarray
        Data array of shape (n_samples,) or (n_samples, n_features)
    clusters : np.ndarray
        Cluster labels for each observation, shape (n_samples,)
    statistic_fn : Callable
        Function computing statistic from resampled data
    n_bootstrap : int
        Number of bootstrap iterations
    random_state : int
        Random seed

    Returns
    -------
    np.ndarray
        Array of B bootstrap statistics
    """
    rng = np.random.RandomState(random_state)
    unique_clusters = np.unique(clusters)
    n_clusters = len(unique_clusters)

    # Indices for each cluster
    cluster_indices = {c: np.where(clusters == c)[0] for c in unique_clusters}

    bootstrap_stats = np.zeros(n_bootstrap)

    for b in range(n_bootstrap):
        # Sample clusters with replacement
        selected_clusters = rng.choice(unique_clusters, size=n_clusters, replace=True)

        # Collect all observations from selected clusters
        resampled_indices = []
        for c in selected_clusters:
            resampled_indices.extend(cluster_indices[c])

        # Extract resampled data
        resampled_data = data[resampled_indices]

        # Compute statistic
        bootstrap_stats[b] = statistic_fn(resampled_data)

    return bootstrap_stats


# Example: Time series block bootstrap
def block_bootstrap(data: np.ndarray,
                    block_size: int,
                    statistic_fn,
                    n_bootstrap: int = 1000,
                    random_state: int = 42) -> np.ndarray:
    """
    Perform block bootstrap for time series data.

    Resamples contiguous blocks to preserve temporal dependence.
    Uses the moving block bootstrap (MBB) approach.

    Parameters
    ----------
    data : np.ndarray
        Time series of shape (n_timepoints,)
    block_size : int
        Size of each block (should reflect autocorrelation range)
    statistic_fn : Callable
        Function computing statistic from resampled series
    n_bootstrap : int
        Number of bootstrap iterations
    random_state : int
        Random seed

    Returns
    -------
    np.ndarray
        Array of B bootstrap statistics
    """
    rng = np.random.RandomState(random_state)
    n = len(data)

    # Number of blocks needed to cover at least n observations
    n_blocks_needed = int(np.ceil(n / block_size))

    # Possible starting positions for blocks
    max_start = n - block_size

    bootstrap_stats = np.zeros(n_bootstrap)

    for b in range(n_bootstrap):
        # Sample block starting positions
        starts = rng.randint(0, max_start + 1, size=n_blocks_needed)

        # Construct resampled series from blocks
        resampled = []
        for start in starts:
            resampled.extend(data[start:start + block_size])

        # Trim to exactly n observations
        resampled = np.array(resampled[:n])

        # Compute statistic
        bootstrap_stats[b] = statistic_fn(resampled)

    return bootstrap_stats
```

For time series block bootstrap, block size selection is critical. Blocks that are too small break temporal dependencies; blocks that are too large yield too few distinct blocks for adequate resampling. A common approach is to choose a block length that comfortably exceeds the range of serial dependence (for instance, several times the integrated autocorrelation time τ) while remaining small relative to n so that enough distinct blocks are available.
We've focused on the non-parametric bootstrap, which makes no assumptions about the data distribution. An alternative is the parametric bootstrap, which assumes a specific distributional form.
Non-Parametric Bootstrap: resamples directly from the empirical distribution $\hat{F}_n$ (i.e., draws with replacement from the observed data), making no assumptions about the form of the underlying distribution.
Parametric Bootstrap: fits a parametric model to the data (for example, a Normal distribution with estimated mean and variance) and generates bootstrap samples from the fitted model rather than from the raw data.
```python
import numpy as np


def parametric_bootstrap_normal(data: np.ndarray,
                                statistic_fn,
                                n_bootstrap: int = 1000,
                                random_state: int = 42) -> np.ndarray:
    """
    Parametric bootstrap assuming a Normal distribution.

    1. Estimate μ and σ² from data
    2. Sample from N(μ̂, σ̂²)
    3. Compute statistic on parametric samples

    Parameters
    ----------
    data : np.ndarray
        Original sample
    statistic_fn : Callable
        Statistic to compute
    n_bootstrap : int
        Number of iterations
    random_state : int
        Random seed

    Returns
    -------
    np.ndarray
        Bootstrap statistics
    """
    rng = np.random.RandomState(random_state)
    n = len(data)

    # Fit Normal distribution
    mu_hat = np.mean(data)
    sigma_hat = np.std(data, ddof=1)  # Unbiased variance estimate

    bootstrap_stats = np.zeros(n_bootstrap)

    for b in range(n_bootstrap):
        # Generate parametric sample
        parametric_sample = rng.normal(mu_hat, sigma_hat, size=n)

        # Compute statistic
        bootstrap_stats[b] = statistic_fn(parametric_sample)

    return bootstrap_stats


def model_residual_bootstrap(X: np.ndarray,
                             y: np.ndarray,
                             model,
                             n_bootstrap: int = 500,
                             random_state: int = 42) -> dict:
    """
    Residual bootstrap for regression models.

    A semi-parametric approach:
    1. Fit model, compute residuals
    2. Resample residuals (non-parametric for residuals)
    3. Reconstruct responses with resampled residuals
    4. Refit model on reconstructed data

    This preserves the X distribution while allowing residual variation.

    Parameters
    ----------
    X : np.ndarray
        Feature matrix
    y : np.ndarray
        Response vector
    model : sklearn-like estimator
        Model with fit/predict methods
    n_bootstrap : int
        Number of bootstrap iterations
    random_state : int
        Random seed

    Returns
    -------
    dict
        Bootstrap coefficient distributions
    """
    from sklearn.base import clone

    rng = np.random.RandomState(random_state)
    n = len(y)

    # Fit original model
    model.fit(X, y)
    y_pred = model.predict(X)
    residuals = y - y_pred

    # Center residuals (to remove any bias)
    residuals_centered = residuals - np.mean(residuals)

    # Store bootstrap coefficients
    if hasattr(model, 'coef_'):
        n_coefs = len(model.coef_) if hasattr(model.coef_, '__len__') else 1
        bootstrap_coefs = np.zeros((n_bootstrap, n_coefs))
    else:
        bootstrap_coefs = None
    bootstrap_intercepts = np.zeros(n_bootstrap)

    for b in range(n_bootstrap):
        # Resample residuals
        resampled_indices = rng.choice(n, size=n, replace=True)
        resampled_residuals = residuals_centered[resampled_indices]

        # Construct new response
        y_star = y_pred + resampled_residuals

        # Refit model
        model_b = clone(model)
        model_b.fit(X, y_star)

        # Store coefficients
        if hasattr(model_b, 'coef_'):
            bootstrap_coefs[b] = model_b.coef_.flatten()
        if hasattr(model_b, 'intercept_'):
            bootstrap_intercepts[b] = model_b.intercept_

    return {
        'coefficients': bootstrap_coefs,
        'intercepts': bootstrap_intercepts,
        'coef_means': np.mean(bootstrap_coefs, axis=0) if bootstrap_coefs is not None else None,
        'coef_stds': np.std(bootstrap_coefs, axis=0) if bootstrap_coefs is not None else None,
    }
```

Use the non-parametric bootstrap when: (1) the distribution is unknown, (2) you want robustness over efficiency, (3) the sample size is moderate to large. Use the parametric bootstrap when: (1) the distributional model is well-justified, (2) the sample size is small, (3) you need to extrapolate beyond the observed range. Use the residual bootstrap for regression when you trust the model form but not the error distribution.
Bootstrap resampling represents a paradigm shift in statistical inference—from analytical derivation to computational simulation. By treating our sample as if it were the population and resampling from it, we can estimate the sampling distribution of any statistic without requiring distributional assumptions.
What's Next:
In the next pages, we'll explore specific bootstrap estimators designed for machine learning, including the out-of-bootstrap estimator introduced above and the .632 estimator (and its refinements) that correct its pessimistic bias.
These methods address the specific challenges of estimating generalization performance when the same data used for training informs our error estimates.
You now understand the fundamental mechanics and theory of bootstrap resampling. This foundation is essential for the more sophisticated estimators we'll develop in subsequent pages. The bootstrap's power lies in its generality—once you understand resampling from the empirical distribution, you can estimate the uncertainty of virtually any quantity.