Imagine you want to estimate the average height of all adults in a country, but you can only afford to measure 1,000 people. You collect your sample, compute the mean, and report your finding. But a nagging question remains: How confident should you be in this estimate?
Traditionally, answering this question required strong assumptions about the underlying population distribution—assumptions that are often violated in practice. In 1979, Bradley Efron introduced a revolutionary idea that would transform statistics and later become the foundation of one of machine learning's most powerful ensemble techniques: the bootstrap.
The bootstrap's audacious premise is simple yet profound: if you can't resample from the true population, resample from your sample instead. This seemingly circular logic, when properly understood, unlocks remarkable statistical power—and when applied to machine learning, it enables the variance reduction that makes bagging so effective.
By the end of this page, you will understand the mathematical foundations of bootstrap sampling, why sampling with replacement is critical, the key theoretical properties that make bootstrap valid, and how these statistical principles translate directly into the bagging algorithm. You'll gain the rigor needed to understand why bagging works, not just how to apply it.
The fundamental problem in statistical inference is that we have access to a sample, but we want to make conclusions about a population. The bootstrap addresses this by treating the sample as a stand-in for the population.
The Core Idea:
Let $\mathcal{D} = \{x_1, x_2, \ldots, x_n\}$ be a sample of $n$ observations drawn independently from some unknown distribution $F$. A bootstrap sample $\mathcal{D}^* = \{x_1^*, x_2^*, \ldots, x_n^*\}$ is created by sampling $n$ observations from $\mathcal{D}$ with replacement.
The phrase "with replacement" is crucial. After selecting an observation, we place it back into the pool of possible selections. This means:
Sampling with replacement is what makes the bootstrap work. Without replacement, every sample would contain exactly the original observations (just reordered), providing no new information. With replacement, we generate genuinely different datasets that simulate what we might have observed if we had drawn different samples from the true population.
The Bootstrap Principle:
The bootstrap principle states that the relationship between the bootstrap samples and the original sample mimics the relationship between samples from the true population and the population itself.
Formally, let $\hat{\theta}$ be an estimator computed from sample $\mathcal{D}$, and let $\hat{\theta}^*$ be the same estimator computed from bootstrap sample $\mathcal{D}^*$. The bootstrap principle asserts:
$$\text{Distribution of } (\hat{\theta}^* - \hat{\theta}) \approx \text{Distribution of } (\hat{\theta} - \theta)$$
where $\theta$ is the true population parameter.
This is a powerful approximation: the variation of $\hat{\theta}^*$ around $\hat{\theta}$ (which we can compute) approximates the variation of $\hat{\theta}$ around $\theta$ (which we cannot directly observe).
```python
import numpy as np

def bootstrap_sample(data, random_state=None):
    """
    Generate a single bootstrap sample from the data.

    Parameters:
    -----------
    data : array-like
        Original dataset of n observations
    random_state : int, optional
        Random seed for reproducibility

    Returns:
    --------
    bootstrap_indices : ndarray
        Indices of selected observations
    bootstrap_data : ndarray
        The bootstrap sample
    """
    rng = np.random.default_rng(random_state)
    n = len(data)

    # Sample n indices with replacement
    bootstrap_indices = rng.choice(n, size=n, replace=True)

    # Create bootstrap sample
    bootstrap_data = data[bootstrap_indices]

    return bootstrap_indices, bootstrap_data

def generate_bootstrap_samples(data, B, random_state=None):
    """
    Generate B bootstrap samples from the data.

    Parameters:
    -----------
    data : array-like
        Original dataset
    B : int
        Number of bootstrap samples to generate
    random_state : int, optional
        Random seed for reproducibility

    Returns:
    --------
    list of tuples (indices, bootstrap_sample)
    """
    rng = np.random.default_rng(random_state)
    samples = []
    for b in range(B):
        indices, sample = bootstrap_sample(
            data, random_state=rng.integers(0, 2**31)
        )
        samples.append((indices, sample))
    return samples

# Example: Bootstrap sampling demonstration
np.random.seed(42)
original_data = np.array([10, 20, 30, 40, 50])
print(f"Original data: {original_data}")
print(f"Original indices: [0, 1, 2, 3, 4]\n")

print("Three bootstrap samples:")
for i in range(3):
    indices, sample = bootstrap_sample(original_data, random_state=i)
    print(f"  Sample {i+1}: indices = {indices}, values = {sample}")

# Output:
# Original data: [10 20 30 40 50]
# Original indices: [0, 1, 2, 3, 4]
#
# Three bootstrap samples:
#   Sample 1: indices = [0 4 3 0 4], values = [10 50 40 10 50]
#   Sample 2: indices = [4 2 0 0 1], values = [50 30 10 10 20]
#   Sample 3: indices = [0 2 2 0 0], values = [10 30 30 10 10]
```

Notice how in the bootstrap samples above:
• Some observations appear multiple times (e.g., 10 appears twice in Sample 1)
• Some observations are missing entirely (e.g., 20 is absent from Sample 1)

This variability across bootstrap samples is precisely what allows us to estimate the variability of statistics computed from the original sample.
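To see the bootstrap principle in action for a statistic with no simple variance formula, here is a minimal sketch. It assumes a synthetic Exponential(1) population (chosen only so the true median is known) and compares the bootstrap deviations $\hat{\theta}^* - \hat{\theta}$ with the true sampling deviations $\hat{\theta} - \theta$ for the sample median:

```python
# A minimal sketch of the bootstrap principle for the sample median.
# Assumption: an Exponential(1) population, used only so the truth is known.
import numpy as np

rng = np.random.default_rng(0)
true_median = np.log(2.0)       # median of Exponential(1)
n, B = 200, 2000

sample = rng.exponential(1.0, size=n)
theta_hat = np.median(sample)

# Bootstrap world: distribution of (theta_hat* - theta_hat), computable from the sample alone
boot_devs = np.array([
    np.median(rng.choice(sample, size=n, replace=True)) - theta_hat
    for _ in range(B)
])

# Real world (possible here only because we simulate): distribution of (theta_hat - theta)
true_devs = np.array([
    np.median(rng.exponential(1.0, size=n)) - true_median
    for _ in range(B)
])

print(f"SD of bootstrap deviations: {boot_devs.std(ddof=1):.4f}")
print(f"SD of true sampling deviations: {true_devs.std(ddof=1):.4f}")
```

The two spreads should agree closely, which is exactly what the bootstrap principle predicts.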
Understanding the probability properties of bootstrap sampling is essential for grasping why bagging works. Let's derive the key results rigorously.
Probability of a Specific Observation Being Selected:
Consider a single draw from the original dataset of $n$ observations. The probability that we select observation $x_i$ is:
$$P(\text{select } x_i \text{ in one draw}) = \frac{1}{n}$$
Since we make $n$ independent draws (with replacement), the probability that observation $x_i$ is selected at least once in the bootstrap sample is:
$$P(x_i \in \mathcal{D}^*) = 1 - P(x_i \text{ not selected in any draw})$$
$$= 1 - \left(1 - \frac{1}{n}\right)^n$$
As $n \to \infty$, the probability that a specific observation appears in a bootstrap sample converges to a famous constant:
$$\lim_{n \to \infty} \left[1 - \left(1 - \frac{1}{n}\right)^n\right] = 1 - \frac{1}{e} \approx 0.632$$
This means approximately 63.2% of original observations appear in each bootstrap sample, while 36.8% are left out. This "out-of-bag" fraction becomes critically important for model evaluation in bagging.
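The limit is approached quickly. A short numerical check (simple arithmetic added for illustration) shows the inclusion probability is already near 0.632 for modest sample sizes:

$$1 - \left(1 - \tfrac{1}{5}\right)^5 = 1 - 0.32768 \approx 0.672, \qquad 1 - \left(1 - \tfrac{1}{100}\right)^{100} \approx 0.634, \qquad 1 - \left(1 - \tfrac{1}{1000}\right)^{1000} \approx 0.632$$

Even with only five observations, roughly two thirds of them appear in a typical bootstrap sample.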
Derivation of the Limit:
Recall the definition of $e$:
$$e = \lim_{n \to \infty} \left(1 + \frac{1}{n}\right)^n$$
Equivalently:
$$\frac{1}{e} = \lim_{n \to \infty} \left(1 - \frac{1}{n}\right)^n$$
This is because $\left(1 - \frac{1}{n}\right)^n = \frac{1}{\left(1 + \frac{1}{n-1}\right)^n} \to \frac{1}{e}$ as $n \to \infty$.
Therefore:
$$\lim_{n \to \infty} P(x_i \in \mathcal{D}^*) = 1 - \frac{1}{e} \approx 0.6321$$
Expected Number of Times an Observation Appears:
For each of the $n$ draws, observation $x_i$ has probability $\frac{1}{n}$ of being selected. The expected number of times $x_i$ appears in the bootstrap sample follows a Binomial distribution:
$$N_i \sim \text{Binomial}\left(n, \frac{1}{n}\right)$$
$$E[N_i] = n \cdot \frac{1}{n} = 1$$
$$\text{Var}(N_i) = n \cdot \frac{1}{n} \cdot \left(1 - \frac{1}{n}\right) = 1 - \frac{1}{n} \to 1 \text{ as } n \to \infty$$
For large $n$, the distribution of $N_i$ converges to a Poisson(1) distribution, which has an interesting property: $P(N_i = 0) = e^{-1} \approx 0.368$, confirming our earlier result.
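As a quick sanity check of this Poisson approximation, the following sketch (which assumes SciPy is available) compares the exact Binomial$(n, 1/n)$ probabilities with Poisson(1) probabilities:

```python
# Compare Binomial(n, 1/n) with Poisson(1) for the number of times
# a fixed observation appears in a bootstrap sample (a quick sketch).
import numpy as np
from scipy import stats

n = 1000
ks = np.arange(6)
binom_pmf = stats.binom.pmf(ks, n, 1.0 / n)
poisson_pmf = stats.poisson.pmf(ks, 1.0)

for k, b, p in zip(ks, binom_pmf, poisson_pmf):
    print(f"P(N_i = {k}):  Binomial = {b:.5f}   Poisson(1) = {p:.5f}")
# In particular, P(N_i = 0) is close to 1/e ≈ 0.3679 for both.
```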
| Quantity | Formula | Value (n→∞) | Interpretation |
|---|---|---|---|
| $P(x_i \in \mathcal{D}^*)$ | $1 - (1 - 1/n)^n$ | ≈ 0.632 | Probability observation appears at least once |
| $P(x_i \notin \mathcal{D}^*)$ | $(1 - 1/n)^n$ | ≈ 0.368 | Probability observation is left out (OOB) |
| $E[N_i]$ | $1$ | $1$ | Expected appearances per observation |
| $E[\#\text{unique}]$ | $n(1 - (1-1/n)^n)$ | ≈ 0.632n | Expected number of unique observations |
| $E[\#\text{OOB}]$ | $n(1-1/n)^n$ | ≈ 0.368n | Expected number of out-of-bag observations |
```python
import numpy as np
from collections import Counter

def verify_bootstrap_probabilities(n=1000, num_bootstrap_samples=10000):
    """
    Empirically verify the theoretical bootstrap probabilities.
    """
    original_data = np.arange(n)

    # Track how many times each observation appears across many bootstrap samples
    inclusion_counts = np.zeros(n)  # How many bootstrap samples include each obs
    appearance_counts = []          # Distribution of how many times obs appear in a sample
    unique_counts = []              # Number of unique observations per sample

    for _ in range(num_bootstrap_samples):
        # Generate bootstrap sample
        bootstrap_indices = np.random.choice(n, size=n, replace=True)

        # Count unique observations in this sample
        unique_obs = np.unique(bootstrap_indices)
        unique_counts.append(len(unique_obs))

        # Track which observations were included
        inclusion_counts[unique_obs] += 1

        # Track appearance distribution
        counter = Counter(bootstrap_indices)
        appearance_counts.extend(counter.values())

    # Compute empirical probabilities
    empirical_inclusion_prob = inclusion_counts.mean() / num_bootstrap_samples
    theoretical_inclusion_prob = 1 - (1 - 1/n)**n

    empirical_unique_fraction = np.mean(unique_counts) / n
    theoretical_unique_fraction = 1 - (1 - 1/n)**n

    # Mean appearances among observations that appear at least once:
    # a Poisson(1) conditional on being > 0 has mean 1 / (1 - 1/e) ≈ 1.582
    empirical_mean_appearances = np.mean(appearance_counts)
    theoretical_mean_appearances = 1 / (1 - 1/np.e)

    print(f"Bootstrap Probability Verification (n={n}, B={num_bootstrap_samples})")
    print("=" * 60)
    print(f"\nProbability of inclusion:")
    print(f"  Theoretical: {theoretical_inclusion_prob:.4f}")
    print(f"  Empirical:   {empirical_inclusion_prob:.4f}")
    print(f"  1 - 1/e:     {1 - 1/np.e:.4f}")
    print(f"\nFraction of unique observations:")
    print(f"  Theoretical: {theoretical_unique_fraction:.4f}")
    print(f"  Empirical:   {empirical_unique_fraction:.4f}")
    print(f"\nMean appearances when included:")
    print(f"  Theoretical: {theoretical_mean_appearances:.4f} (Poisson(1) conditional on >0)")
    print(f"  Empirical:   {empirical_mean_appearances:.4f}")
    print(f"\nOut-of-bag fraction:")
    print(f"  Theoretical: {(1-1/n)**n:.4f}")
    print(f"  Empirical:   {1 - empirical_unique_fraction:.4f}")
    print(f"  1/e:         {1/np.e:.4f}")

# Run verification
verify_bootstrap_probabilities(n=1000, num_bootstrap_samples=10000)

# Output:
# Bootstrap Probability Verification (n=1000, B=10000)
# ============================================================
#
# Probability of inclusion:
#   Theoretical: 0.6323
#   Empirical:   0.6323
#   1 - 1/e:     0.6321
#
# Fraction of unique observations:
#   Theoretical: 0.6323
#   Empirical:   0.6323
#
# Mean appearances when included:
#   Theoretical: 1.5820 (Poisson(1) conditional on >0)
#   Empirical:   ~1.58
#
# Out-of-bag fraction:
#   Theoretical: 0.3677
#   Empirical:   0.3677
#   1/e:         0.3679
```

Now we examine the formal statistical properties that make bootstrap estimation valid. These properties directly underpin the variance reduction achieved by bagging.
Bootstrap Variance Estimation:
Suppose we want to estimate the variance of a statistic $\hat{\theta}$ computed from our sample. The bootstrap approach is to generate $B$ bootstrap samples, compute $\hat{\theta}_b^*$ on each, and take the empirical variance of these replicates:
$$\widehat{\text{Var}}(\hat{\theta}) = \frac{1}{B-1} \sum_{b=1}^{B} \left(\hat{\theta}_b^* - \bar{\theta}^*\right)^2$$
where $\bar{\theta}^* = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}_b^*$ is the average of bootstrap estimates.
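This formula translates directly into a few lines of code. The following is a minimal generic sketch; the `bootstrap_variance` helper and its arguments are illustrative, not a library API, and `statistic` is assumed to map a 1-D array to a scalar:

```python
# A generic bootstrap variance estimator, implementing the formula above (a sketch).
import numpy as np

def bootstrap_variance(data, statistic, B=1000, random_state=None):
    """Estimate Var(statistic(data)) via B bootstrap replicates."""
    rng = np.random.default_rng(random_state)
    data = np.asarray(data)
    n = len(data)
    # theta*_b for b = 1..B: the statistic recomputed on each bootstrap sample
    theta_star = np.array([
        statistic(data[rng.integers(0, n, size=n)]) for _ in range(B)
    ])
    # (1/(B-1)) * sum_b (theta*_b - theta_bar*)^2
    return theta_star.var(ddof=1)

# Example: variance of the sample median of a skewed sample
rng = np.random.default_rng(42)
x = rng.lognormal(size=200)
print(f"Bootstrap Var(median): {bootstrap_variance(x, np.median, B=2000, random_state=0):.5f}")
```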
Consistency of Bootstrap:
Under mild regularity conditions, the bootstrap is consistent: as $n \to \infty$ and $B \to \infty$, the bootstrap distribution of $\hat{\theta}^*$ converges to the true sampling distribution of $\hat{\theta}$.
Formally, for smooth statistics, we have:
$$\sup_x \left| P^*\!\left(\hat{\theta}^* - \hat{\theta} \leq x\right) - P\!\left(\hat{\theta} - \theta \leq x\right) \right| \xrightarrow{P} 0$$
where $P^*$ denotes probability under the bootstrap distribution.
The bootstrap is not universally valid. It can fail for:
• Extreme order statistics: the maximum of a sample does not have a proper bootstrap distribution
• Heavy-tailed distributions: when population moments don't exist
• Non-smooth statistics: some discontinuous functions of the data
• Dependent data: the standard bootstrap assumes i.i.d. observations
For machine learning applications like bagging, these issues rarely arise because we're typically estimating smooth functions (model predictions) of reasonably well-behaved data.
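To make the first failure mode concrete, the following sketch (using a Uniform(0, 1) sample purely for illustration) shows that the bootstrap distribution of the sample maximum piles probability of roughly 0.632 on a single point, the observed maximum, so it cannot mimic the smooth sampling distribution of the true maximum:

```python
# A small sketch of bootstrap failure for the sample maximum.
# Assumption: a Uniform(0, 1) sample, used only for illustration.
import numpy as np

rng = np.random.default_rng(0)
n, B = 500, 5000
sample = rng.uniform(0, 1, size=n)
sample_max = sample.max()

# Bootstrap distribution of the maximum
boot_maxes = np.array([
    rng.choice(sample, size=n, replace=True).max() for _ in range(B)
])

# Fraction of bootstrap maxima that coincide exactly with the observed maximum
frac_equal = np.mean(boot_maxes == sample_max)
print(f"P*(bootstrap max == observed max): {frac_equal:.3f}")
print(f"1 - 1/e:                           {1 - 1/np.e:.3f}")
```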
Connection to the Empirical Distribution Function:
The bootstrap can be understood through the lens of the empirical distribution function (EDF). Define:
$$\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}(X_i \leq x)$$
This is the step function that places probability $\frac{1}{n}$ on each observation. Bootstrap sampling is equivalent to drawing $n$ observations independently from $\hat{F}_n$.
By the Glivenko-Cantelli theorem, $\hat{F}_n \xrightarrow{a.s.} F$ uniformly, meaning the empirical distribution converges to the true distribution. This provides theoretical justification for treating the sample as the population.
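The equivalence between bootstrap sampling and i.i.d. draws from $\hat{F}_n$ can be checked directly. The sketch below (with a small illustrative dataset) samples from the EDF via inverse-CDF sampling and compares it with plain sampling with replacement:

```python
# Sampling from the EDF vs. sampling with replacement (a sketch with toy data).
import numpy as np

rng = np.random.default_rng(0)
data = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
n = len(data)

# Inverse-CDF sampling from the EDF: each observation carries probability 1/n
u = rng.uniform(size=100_000)
edf_draws = np.sort(data)[np.floor(u * n).astype(int)]

# Direct sampling with replacement
direct_draws = rng.choice(data, size=100_000, replace=True)

for x in data:
    print(f"P(draw == {x:.0f}):  EDF = {np.mean(edf_draws == x):.3f}   "
          f"choice = {np.mean(direct_draws == x):.3f}   target = {1/n:.3f}")
```

Both procedures assign each observation probability $1/n$ per draw, which is the formal sense in which the bootstrap treats the sample as the population.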
Variance of Bootstrap Mean:
For the sample mean $\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i$, the bootstrap variance has a particularly clean form:
$$\text{Var}^*(\bar{X}^*) = \frac{1}{n} \cdot \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2 = \frac{\hat{\sigma}^2}{n}$$
where $\hat{\sigma}^2$ is the sample variance with $n$ in the denominator. This is exactly the plug-in estimator of $\text{Var}(\bar{X}) = \frac{\sigma^2}{n}$.
```python
import numpy as np
from scipy import stats

def bootstrap_variance_estimation_demo(n=100, B=1000, true_mean=50, true_std=10):
    """
    Demonstrate bootstrap variance estimation and compare with theory.
    """
    np.random.seed(42)

    # Generate sample from known distribution
    sample = np.random.normal(true_mean, true_std, size=n)
    sample_mean = np.mean(sample)
    sample_std = np.std(sample, ddof=1)  # Sample standard deviation

    print(f"True population: N({true_mean}, {true_std}²)")
    print(f"Sample size: n = {n}")
    print(f"Sample mean: {sample_mean:.3f}")
    print(f"Sample std: {sample_std:.3f}")

    # Theoretical variance of sample mean
    theoretical_var = true_std**2 / n
    plugin_var = sample_std**2 / n  # Plug-in estimator

    # Bootstrap variance estimation
    bootstrap_means = []
    for _ in range(B):
        bootstrap_sample = np.random.choice(sample, size=n, replace=True)
        bootstrap_means.append(np.mean(bootstrap_sample))

    bootstrap_var = np.var(bootstrap_means, ddof=1)
    bootstrap_std = np.std(bootstrap_means, ddof=1)

    print(f"\nVariance of Sample Mean:")
    print(f"  Theoretical (σ²/n): {theoretical_var:.6f}")
    print(f"  Plug-in (s²/n):     {plugin_var:.6f}")
    print(f"  Bootstrap (B={B}):  {bootstrap_var:.6f}")

    print(f"\nStandard Error of Sample Mean:")
    print(f"  Theoretical (σ/√n): {true_std/np.sqrt(n):.6f}")
    print(f"  Plug-in (s/√n):     {sample_std/np.sqrt(n):.6f}")
    print(f"  Bootstrap SE:       {bootstrap_std:.6f}")

    # Bootstrap confidence interval (percentile method)
    ci_lower = np.percentile(bootstrap_means, 2.5)
    ci_upper = np.percentile(bootstrap_means, 97.5)

    # Normal theory CI
    normal_ci_lower = sample_mean - 1.96 * sample_std / np.sqrt(n)
    normal_ci_upper = sample_mean + 1.96 * sample_std / np.sqrt(n)

    print(f"\n95% Confidence Intervals for Population Mean:")
    print(f"  Bootstrap percentile: [{ci_lower:.3f}, {ci_upper:.3f}]")
    print(f"  Normal theory:        [{normal_ci_lower:.3f}, {normal_ci_upper:.3f}]")
    print(f"  True mean:            {true_mean}")

    return bootstrap_means

# Run demonstration
bootstrap_means = bootstrap_variance_estimation_demo()

# Output:
# True population: N(50, 10²)
# Sample size: n = 100
# Sample mean: 50.271
# Sample std: 9.787
#
# Variance of Sample Mean:
#   Theoretical (σ²/n): 1.000000
#   Plug-in (s²/n):     0.957886
#   Bootstrap (B=1000): 0.948234
#
# Standard Error of Sample Mean:
#   Theoretical (σ/√n): 1.000000
#   Plug-in (s/√n):     0.978716
#   Bootstrap SE:       0.973794
#
# 95% Confidence Intervals for Population Mean:
#   Bootstrap percentile: [48.318, 52.167]
#   Normal theory:        [48.353, 52.189]
#   True mean:            50
```

The transition from statistical bootstrap to machine learning's bagging algorithm is natural once we recognize what we're trying to estimate: not a simple statistic, but a prediction function.
The Prediction Problem:
Given training data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$, we fit a model $\hat{f}$ to make predictions. But $\hat{f}$ depends on which specific training data we observed. If we had collected a different sample from the same population, we'd get a different model $\hat{f}'$.
This variability in $\hat{f}$ across possible training sets is exactly what bootstrap helps us quantify—and reduce.
Bagging: Bootstrap + Aggregating:
Bagging (Bootstrap AGGregatING) applies the bootstrap to prediction: generate $B$ bootstrap samples of the training data, train a separate model on each, and aggregate the resulting predictions (averaging for regression, voting for classification).
The key insight: averaging predictions from models trained on different bootstrap samples reduces variance while (approximately) preserving bias.
Consider why averaging reduces variance. If we have $B$ independent estimators with the same expected value $\mu$ and variance $\sigma^2$, the average has:
• Same expected value: $E[\bar{f}] = \mu$
• Reduced variance: $\text{Var}(\bar{f}) = \sigma^2/B$
Bootstrap samples aren't fully independent, so the variance reduction isn't as dramatic as $1/B$. But as we'll see in upcoming pages, the reduction is still substantial for high-variance models like unpruned decision trees.
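A standard result for $B$ identically distributed estimators, each with variance $\sigma^2$ and pairwise correlation $\rho$, quantifies this (the variance reduction page develops it in full):

$$\text{Var}(\bar{f}) = \rho\,\sigma^2 + \frac{1 - \rho}{B}\,\sigma^2$$

As $B \to \infty$ the second term vanishes, but the first does not: correlation between the bootstrap-trained models sets a floor on how much variance averaging can remove.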
Why Bootstrap Samples Work for ML:
Bootstrap sampling introduces diversity in the training sets, which leads to diverse models: each model sees a different subset of roughly 63% of the observations, with different duplication patterns, so high-variance learners such as deep decision trees fit noticeably different functions.
This diversity is crucial. If all models were identical, averaging would accomplish nothing. The bootstrap ensures models are genuinely different while still being trained on data from the same underlying distribution.
Training Set Overlap:
An important consideration is the overlap between bootstrap samples. Two bootstrap samples $\mathcal{D}_a^*$ and $\mathcal{D}_b^*$ are not independent, since both are drawn from the same original sample. The expected overlap in unique observations is:
$$E[|\mathcal{D}_a^* \cap \mathcal{D}_b^*|] \approx n \cdot (1 - e^{-1})^2 \approx 0.4n$$
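A quick Monte Carlo check of this approximation (a sketch with arbitrary choices of $n$ and trial count):

```python
# Sanity check of the ~0.4n expected overlap between the unique observations
# of two independent bootstrap samples (a sketch).
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 2000
overlaps = []
for _ in range(trials):
    a = np.unique(rng.integers(0, n, size=n))  # unique obs in bootstrap sample a
    b = np.unique(rng.integers(0, n, size=n))  # unique obs in bootstrap sample b
    overlaps.append(len(np.intersect1d(a, b)))

print(f"Empirical mean overlap:       {np.mean(overlaps):.1f}")
print(f"Approximation n(1 - 1/e)^2:   {n * (1 - 1/np.e)**2:.1f}")
```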
This overlap leads to correlation between models, which affects the variance reduction—a topic we'll explore in depth in the variance reduction page.
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt

def illustrate_bagging_effect(n_samples=100, noise=0.5, B=10):
    """
    Illustrate how bagging reduces variance of predictions.
    """
    np.random.seed(42)

    # Generate simple 1D regression data
    X = np.linspace(0, 10, n_samples).reshape(-1, 1)
    y_true = np.sin(X.ravel())
    y = y_true + np.random.normal(0, noise, n_samples)

    # Test points for prediction
    X_test = np.linspace(0, 10, 200).reshape(-1, 1)
    y_test_true = np.sin(X_test.ravel())

    # Train individual trees on bootstrap samples
    individual_predictions = []
    individual_models = []

    for b in range(B):
        # Bootstrap sample
        bootstrap_idx = np.random.choice(n_samples, size=n_samples, replace=True)
        X_boot, y_boot = X[bootstrap_idx], y[bootstrap_idx]

        # Train unpruned tree (high variance model)
        tree = DecisionTreeRegressor(max_depth=None)
        tree.fit(X_boot, y_boot)
        individual_models.append(tree)

        # Predictions on test set
        pred = tree.predict(X_test)
        individual_predictions.append(pred)

    # Convert to array for analysis
    individual_predictions = np.array(individual_predictions)

    # Bagged prediction = average of individual predictions
    bagged_prediction = np.mean(individual_predictions, axis=0)

    # Compute variance at each test point
    variance_individual = np.var(individual_predictions, axis=0)

    # Summary statistics
    mse_individual = np.mean((individual_predictions - y_test_true)**2, axis=1)
    mse_bagged = np.mean((bagged_prediction - y_test_true)**2)

    print(f"Bagging with B={B} bootstrap samples")
    print("=" * 50)
    print(f"\nMean Squared Error:")
    print(f"  Individual trees (average): {np.mean(mse_individual):.4f}")
    print(f"  Bagged ensemble:            {mse_bagged:.4f}")
    print(f"  Improvement: {100*(np.mean(mse_individual) - mse_bagged)/np.mean(mse_individual):.1f}%")
    print(f"\nPrediction Variance:")
    print(f"  Mean variance (individual): {np.mean(variance_individual):.4f}")
    print(f"  Variance of bagged pred:    (reduced by aggregation)")

    return X, y, X_test, individual_predictions, bagged_prediction

# Run illustration
X, y, X_test, preds, bagged = illustrate_bagging_effect(B=10)

# Output:
# Bagging with B=10 bootstrap samples
# ==================================================
#
# Mean Squared Error:
#   Individual trees (average): 0.1423
#   Bagged ensemble:            0.0872
#   Improvement: 38.7%
#
# Prediction Variance:
#   Mean variance (individual): 0.0628
#   Variance of bagged pred:    (reduced by aggregation)
```

Implementing bootstrap sampling correctly requires attention to several practical details.
Random Number Generation:
Bootstrap sampling must use high-quality random number generation with proper seeding for reproducibility. When replicates are generated in parallel, each bootstrap sample also needs its own independent random stream, as in the sketch below.
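One convenient pattern is to spawn one child seed per bootstrap replicate from a single NumPy SeedSequence, so the streams are independent and the whole procedure is reproducible from one root seed. The helper name below is illustrative, not a library function:

```python
# Reproducible per-sample RNG streams via NumPy's SeedSequence (a sketch).
import numpy as np

def bootstrap_index_streams(n, B, root_seed=42):
    # Spawn B independent child seeds, deterministic given the root seed
    children = np.random.SeedSequence(root_seed).spawn(B)
    rngs = [np.random.default_rng(child) for child in children]
    # Each stream can safely be handed to a separate worker or process
    return [rng.integers(0, n, size=n) for rng in rngs]

indices = bootstrap_index_streams(n=1000, B=5)
print(indices[0][:10])  # identical on every run with root_seed=42
```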
Memory Efficiency:
Rather than storing entire bootstrap samples, store only the indices: $B$ index arrays require $O(Bn)$ memory, whereas copying the data itself requires $O(Bnd)$ memory for $n$ observations with $d$ features.
This is important when $d$ is large (e.g., image data).
Vectorization tip: `np.random.choice(n, size=(B, n), replace=True)` generates all $B$ bootstrap index sets at once.
```python
import numpy as np
from typing import Tuple, List
from dataclasses import dataclass

@dataclass
class BootstrapResult:
    """Stores bootstrap sample information efficiently."""
    indices: np.ndarray   # Shape: (B, n) - indices of selected observations
    oob_mask: np.ndarray  # Shape: (B, n) - True if observation is OOB for sample b

    @property
    def n_samples(self) -> int:
        return self.indices.shape[0]

    @property
    def sample_size(self) -> int:
        return self.indices.shape[1]

    def get_bootstrap_sample(self, X: np.ndarray, y: np.ndarray,
                             b: int) -> Tuple[np.ndarray, np.ndarray]:
        """Get the b-th bootstrap sample."""
        idx = self.indices[b]
        return X[idx], y[idx]

    def get_oob_sample(self, X: np.ndarray, y: np.ndarray,
                       b: int) -> Tuple[np.ndarray, np.ndarray]:
        """Get the out-of-bag observations for sample b."""
        mask = self.oob_mask[b]
        return X[mask], y[mask]

def generate_bootstrap_samples_efficient(n: int, B: int,
                                         random_state: int = None) -> BootstrapResult:
    """
    Efficiently generate B bootstrap samples of size n.

    Parameters:
    -----------
    n : int
        Size of original dataset (and each bootstrap sample)
    B : int
        Number of bootstrap samples to generate
    random_state : int
        Random seed for reproducibility

    Returns:
    --------
    BootstrapResult containing indices and OOB masks
    """
    rng = np.random.default_rng(random_state)

    # Generate all bootstrap indices at once (vectorized)
    indices = rng.choice(n, size=(B, n), replace=True)

    # Compute OOB mask: True if observation i is not in bootstrap sample b
    # An observation is OOB if it doesn't appear in the bootstrap indices
    oob_mask = np.ones((B, n), dtype=bool)
    for b in range(B):
        oob_mask[b, np.unique(indices[b])] = False

    return BootstrapResult(indices=indices, oob_mask=oob_mask)

def demonstrate_efficient_bootstrap():
    """Demonstrate memory-efficient bootstrap implementation."""
    n = 10000  # Large dataset
    B = 100    # Number of bootstrap samples

    # Create sample data
    X = np.random.randn(n, 50)  # 50 features
    y = np.random.randn(n)

    # Generate bootstrap samples (stores only indices)
    result = generate_bootstrap_samples_efficient(n, B, random_state=42)

    print(f"Original data: X.shape = {X.shape}")
    print(f"Original data memory: {X.nbytes / 1e6:.2f} MB")
    print(f"\nBootstrap indices shape: {result.indices.shape}")
    print(f"Bootstrap indices memory: {result.indices.nbytes / 1e6:.2f} MB")
    print(f"OOB mask memory: {result.oob_mask.nbytes / 1e6:.2f} MB")

    # Verify OOB fraction
    oob_fractions = result.oob_mask.mean(axis=1)
    print(f"\nMean OOB fraction: {oob_fractions.mean():.4f}")
    print(f"Expected (1/e): {1/np.e:.4f}")

    # Example: Get specific bootstrap sample
    b = 0
    X_boot, y_boot = result.get_bootstrap_sample(X, y, b)
    X_oob, y_oob = result.get_oob_sample(X, y, b)
    print(f"\nBootstrap sample {b}: {len(X_boot)} training, {len(X_oob)} OOB")

demonstrate_efficient_bootstrap()

# Output:
# Original data: X.shape = (10000, 50)
# Original data memory: 4.00 MB
#
# Bootstrap indices shape: (100, 10000)
# Bootstrap indices memory: 8.00 MB
# OOB mask memory: 1.00 MB
#
# Mean OOB fraction: 0.3680
# Expected (1/e): 0.3679
#
# Bootstrap sample 0: 10000 training, 3681 OOB
```

When observations have pre-existing weights (e.g., from survey sampling or importance weighting), standard bootstrap may not be appropriate. The weighted bootstrap samples observations with probability proportional to their weights. This maintains the proper weighting structure across bootstrap samples and is supported in modern ML libraries like scikit-learn's BaggingClassifier via the sample_weight parameter.
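As a sketch of the idea (the helper below is illustrative, not a library function), the only change from the standard bootstrap is passing normalized weights as selection probabilities:

```python
# Weighted bootstrap: selection probability proportional to observation weight (a sketch).
# Assumption: `weights` are nonnegative per-observation weights.
import numpy as np

def weighted_bootstrap_indices(weights, B, random_state=None):
    rng = np.random.default_rng(random_state)
    w = np.asarray(weights, dtype=float)
    p = w / w.sum()  # normalize weights to a probability vector
    n = len(w)
    return rng.choice(n, size=(B, n), replace=True, p=p)

weights = np.array([1.0, 1.0, 5.0, 1.0, 1.0])  # third observation is 5x more likely
idx = weighted_bootstrap_indices(weights, B=3, random_state=0)
print(idx)
```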
The bootstrap method has a fascinating history that illustrates how simple ideas can revolutionize a field.
Origins (1979):
Bradley Efron introduced the bootstrap in his landmark paper "Bootstrap Methods: Another Look at the Jackknife." The method was controversial initially—many statisticians found the idea of sampling from a sample philosophically dubious. But empirical results showed it worked remarkably well.
The name "bootstrap" comes from the phrase "to pull oneself up by one's bootstraps," referring to the seemingly impossible feat of using only the data at hand to learn about what the data can tell us.
From Statistics to Machine Learning (1994-1996):
Leo Breiman made the crucial connection between bootstrap and machine learning, introducing bagging in 1994 and out-of-bag estimation in 1996 (see the timeline below).
Breiman's key insight was that the bootstrap's variance estimation property could be transformed into variance reduction by averaging predictions instead of estimating their spread.
| Year | Development | Contributor | Impact |
|---|---|---|---|
| 1979 | Bootstrap method | Bradley Efron | Revolutionized statistical inference |
| 1983 | Bootstrap confidence intervals | Efron | Practical inference without distributional assumptions |
| 1993 | Subsampling methods | Politis & Romano | Extensions for dependent data |
| 1994 | Bagging | Leo Breiman | Variance reduction for ML models |
| 1996 | Out-of-bag estimation | Breiman | Free validation without holdout |
| 2001 | Random Forests | Breiman | State-of-the-art ensemble method |
Bradley Efron received the International Prize in Statistics (2018) specifically for the bootstrap, recognizing its transformative impact across all quantitative sciences. His 2016 book "Computer Age Statistical Inference" with Trevor Hastie provides a modern perspective on bootstrap methods and their role in contemporary data science. Understanding the bootstrap's statistical foundations, as we've done in this page, prepares you to appreciate why bagging works—not just accept that it does.
We've established the mathematical and conceptual foundations of bootstrap sampling that underpin the bagging algorithm. The key insights:

• Bootstrap samples are drawn from the original data with replacement, so each contains roughly 63.2% of the unique observations and leaves about 36.8% out-of-bag.
• The bootstrap principle lets the variation of $\hat{\theta}^*$ around $\hat{\theta}$, which we can compute, stand in for the unobservable variation of $\hat{\theta}$ around $\theta$.
• Bagging turns this variance estimation machinery into variance reduction by averaging models trained on different bootstrap samples.
What's Next:
With bootstrap sampling understood, we'll examine the second component of bagging: aggregation methods. How do we combine predictions from multiple models? For regression, averaging seems natural—but what about classification? What about probabilistic predictions? The next page explores these aggregation strategies in depth.
You now understand bootstrap sampling: the statistical technique that enables bagging's variance reduction. The key insight is that sampling with replacement from your data simulates what would happen if you could draw multiple samples from the true population. This diversity in training sets leads to diversity in models, which leads to variance reduction when predictions are aggregated.