Confidence intervals represent one of the most fundamental tools in statistical inference—they quantify our uncertainty about unknown quantities. Traditionally, constructing confidence intervals required knowing or assuming the sampling distribution of a statistic, often limiting us to simple cases like the sample mean.
The bootstrap liberates us from this constraint. By simulating the sampling distribution through resampling, we can construct confidence intervals for any statistic: medians, correlation coefficients, regression parameters, model accuracy, AUC, and beyond. This universality is the bootstrap's greatest practical contribution.
However, not all bootstrap confidence interval methods are equal. In this page, we'll explore the major approaches—from the simple percentile method to the sophisticated BCa (bias-corrected and accelerated) intervals—understanding when each excels and when each fails.
By the end of this page, you will understand: (1) The percentile method—simple but limited; (2) The basic bootstrap interval and its relationship to pivotal statistics; (3) The BCa (bias-corrected and accelerated) method—the gold standard; (4) Bootstrap-t intervals for variance stabilization; (5) Coverage probability and interval calibration; (6) Practical guidelines for choosing the right method.
The percentile method is the simplest and most intuitive approach to bootstrap confidence intervals. It directly uses quantiles of the bootstrap distribution.
Algorithm: draw $B$ bootstrap samples from the data, compute the statistic $\hat{\theta}^*_b$ on each, and take the $\alpha/2$ and $1-\alpha/2$ quantiles of the resulting bootstrap statistics:
$$CI_{\text{percentile}} = \left[\hat{\theta}^*_{(\alpha/2)}, \hat{\theta}^*_{(1-\alpha/2)}\right]$$
For a 95% CI ($\alpha = 0.05$), this means taking the 2.5th and 97.5th percentiles.
Intuition: If the bootstrap distribution accurately represents the sampling distribution of $\hat{\theta}$, then the central 95% of the bootstrap distribution should capture the true $\theta$ 95% of the time.
```python
import numpy as np
from typing import Callable, Tuple


def percentile_interval(data: np.ndarray,
                        statistic_fn: Callable[[np.ndarray], float],
                        confidence_level: float = 0.95,
                        n_bootstrap: int = 10000,
                        random_state: int = 42) -> Tuple[float, float, np.ndarray]:
    """
    Compute bootstrap percentile confidence interval.

    The simplest bootstrap CI: take quantiles of the bootstrap distribution.

    Parameters
    ----------
    data : np.ndarray
        Original sample
    statistic_fn : Callable
        Function computing statistic from sample
    confidence_level : float
        Confidence level (default 0.95)
    n_bootstrap : int
        Number of bootstrap samples
    random_state : int
        Random seed

    Returns
    -------
    Tuple[float, float, np.ndarray]
        (lower_bound, upper_bound, bootstrap_statistics)
    """
    rng = np.random.RandomState(random_state)
    n = len(data)

    # Generate bootstrap statistics
    bootstrap_stats = np.zeros(n_bootstrap)
    for b in range(n_bootstrap):
        sample = data[rng.choice(n, size=n, replace=True)]
        bootstrap_stats[b] = statistic_fn(sample)

    # Compute percentile interval
    alpha = 1 - confidence_level
    lower = np.percentile(bootstrap_stats, 100 * alpha / 2)
    upper = np.percentile(bootstrap_stats, 100 * (1 - alpha / 2))

    return lower, upper, bootstrap_stats


def demonstrate_percentile_interval():
    """
    Demonstrate the percentile method for the sample median.

    The median is a great example because its sampling distribution is
    asymptotically normal but harder to derive analytically than the mean.
    """
    # Generate sample
    np.random.seed(42)
    data = np.random.exponential(scale=2.0, size=100)  # Skewed distribution

    # Point estimate
    sample_median = np.median(data)

    # Percentile interval for median
    lower, upper, boot_stats = percentile_interval(
        data, np.median, confidence_level=0.95, n_bootstrap=10000
    )

    print("Percentile Bootstrap CI for the Median")
    print("=" * 50)
    print(f"Sample median: {sample_median:.4f}")
    print(f"95% CI: [{lower:.4f}, {upper:.4f}]")
    print(f"CI width: {upper - lower:.4f}")
    print(f"Bootstrap SD: {np.std(boot_stats):.4f}")
    print(f"Bootstrap mean: {np.mean(boot_stats):.4f}")
    print(f"Bias (mean - point): {np.mean(boot_stats) - sample_median:.4f}")

    return lower, upper, sample_median, boot_stats


if __name__ == "__main__":
    demonstrate_percentile_interval()
```

The percentile method works well when the bootstrap distribution is symmetric and unbiased. However, it can have poor coverage (actual coverage differs from nominal 95%) when: (1) The statistic is biased; (2) The bootstrap distribution is skewed; (3) The standard error depends on the parameter value. These limitations motivate the more sophisticated methods we'll study next.
The basic bootstrap interval (also called the reverse percentile or pivotal interval) takes a different approach motivated by the concept of pivotal quantities.
Key Insight:
The bootstrap approximates the distribution of $\hat{\theta} - \theta$ (the error in our estimate). If we knew this distribution, we could construct an interval by 'inverting' it.
Let $\delta_\alpha = \hat{\theta}^*_{(\alpha)} - \hat{\theta}$ be the $\alpha$-quantile of the bootstrap errors $\hat{\theta}^* - \hat{\theta}$.
Using the bootstrap errors as a proxy for the true errors $\hat{\theta} - \theta$ and solving the resulting inequalities for $\theta$ gives: $$\theta \in \left[\hat{\theta} - \delta_{1-\alpha/2}, \hat{\theta} - \delta_{\alpha/2}\right]$$
The Formula:
Substituting, the basic bootstrap interval is: $$CI_{\text{basic}} = \left[2\hat{\theta} - \hat{\theta}^*_{(1-\alpha/2)}, 2\hat{\theta} - \hat{\theta}^*_{(\alpha/2)}\right]$$
Notice this 'reflects' the percentile interval around the point estimate $\hat{\theta}$.
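To see the reflection concretely, consider a small illustration with hypothetical numbers (not from the text): suppose $\hat{\theta} = 10$ and the percentile interval is $[\hat{\theta}^*_{(0.025)}, \hat{\theta}^*_{(0.975)}] = [9, 13]$. Then $$CI_{\text{basic}} = \left[2(10) - 13,\; 2(10) - 9\right] = [7, 11].$$ The bootstrap distribution extends farther above $\hat{\theta}$ than below it, so the basic interval extends farther below, consistent with the reasoning that the true $\theta$ probably lies below $\hat{\theta}$.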
```python
import numpy as np
from typing import Callable, Tuple, Dict


def basic_bootstrap_interval(data: np.ndarray,
                             statistic_fn: Callable[[np.ndarray], float],
                             confidence_level: float = 0.95,
                             n_bootstrap: int = 10000,
                             random_state: int = 42) -> Dict:
    """
    Compute the basic (reverse percentile) bootstrap interval.

    This method 'reflects' bootstrap quantiles around the point estimate.
    It's based on the distribution of (θ̂* - θ̂) approximating (θ̂ - θ).

    Formula: CI = [2θ̂ - θ̂*_{1-α/2}, 2θ̂ - θ̂*_{α/2}]

    Parameters
    ----------
    data : np.ndarray
        Original sample
    statistic_fn : Callable
        Function computing statistic
    confidence_level : float
        Confidence level
    n_bootstrap : int
        Number of bootstrap samples
    random_state : int
        Random seed

    Returns
    -------
    Dict
        Contains both basic and percentile intervals for comparison
    """
    rng = np.random.RandomState(random_state)
    n = len(data)

    # Point estimate
    theta_hat = statistic_fn(data)

    # Generate bootstrap statistics
    bootstrap_stats = np.zeros(n_bootstrap)
    for b in range(n_bootstrap):
        sample = data[rng.choice(n, size=n, replace=True)]
        bootstrap_stats[b] = statistic_fn(sample)

    alpha = 1 - confidence_level

    # Percentile interval (for comparison)
    q_lower = np.percentile(bootstrap_stats, 100 * alpha / 2)
    q_upper = np.percentile(bootstrap_stats, 100 * (1 - alpha / 2))

    # Basic (reverse) interval
    # CI = [2θ̂ - θ̂*_{1-α/2}, 2θ̂ - θ̂*_{α/2}]
    basic_lower = 2 * theta_hat - q_upper
    basic_upper = 2 * theta_hat - q_lower

    return {
        'point_estimate': theta_hat,
        'percentile': (q_lower, q_upper),
        'basic': (basic_lower, basic_upper),
        'bootstrap_mean': np.mean(bootstrap_stats),
        'bootstrap_sd': np.std(bootstrap_stats),
        'bias': np.mean(bootstrap_stats) - theta_hat,
    }


def compare_percentile_vs_basic():
    """
    Compare percentile and basic intervals on a skewed statistic.

    For asymmetric distributions, these methods give different intervals,
    and the difference reveals bias.
    """
    np.random.seed(42)

    # Exponential data - skewed, so the bootstrap distribution of the mean is asymmetric
    data = np.random.exponential(scale=2.0, size=50)
    true_mean = 2.0  # True population mean

    result = basic_bootstrap_interval(data, np.mean, n_bootstrap=10000)

    print("Comparison: Percentile vs Basic Bootstrap Interval")
    print("=" * 60)
    print(f"True parameter: {true_mean:.4f}")
    print(f"Point estimate: {result['point_estimate']:.4f}")
    print(f"Bootstrap bias: {result['bias']:.4f}")
    print()
    print(f"Percentile CI: [{result['percentile'][0]:.4f}, {result['percentile'][1]:.4f}]")
    print(f"Basic CI:      [{result['basic'][0]:.4f}, {result['basic'][1]:.4f}]")
    print()
    # When bias is near zero, these should be similar
    # The difference indicates the magnitude of correction
    print(f"Lower bound difference: {result['basic'][0] - result['percentile'][0]:.4f}")
    print(f"Upper bound difference: {result['basic'][1] - result['percentile'][1]:.4f}")

    return result


if __name__ == "__main__":
    compare_percentile_vs_basic()
```

When to Use Basic vs. Percentile:
| Scenario | Percentile | Basic |
|---|---|---|
| Symmetric, unbiased statistic | ✓ Similar | ✓ Similar |
| Biased statistic | Poor coverage | Better coverage |
| Skewed bootstrap distribution | May be wrong direction | May overcorrect |
| Transformation exists | Use transformed | Use transformed |
In practice, both methods have theoretical limitations that motivate the BCa method.
The basic interval effectively says: 'The bootstrap errors (θ̂* - θ̂) approximate the true errors (θ̂ - θ). So if the bootstrap says values above θ̂ are more likely, the true θ is probably below θ̂.' This reflection handles bias but not varying standard error.
The BCa (bias-corrected and accelerated) method is considered the gold standard for bootstrap confidence intervals. It addresses two key issues: bias in the bootstrap distribution (its median need not coincide with $\hat{\theta}$) and a standard error that changes with the parameter value (acceleration).
The BCa method adjusts the percentiles used for the interval based on two correction factors.
The BCa Interval:
$$CI_{\text{BCa}} = \left[\hat{\theta}^*_{(\alpha_1)}, \hat{\theta}^*_{(\alpha_2)}\right]$$
where the adjusted percentile levels are:
$$\alpha_1 = \Phi\left(\hat{z}_0 + \frac{\hat{z}_0 + z_{\alpha/2}}{1 - \hat{a}\left(\hat{z}_0 + z_{\alpha/2}\right)}\right)$$
$$\alpha_2 = \Phi\left(\hat{z}_0 + \frac{\hat{z}_0 + z_{1-\alpha/2}}{1 - \hat{a}\left(\hat{z}_0 + z_{1-\alpha/2}\right)}\right)$$
Here $\Phi$ is the standard normal CDF, $z_{\alpha/2}$ and $z_{1-\alpha/2}$ are the corresponding standard normal quantiles, $\hat{z}_0$ is the bias-correction factor, and $\hat{a}$ is the acceleration factor. The bias-correction factor $\hat{z}_0$ measures the median bias of the bootstrap distribution: how far the proportion of bootstrap statistics falling below the original estimate deviates from one half. The acceleration factor $\hat{a}$ measures how the standard error of the statistic changes with the parameter value, a form of 'curvature' in the problem.
Computing the Correction Factors:
Bias-correction factor $\hat{z}_0$: $$\hat{z}_0 = \Phi^{-1}\left(\frac{\#\{\hat{\theta}^*_b < \hat{\theta}\}}{B}\right)$$
This is the inverse-normal of the proportion of bootstrap statistics below the original estimate. If the bootstrap is unbiased, this proportion is 0.5 and $\hat{z}_0 = 0$.
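For example, with hypothetical numbers: if 4,000 of $B = 10{,}000$ bootstrap statistics fall below $\hat{\theta}$, then $\hat{z}_0 = \Phi^{-1}(0.40) \approx -0.25$. The bootstrap distribution sits above $\hat{\theta}$ more often than below it, and with $\hat{a} = 0$ the adjusted levels become $\alpha_1 = \Phi(2\hat{z}_0 + z_{\alpha/2}) \approx 0.007$ and $\alpha_2 = \Phi(2\hat{z}_0 + z_{1-\alpha/2}) \approx 0.93$, so both endpoints are pulled from lower percentiles of the bootstrap distribution to compensate.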
Acceleration factor $\hat{a}$:
The acceleration factor is estimated using the jackknife: $$\hat{a} = \frac{\sum_{i=1}^{n}\left(\hat{\theta}_{(\cdot)} - \hat{\theta}_{(i)}\right)^3}{6\left[\sum_{i=1}^{n}\left(\hat{\theta}_{(\cdot)} - \hat{\theta}_{(i)}\right)^2\right]^{3/2}}$$
where $\hat{\theta}_{(i)}$ is the statistic computed with the $i$-th observation removed and $\hat{\theta}_{(\cdot)} = \frac{1}{n}\sum_{i=1}^{n}\hat{\theta}_{(i)}$ is the mean of the jackknife values.
```python
import numpy as np
from scipy import stats
from typing import Callable, Tuple, Dict


def bca_interval(data: np.ndarray,
                 statistic_fn: Callable[[np.ndarray], float],
                 confidence_level: float = 0.95,
                 n_bootstrap: int = 10000,
                 random_state: int = 42) -> Dict:
    """
    Compute the BCa (bias-corrected and accelerated) bootstrap interval.

    The BCa method adjusts percentiles based on:
    - z0: bias correction (median bias of bootstrap distribution)
    - a: acceleration (how SE changes with parameter value)

    Parameters
    ----------
    data : np.ndarray
        Original sample
    statistic_fn : Callable
        Function computing statistic
    confidence_level : float
        Confidence level
    n_bootstrap : int
        Number of bootstrap samples
    random_state : int
        Random seed

    Returns
    -------
    Dict
        BCa interval, percentile interval (for comparison), and diagnostics
    """
    rng = np.random.RandomState(random_state)
    n = len(data)
    alpha = 1 - confidence_level

    # =========================================
    # Step 1: Point estimate
    # =========================================
    theta_hat = statistic_fn(data)

    # =========================================
    # Step 2: Bootstrap distribution
    # =========================================
    bootstrap_stats = np.zeros(n_bootstrap)
    for b in range(n_bootstrap):
        sample = data[rng.choice(n, size=n, replace=True)]
        bootstrap_stats[b] = statistic_fn(sample)

    # =========================================
    # Step 3: Bias-correction factor (z0)
    # =========================================
    # Proportion of bootstrap stats below original estimate
    proportion_below = np.mean(bootstrap_stats < theta_hat)

    # Handle edge cases
    if proportion_below == 0:
        proportion_below = 1 / (2 * n_bootstrap)
    elif proportion_below == 1:
        proportion_below = 1 - 1 / (2 * n_bootstrap)

    z0 = stats.norm.ppf(proportion_below)

    # =========================================
    # Step 4: Acceleration factor (a) via jackknife
    # =========================================
    jackknife_stats = np.zeros(n)
    for i in range(n):
        # Leave one out
        jackknife_sample = np.delete(data, i)
        jackknife_stats[i] = statistic_fn(jackknife_sample)

    jackknife_mean = np.mean(jackknife_stats)

    # Compute acceleration using skewness of jackknife influence
    numerator = np.sum((jackknife_mean - jackknife_stats)**3)
    denominator = 6 * (np.sum((jackknife_mean - jackknife_stats)**2)**(1.5))

    if denominator == 0:
        a = 0  # No acceleration (e.g., constant statistic)
    else:
        a = numerator / denominator

    # =========================================
    # Step 5: Compute adjusted percentiles
    # =========================================
    z_alpha_low = stats.norm.ppf(alpha / 2)
    z_alpha_high = stats.norm.ppf(1 - alpha / 2)

    # Adjusted percentile levels
    def adjusted_alpha(z_alpha):
        numerator = z0 + z_alpha
        denominator = 1 - a * (z0 + z_alpha)
        if denominator <= 0:
            # Edge case: return extreme percentile
            return 0.999 if z_alpha > 0 else 0.001
        return stats.norm.cdf(z0 + numerator / denominator)

    alpha_1 = adjusted_alpha(z_alpha_low)
    alpha_2 = adjusted_alpha(z_alpha_high)

    # Ensure valid percentiles
    alpha_1 = np.clip(alpha_1, 0.001, 0.999)
    alpha_2 = np.clip(alpha_2, 0.001, 0.999)

    # =========================================
    # Step 6: Extract BCa interval
    # =========================================
    bca_lower = np.percentile(bootstrap_stats, 100 * alpha_1)
    bca_upper = np.percentile(bootstrap_stats, 100 * alpha_2)

    # Percentile interval for comparison
    pct_lower = np.percentile(bootstrap_stats, 100 * alpha / 2)
    pct_upper = np.percentile(bootstrap_stats, 100 * (1 - alpha / 2))

    return {
        'bca': (bca_lower, bca_upper),
        'percentile': (pct_lower, pct_upper),
        'point_estimate': theta_hat,
        'z0': z0,
        'acceleration': a,
        'alpha_adjusted': (alpha_1, alpha_2),
        'alpha_original': (alpha / 2, 1 - alpha / 2),
        'bootstrap_mean': np.mean(bootstrap_stats),
        'bootstrap_sd': np.std(bootstrap_stats),
    }


def demonstrate_bca():
    """
    Demonstrate BCa on a skewed statistic where correction matters.
    """
    np.random.seed(42)

    # Draw from an exponential - the sample variance is a skewed statistic
    data = np.random.exponential(scale=2.0, size=50)

    # Statistic: sample variance
    result = bca_interval(data, np.var, n_bootstrap=10000)

    print("BCa Bootstrap Confidence Interval for Sample Variance")
    print("=" * 60)
    print(f"Point estimate: {result['point_estimate']:.4f}")
    print(f"Bootstrap mean: {result['bootstrap_mean']:.4f}")
    print()
    print("Correction factors:")
    print(f"  z0 (bias):        {result['z0']:.4f}")
    print(f"  a (acceleration): {result['acceleration']:.4f}")
    print()
    print("Adjusted percentiles:")
    print(f"  Original: ({result['alpha_original'][0]:.4f}, {result['alpha_original'][1]:.4f})")
    print(f"  Adjusted: ({result['alpha_adjusted'][0]:.4f}, {result['alpha_adjusted'][1]:.4f})")
    print()
    print("Confidence Intervals:")
    print(f"  Percentile: [{result['percentile'][0]:.4f}, {result['percentile'][1]:.4f}]")
    print(f"  BCa:        [{result['bca'][0]:.4f}, {result['bca'][1]:.4f}]")

    return result


if __name__ == "__main__":
    demonstrate_bca()
```

For symmetric statistics like the sample mean from symmetric populations, BCa reduces to the percentile method (z0 ≈ 0, a ≈ 0). For skewed statistics (variance, correlation, regression coefficients), BCa can provide substantially better coverage than simpler methods.
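If you prefer not to maintain a hand-rolled implementation, SciPy (1.7 and later) ships a bootstrap routine that supports the percentile, basic, and BCa methods. A minimal sketch, reusing the exponential-sample setup above; argument names may differ slightly across SciPy versions:

```python
import numpy as np
from scipy.stats import bootstrap

rng = np.random.default_rng(42)
data = rng.exponential(scale=2.0, size=50)

# scipy.stats.bootstrap expects the data as a sequence of samples
res = bootstrap(
    (data,),                # one-sample statistic
    np.var,                 # statistic to bootstrap
    confidence_level=0.95,
    n_resamples=10_000,
    method='BCa',           # also accepts 'percentile' and 'basic'
    random_state=42,
)
print(res.confidence_interval.low, res.confidence_interval.high)
```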
The bootstrap-t (or studentized bootstrap) method constructs intervals using a bootstrap analog of the t-statistic. It's particularly effective when the standard error of the statistic is well-estimated.
The Key Idea:
Instead of bootstrapping $\hat{\theta}$ directly, we bootstrap the t-statistic: $$t^* = \frac{\hat{\theta}^* - \hat{\theta}}{\widehat{SE}^*}$$
where $\widehat{SE}^*$ is the estimated standard error from the bootstrap sample (often requiring a nested bootstrap or analytical formula).
The Bootstrap-t Interval:
$$CI_{\text{boot-t}} = \left[\hat{\theta} - t^*_{(1-\alpha/2)} \cdot \widehat{SE},\; \hat{\theta} - t^*_{(\alpha/2)} \cdot \widehat{SE}\right]$$
Note the reversal of quantiles (similar to the basic interval).
```python
import numpy as np
from typing import Callable, Dict, Optional


def bootstrap_t_interval(data: np.ndarray,
                         statistic_fn: Callable[[np.ndarray], float],
                         se_fn: Optional[Callable[[np.ndarray], float]] = None,
                         confidence_level: float = 0.95,
                         n_bootstrap: int = 1000,
                         n_inner_bootstrap: int = 50,
                         random_state: int = 42) -> Dict:
    """
    Compute the bootstrap-t (studentized) confidence interval.

    This method bootstraps the t-statistic (θ̂* - θ̂)/SE* rather than θ̂
    directly. It requires estimating SE for each bootstrap sample.

    Parameters
    ----------
    data : np.ndarray
        Original sample
    statistic_fn : Callable
        Function computing statistic
    se_fn : Callable, optional
        Function computing SE analytically. If None, uses nested bootstrap.
    confidence_level : float
        Confidence level
    n_bootstrap : int
        Outer bootstrap samples
    n_inner_bootstrap : int
        Inner bootstrap for SE estimation (if se_fn is None)
    random_state : int
        Random seed

    Returns
    -------
    Dict
        Bootstrap-t interval and diagnostics
    """
    rng = np.random.RandomState(random_state)
    n = len(data)
    alpha = 1 - confidence_level

    # Original estimate and SE
    theta_hat = statistic_fn(data)

    if se_fn is not None:
        se_original = se_fn(data)
    else:
        # Estimate SE via bootstrap
        inner_stats = []
        for _ in range(200):
            inner_sample = data[rng.choice(n, size=n, replace=True)]
            inner_stats.append(statistic_fn(inner_sample))
        se_original = np.std(inner_stats, ddof=1)

    # Compute t-statistics for each bootstrap sample
    t_stats = []
    for b in range(n_bootstrap):
        # Outer bootstrap sample
        boot_indices = rng.choice(n, size=n, replace=True)
        boot_sample = data[boot_indices]
        theta_star = statistic_fn(boot_sample)

        # SE for this bootstrap sample
        if se_fn is not None:
            se_star = se_fn(boot_sample)
        else:
            # Nested bootstrap for SE
            nested_stats = []
            for _ in range(n_inner_bootstrap):
                nested_sample = boot_sample[rng.choice(n, size=n, replace=True)]
                nested_stats.append(statistic_fn(nested_sample))
            se_star = np.std(nested_stats, ddof=1)

        # Avoid division by zero
        if se_star > 0:
            t_star = (theta_star - theta_hat) / se_star
            t_stats.append(t_star)

    t_stats = np.array(t_stats)

    # Bootstrap-t interval
    # CI = [θ̂ - t*_{1-α/2} · SE, θ̂ - t*_{α/2} · SE]
    t_lower = np.percentile(t_stats, 100 * (1 - alpha / 2))
    t_upper = np.percentile(t_stats, 100 * alpha / 2)  # Note the reversal!

    ci_lower = theta_hat - t_lower * se_original
    ci_upper = theta_hat - t_upper * se_original

    return {
        'bootstrap_t': (ci_lower, ci_upper),
        'point_estimate': theta_hat,
        'se': se_original,
        't_quantiles': (t_lower, t_upper),
        't_mean': np.mean(t_stats),
        't_std': np.std(t_stats),
        'n_t_stats': len(t_stats),
    }


def analytical_se_mean(data: np.ndarray) -> float:
    """Standard error of the mean (analytical formula)."""
    return np.std(data, ddof=1) / np.sqrt(len(data))


def demonstrate_bootstrap_t():
    """
    Demonstrate bootstrap-t interval for the sample mean.

    For the mean, we can use the analytical SE formula.
    """
    np.random.seed(42)

    # Sample from heavy-tailed distribution (t with 3 df)
    from scipy.stats import t as t_dist
    data = t_dist.rvs(df=3, size=30)

    result = bootstrap_t_interval(
        data, np.mean, se_fn=analytical_se_mean, n_bootstrap=2000
    )

    print("Bootstrap-t Confidence Interval")
    print("=" * 50)
    print(f"Sample mean: {result['point_estimate']:.4f}")
    print(f"SE(mean):    {result['se']:.4f}")
    print()
    print(f"t-quantiles: ({result['t_quantiles'][0]:.4f}, {result['t_quantiles'][1]:.4f})")
    print(f"t-mean: {result['t_mean']:.4f}")
    print(f"t-std:  {result['t_std']:.4f}")
    print()
    print(f"Bootstrap-t CI: [{result['bootstrap_t'][0]:.4f}, {result['bootstrap_t'][1]:.4f}]")

    return result


if __name__ == "__main__":
    demonstrate_bootstrap_t()
```

Bootstrap-t can be computationally expensive when SE must be estimated via nested bootstrap. For B=1000 outer iterations and b=50 inner iterations, you need 50,000 model fits. Use analytical SE formulas when available (e.g., for the mean), or consider BCa as an alternative.
A confidence interval's coverage probability is the frequency with which it contains the true parameter value. For a 95% CI, we expect 95% coverage. When actual coverage differs from nominal, the interval is miscalibrated.
Types of Coverage Error: an interval can under-cover (actual coverage falls below the nominal level, typically because intervals are too narrow or mis-centered) or over-cover (actual coverage exceeds the nominal level, producing wider intervals than necessary). Under-coverage is usually the more serious failure because it overstates our certainty.
Assessing Coverage:
In practice, we can't directly measure coverage on real data (we don't know the true parameter). However, we can run simulation studies on synthetic data where the true parameter is known, as in the code below, and we can compare intervals from several methods on the same data: large disagreement is a warning sign of miscalibration.
| Method | Nominal 95% | Actual Coverage (typical) | Best Case | Worst Case |
|---|---|---|---|---|
| Percentile | 95% | 88-94% | Symmetric statistics | Biased/skewed statistics |
| Basic | 95% | 90-95% | Pivotal statistics | Non-pivotal, skewed |
| BCa | 95% | 93-96% | Broad range | Extreme skew, small n |
| Bootstrap-t | 95% | 94-96% | Pivotal w/ good SE | Poor SE estimation |
```python
import numpy as np
from typing import Callable, Dict


def coverage_simulation(true_theta: float,
                        data_generator: Callable[[int], np.ndarray],
                        statistic_fn: Callable[[np.ndarray], float],
                        interval_fn: Callable,
                        n_samples: int = 50,
                        n_simulations: int = 1000,
                        confidence_level: float = 0.95) -> Dict:
    """
    Estimate coverage probability of a confidence interval method.

    Parameters
    ----------
    true_theta : float
        True parameter value
    data_generator : Callable
        Function that generates samples: data_generator(n) -> np.ndarray
    statistic_fn : Callable
        Statistic function
    interval_fn : Callable
        Function computing CI: interval_fn(data, statistic_fn) -> (lower, upper)
    n_samples : int
        Sample size for each simulation
    n_simulations : int
        Number of simulation repetitions
    confidence_level : float
        Nominal confidence level

    Returns
    -------
    Dict
        Coverage probability and diagnostics
    """
    coverage_count = 0
    interval_widths = []
    lower_bounds = []
    upper_bounds = []

    for sim in range(n_simulations):
        # Generate data
        np.random.seed(sim)
        data = data_generator(n_samples)

        # Compute interval
        lower, upper = interval_fn(data, statistic_fn)

        # Check coverage
        if lower <= true_theta <= upper:
            coverage_count += 1

        interval_widths.append(upper - lower)
        lower_bounds.append(lower)
        upper_bounds.append(upper)

    coverage_prob = coverage_count / n_simulations

    # Standard error of coverage estimate
    se_coverage = np.sqrt(coverage_prob * (1 - coverage_prob) / n_simulations)

    return {
        'coverage_probability': coverage_prob,
        'se_coverage': se_coverage,
        'nominal_coverage': confidence_level,
        'coverage_error': coverage_prob - confidence_level,
        'mean_width': np.mean(interval_widths),
        'std_width': np.std(interval_widths),
        'mean_lower': np.mean(lower_bounds),
        'mean_upper': np.mean(upper_bounds),
        'n_simulations': n_simulations,
        'n_samples': n_samples,
    }


def compare_ci_methods_coverage():
    """
    Compare coverage of different bootstrap CI methods.
    """
    # True parameter: population mean = 2 (exponential scale)
    true_mean = 2.0

    # Data generator: exponential (skewed)
    def gen_exp(n):
        return np.random.exponential(scale=true_mean, size=n)

    # CI methods using simplified implementations
    def percentile_ci(data, stat_fn, n_boot=500):
        stats = []
        for _ in range(n_boot):
            sample = data[np.random.choice(len(data), len(data), replace=True)]
            stats.append(stat_fn(sample))
        return np.percentile(stats, 2.5), np.percentile(stats, 97.5)

    def basic_ci(data, stat_fn, n_boot=500):
        theta = stat_fn(data)
        stats = []
        for _ in range(n_boot):
            sample = data[np.random.choice(len(data), len(data), replace=True)]
            stats.append(stat_fn(sample))
        q_low = np.percentile(stats, 97.5)
        q_high = np.percentile(stats, 2.5)
        return 2 * theta - q_low, 2 * theta - q_high

    print("Coverage Simulation: Exponential Data, Sample Mean")
    print("=" * 60)
    print(f"True parameter (exponential mean): {true_mean}")
    print()

    for name, ci_fn in [("Percentile", percentile_ci), ("Basic", basic_ci)]:
        result = coverage_simulation(
            true_theta=true_mean,
            data_generator=gen_exp,
            statistic_fn=np.mean,
            interval_fn=ci_fn,
            n_samples=30,
            n_simulations=500
        )
        print(f"{name} Method:")
        print(f"  Coverage: {result['coverage_probability']:.3f} "
              f"(±{2 * result['se_coverage']:.3f})")
        print(f"  Coverage error: {result['coverage_error']:+.3f}")
        print(f"  Mean width: {result['mean_width']:.4f}")
        print()


if __name__ == "__main__":
    compare_ci_methods_coverage()
```

For most applications, BCa intervals provide the best balance of coverage accuracy and computational cost. Use at least B=2000 bootstrap samples for reliable interval estimation. When in doubt, report both the point estimate and interval width so readers can assess precision.
Different bootstrap CI methods excel in different situations. Here's a practical decision framework.
Decision Flowchart:
Do you need a quick, rough interval?
├── YES → Use PERCENTILE
└── NO → Is an analytical SE formula available?
├── YES → Use BOOTSTRAP-T
└── NO → Use BCa
              └── If only B < 1000 resamples are feasible, use PERCENTILE and note the limitation
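Purely as an illustration, the flowchart above could be encoded as a small helper; the function name and arguments here are hypothetical, not part of any library:

```python
def choose_ci_method(quick_and_rough: bool,
                     analytical_se_available: bool,
                     n_bootstrap_feasible: int) -> str:
    """Suggest a bootstrap CI method following the decision flowchart above.

    All argument names are illustrative; adapt the thresholds to your setting.
    """
    if quick_and_rough:
        return "percentile"
    if analytical_se_available:
        return "bootstrap-t"
    if n_bootstrap_feasible < 1000:
        # BCa percentile adjustments are unstable with few resamples
        return "percentile (report the limitation)"
    return "BCa"


# Example: no SE formula, 5000 resamples affordable -> "BCa"
print(choose_ci_method(False, False, 5000))
```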
Special Cases:
| Statistic | Recommended Method | Notes |
|---|---|---|
| Sample mean | Bootstrap-t or Normal theory | SE formula known |
| Sample median | BCa | SE depends on density at median |
| Correlation coefficient | BCa | Bounded, often skewed |
| Regression coefficients | Bootstrap-t or BCa | Depends on assumptions |
| AUC/Accuracy | BCa | Bounded in [0,1], values often near the upper boundary |
| Quantiles | BCa | SE depends on density |
No bootstrap method works well when: (1) the sample size is very small (n < 15); (2) the parameter is near a boundary (e.g., correlation near ±1); (3) the statistic is highly discrete; (4) the data distribution has no finite variance (extremely heavy tails). In these cases, consider exact methods, transformations, or reporting ranges instead of formal intervals.
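As one concrete instance of the transformation route for a bounded, skewed statistic, here is a sketch (not from the original text) that bootstraps a correlation on Fisher's z scale and back-transforms the percentile interval; the helper name is illustrative:

```python
import numpy as np

def fisher_z_percentile_ci(x, y, n_bootstrap=5000, confidence_level=0.95, seed=42):
    """Percentile CI for a correlation, built on Fisher's z scale.

    arctanh stabilizes the variance and removes the ±1 boundary; the interval
    is mapped back with tanh. A sketch for illustration, not a library routine.
    """
    rng = np.random.RandomState(seed)
    n = len(x)
    z_stats = np.empty(n_bootstrap)
    for b in range(n_bootstrap):
        idx = rng.choice(n, size=n, replace=True)
        r = np.corrcoef(x[idx], y[idx])[0, 1]
        z_stats[b] = np.arctanh(np.clip(r, -0.9999, 0.9999))
    alpha = 1 - confidence_level
    lo, hi = np.percentile(z_stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return np.tanh(lo), np.tanh(hi)
```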
In machine learning, we often want confidence intervals for performance metrics: accuracy, AUC, F1 score, RMSE, etc. Bootstrap provides a natural framework, but there are important considerations.
```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score, roc_auc_score
from scipy import stats
from typing import Dict, Callable


def bootstrap_metric_ci(X: np.ndarray,
                        y: np.ndarray,
                        model,
                        metric_fn: Callable,
                        n_bootstrap: int = 1000,
                        confidence_level: float = 0.95,
                        method: str = 'percentile',
                        random_state: int = 42) -> Dict:
    """
    Compute bootstrap CI for a model performance metric.

    Two approaches are possible:
    1. Resample predictions: Faster, assumes model is fixed
    2. Resample and retrain: Captures model variance, slower

    This implementation uses approach 2 (resample + retrain).

    Parameters
    ----------
    X : np.ndarray
        Features
    y : np.ndarray
        Labels
    model : sklearn estimator
        Model to evaluate
    metric_fn : Callable
        Metric function(y_true, y_pred) -> float
    n_bootstrap : int
        Number of bootstrap samples
    confidence_level : float
        Confidence level
    method : str
        'percentile', 'basic', or 'bca'
    random_state : int
        Random seed

    Returns
    -------
    Dict
        CI and diagnostics
    """
    rng = np.random.RandomState(random_state)
    n = len(y)
    alpha = 1 - confidence_level

    # Train on full data for point estimate
    model_full = clone(model)
    model_full.fit(X, y)
    y_pred_full = model_full.predict(X)
    point_estimate = metric_fn(y, y_pred_full)

    # Bootstrap
    bootstrap_metrics = []
    for b in range(n_bootstrap):
        # Resample
        indices = rng.choice(n, size=n, replace=True)
        X_boot, y_boot = X[indices], y[indices]

        # Out-of-bag for evaluation (optional but recommended)
        oob_mask = np.ones(n, dtype=bool)
        oob_mask[np.unique(indices)] = False

        if np.sum(oob_mask) < 5:
            # Not enough OOB samples, use in-bag
            X_test, y_test = X_boot, y_boot
        else:
            X_test, y_test = X[oob_mask], y[oob_mask]

        # Train and predict
        model_b = clone(model)
        model_b.fit(X_boot, y_boot)
        y_pred_b = model_b.predict(X_test)

        # Compute metric
        try:
            metric_value = metric_fn(y_test, y_pred_b)
            bootstrap_metrics.append(metric_value)
        except Exception:
            # Some metrics may fail if classes are missing
            pass

    bootstrap_metrics = np.array(bootstrap_metrics)

    # Compute interval based on method
    if method == 'percentile':
        lower = np.percentile(bootstrap_metrics, 100 * alpha / 2)
        upper = np.percentile(bootstrap_metrics, 100 * (1 - alpha / 2))
    elif method == 'basic':
        q_low = np.percentile(bootstrap_metrics, 100 * alpha / 2)
        q_high = np.percentile(bootstrap_metrics, 100 * (1 - alpha / 2))
        lower = 2 * point_estimate - q_high
        upper = 2 * point_estimate - q_low
    elif method == 'bca':
        # Simplified BCa (full implementation would use jackknife)
        z0 = stats.norm.ppf(np.mean(bootstrap_metrics < point_estimate))
        z0 = np.clip(z0, -3, 3)

        # Approximate acceleration (would need jackknife for exact)
        a = 0  # Simplification

        z_low = stats.norm.ppf(alpha / 2)
        z_high = stats.norm.ppf(1 - alpha / 2)

        alpha_1 = stats.norm.cdf(z0 + (z0 + z_low) / (1 - a * (z0 + z_low)))
        alpha_2 = stats.norm.cdf(z0 + (z0 + z_high) / (1 - a * (z0 + z_high)))
        alpha_1 = np.clip(alpha_1, 0.001, 0.999)
        alpha_2 = np.clip(alpha_2, 0.001, 0.999)

        lower = np.percentile(bootstrap_metrics, 100 * alpha_1)
        upper = np.percentile(bootstrap_metrics, 100 * alpha_2)
    else:
        raise ValueError(f"Unknown method: {method}")

    return {
        'point_estimate': point_estimate,
        'ci': (lower, upper),
        'ci_width': upper - lower,
        'method': method,
        'bootstrap_mean': np.mean(bootstrap_metrics),
        'bootstrap_std': np.std(bootstrap_metrics),
        'n_bootstrap': len(bootstrap_metrics),
    }


# Example: Comparing models with CIs
def compare_models_with_ci(X, y, models_dict, metric_fn, n_bootstrap=500):
    """
    Compare multiple models with bootstrap CIs for fair comparison.
    """
    results = {}
    for name, model in models_dict.items():
        result = bootstrap_metric_ci(X, y, model, metric_fn, n_bootstrap)
        results[name] = result
    return results


if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=200, n_features=20, random_state=42)

    models = {
        'Logistic Regression': LogisticRegression(max_iter=1000),
        'Decision Tree': DecisionTreeClassifier(max_depth=5),
        'Random Forest': RandomForestClassifier(n_estimators=50),
    }

    results = compare_models_with_ci(X, y, models, accuracy_score, n_bootstrap=300)

    print("Model Comparison with Bootstrap 95% CIs")
    print("=" * 60)
    for name, result in results.items():
        ci = result['ci']
        print(f"{name}:")
        print(f"  Accuracy: {result['point_estimate']:.4f}")
        print(f"  95% CI:   [{ci[0]:.4f}, {ci[1]:.4f}]")
        print(f"  Width:    {result['ci_width']:.4f}")
        print()
```

When computing CIs for model metrics, decide whether to resample predictions only (faster, ignores training variance) or resample-and-retrain (slower, captures full uncertainty). For model comparisons and publication, resample-and-retrain is preferred. For quick diagnostics, resampling predictions may suffice.
Bootstrap confidence intervals represent the practical culmination of bootstrap methodology—enabling rigorous uncertainty quantification for virtually any statistic without requiring analytical derivations.
Module Complete:
This completes our treatment of bootstrap methods for model evaluation, from basic resampling through the confidence interval constructions covered on this page.
These methods form a powerful toolkit for estimating and characterizing the generalization performance of machine learning models: from basic resampling to sophisticated confidence interval construction, they let you move beyond point estimates to quantify uncertainty with well-calibrated intervals and to make principled model comparisons in any machine learning context.