Confidence intervals represent one of the most fundamental tools in statistical inference—they quantify our uncertainty about unknown quantities. Traditionally, constructing confidence intervals required knowing or assuming the sampling distribution of a statistic, often limiting us to simple cases like the sample mean.
The bootstrap liberates us from this constraint. By simulating the sampling distribution through resampling, we can construct confidence intervals for any statistic: medians, correlation coefficients, regression parameters, model accuracy, AUC, and beyond. This universality is the bootstrap's greatest practical contribution.
However, not all bootstrap confidence interval methods are equal. In this page, we'll explore the major approaches—from the simple percentile method to the sophisticated BCa (bias-corrected and accelerated) intervals—understanding when each excels and when each fails.
By the end of this page, you will understand: (1) The percentile method—simple but limited; (2) The basic bootstrap interval and its relationship to pivotal statistics; (3) The BCa (bias-corrected and accelerated) method—the gold standard; (4) Bootstrap-t intervals for variance stabilization; (5) Coverage probability and interval calibration; (6) Practical guidelines for choosing the right method.
The percentile method is the simplest and most intuitive approach to bootstrap confidence intervals. It directly uses quantiles of the bootstrap distribution.
Algorithm: draw $B$ bootstrap samples from the data, compute the statistic $\hat{\theta}^*_b$ on each, and take the $\alpha/2$ and $1-\alpha/2$ quantiles of the resulting bootstrap statistics:
$$CI_{\text{percentile}} = \left[\hat{\theta}^*_{(\alpha/2)}, \hat{\theta}^*_{(1-\alpha/2)}\right]$$
For a 95% CI ($\alpha = 0.05$), this means taking the 2.5th and 97.5th percentiles.
Intuition: If the bootstrap distribution accurately represents the sampling distribution of $\hat{\theta}$, then the central 95% of the bootstrap distribution should capture the true $\theta$ 95% of the time.
```python
import numpy as np
from typing import Callable, Tuple


def percentile_interval(data: np.ndarray,
                        statistic_fn: Callable[[np.ndarray], float],
                        confidence_level: float = 0.95,
                        n_bootstrap: int = 10000,
                        random_state: int = 42) -> Tuple[float, float, np.ndarray]:
    """
    Compute bootstrap percentile confidence interval.

    The simplest bootstrap CI: take quantiles of the bootstrap distribution.

    Parameters
    ----------
    data : np.ndarray
        Original sample
    statistic_fn : Callable
        Function computing statistic from sample
    confidence_level : float
        Confidence level (default 0.95)
    n_bootstrap : int
        Number of bootstrap samples
    random_state : int
        Random seed

    Returns
    -------
    Tuple[float, float, np.ndarray]
        (lower_bound, upper_bound, bootstrap_statistics)
    """
    rng = np.random.RandomState(random_state)
    n = len(data)

    # Generate bootstrap statistics
    bootstrap_stats = np.zeros(n_bootstrap)
    for b in range(n_bootstrap):
        sample = data[rng.choice(n, size=n, replace=True)]
        bootstrap_stats[b] = statistic_fn(sample)

    # Compute percentile interval
    alpha = 1 - confidence_level
    lower = np.percentile(bootstrap_stats, 100 * alpha / 2)
    upper = np.percentile(bootstrap_stats, 100 * (1 - alpha / 2))

    return lower, upper, bootstrap_stats


def demonstrate_percentile_interval():
    """
    Demonstrate the percentile method for the sample median.

    The median is a great example because its sampling distribution is
    asymptotically normal but harder to derive analytically than the mean.
    """
    # Generate sample
    np.random.seed(42)
    data = np.random.exponential(scale=2.0, size=100)  # Skewed distribution

    # Point estimate
    sample_median = np.median(data)

    # Percentile interval for median
    lower, upper, boot_stats = percentile_interval(
        data, np.median, confidence_level=0.95, n_bootstrap=10000
    )

    print("Percentile Bootstrap CI for the Median")
    print("=" * 50)
    print(f"Sample median: {sample_median:.4f}")
    print(f"95% CI: [{lower:.4f}, {upper:.4f}]")
    print(f"CI width: {upper - lower:.4f}")
    print(f"Bootstrap SD: {np.std(boot_stats):.4f}")
    print(f"Bootstrap mean: {np.mean(boot_stats):.4f}")
    print(f"Bias (mean - point): {np.mean(boot_stats) - sample_median:.4f}")

    return lower, upper, sample_median, boot_stats


if __name__ == "__main__":
    demonstrate_percentile_interval()
```

The percentile method works well when the bootstrap distribution is symmetric and unbiased. However, it can have poor coverage (actual coverage differs from nominal 95%) when: (1) The statistic is biased; (2) The bootstrap distribution is skewed; (3) The standard error depends on the parameter value. These limitations motivate the more sophisticated methods we'll study next.
The basic bootstrap interval (also called the reverse percentile or pivotal interval) takes a different approach motivated by the concept of pivotal quantities.
Key Insight:
The bootstrap approximates the distribution of $\hat{\theta} - \theta$ (the error in our estimate). If we knew this distribution, we could construct an interval by 'inverting' it.
Let $\delta_\alpha = \hat{\theta}^*_{(\alpha)} - \hat{\theta}$ be the $\alpha$-quantile of the bootstrap errors $\hat{\theta}^* - \hat{\theta}$.
Using the bootstrap errors as a proxy for the true errors $\hat{\theta} - \theta$ and solving the resulting inequalities for $\theta$ gives: $$\theta \in \left[\hat{\theta} - \delta_{1-\alpha/2}, \hat{\theta} - \delta_{\alpha/2}\right]$$
The Formula:
Substituting, the basic bootstrap interval is: $$CI_{\text{basic}} = \left[2\hat{\theta} - \hat{\theta}^*_{(1-\alpha/2)}, 2\hat{\theta} - \hat{\theta}^*_{(\alpha/2)}\right]$$
Notice this 'reflects' the percentile interval around the point estimate $\hat{\theta}$.
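To see the reflection concretely, consider a small illustration with hypothetical numbers (not from the text): suppose $\hat{\theta} = 10$ and the percentile interval is $[\hat{\theta}^*_{(0.025)}, \hat{\theta}^*_{(0.975)}] = [9, 13]$. Then $$CI_{\text{basic}} = \left[2(10) - 13,\; 2(10) - 9\right] = [7, 11].$$ The bootstrap distribution extends farther above $\hat{\theta}$ than below it, so the basic interval extends farther below, consistent with the reasoning that the true $\theta$ probably lies below $\hat{\theta}$.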
```python
import numpy as np
from typing import Callable, Tuple, Dict


def basic_bootstrap_interval(data: np.ndarray,
                             statistic_fn: Callable[[np.ndarray], float],
                             confidence_level: float = 0.95,
                             n_bootstrap: int = 10000,
                             random_state: int = 42) -> Dict:
    """
    Compute the basic (reverse percentile) bootstrap interval.

    This method 'reflects' bootstrap quantiles around the point estimate.
    It's based on the distribution of (θ̂* - θ̂) approximating (θ̂ - θ).

    Formula: CI = [2θ̂ - θ̂*_{1-α/2}, 2θ̂ - θ̂*_{α/2}]

    Parameters
    ----------
    data : np.ndarray
        Original sample
    statistic_fn : Callable
        Function computing statistic
    confidence_level : float
        Confidence level
    n_bootstrap : int
        Number of bootstrap samples
    random_state : int
        Random seed

    Returns
    -------
    Dict
        Contains both basic and percentile intervals for comparison
    """
    rng = np.random.RandomState(random_state)
    n = len(data)

    # Point estimate
    theta_hat = statistic_fn(data)

    # Generate bootstrap statistics
    bootstrap_stats = np.zeros(n_bootstrap)
    for b in range(n_bootstrap):
        sample = data[rng.choice(n, size=n, replace=True)]
        bootstrap_stats[b] = statistic_fn(sample)

    alpha = 1 - confidence_level

    # Percentile interval (for comparison)
    q_lower = np.percentile(bootstrap_stats, 100 * alpha / 2)
    q_upper = np.percentile(bootstrap_stats, 100 * (1 - alpha / 2))

    # Basic (reverse) interval
    # CI = [2θ̂ - θ̂*_{1-α/2}, 2θ̂ - θ̂*_{α/2}]
    basic_lower = 2 * theta_hat - q_upper
    basic_upper = 2 * theta_hat - q_lower

    return {
        'point_estimate': theta_hat,
        'percentile': (q_lower, q_upper),
        'basic': (basic_lower, basic_upper),
        'bootstrap_mean': np.mean(bootstrap_stats),
        'bootstrap_sd': np.std(bootstrap_stats),
        'bias': np.mean(bootstrap_stats) - theta_hat,
    }


def compare_percentile_vs_basic():
    """
    Compare percentile and basic intervals on a skewed statistic.

    For asymmetric distributions, these methods give different intervals,
    and the difference reveals bias.
    """
    np.random.seed(42)

    # Exponential data - skewed, so the bootstrap distribution of the mean is asymmetric
    data = np.random.exponential(scale=2.0, size=50)
    true_mean = 2.0  # True population mean

    result = basic_bootstrap_interval(data, np.mean, n_bootstrap=10000)

    print("Comparison: Percentile vs Basic Bootstrap Interval")
    print("=" * 60)
    print(f"True parameter: {true_mean:.4f}")
    print(f"Point estimate: {result['point_estimate']:.4f}")
    print(f"Bootstrap bias: {result['bias']:.4f}")
    print()
    print(f"Percentile CI: [{result['percentile'][0]:.4f}, {result['percentile'][1]:.4f}]")
    print(f"Basic CI:      [{result['basic'][0]:.4f}, {result['basic'][1]:.4f}]")
    print()
    # When bias is near zero, these should be similar
    # The difference indicates the magnitude of correction
    print(f"Lower bound difference: {result['basic'][0] - result['percentile'][0]:.4f}")
    print(f"Upper bound difference: {result['basic'][1] - result['percentile'][1]:.4f}")

    return result


if __name__ == "__main__":
    compare_percentile_vs_basic()
```

When to Use Basic vs. Percentile:
| Scenario | Percentile | Basic |
|---|---|---|
| Symmetric, unbiased statistic | ✓ Similar | ✓ Similar |
| Biased statistic | Poor coverage | Better coverage |
| Skewed bootstrap distribution | May be wrong direction | May overcorrect |
| Transformation exists | Use transformed | Use transformed |
In practice, both methods have theoretical limitations that motivate the BCa method.
The basic interval effectively says: 'The bootstrap errors (θ̂* - θ̂) approximate the true errors (θ̂ - θ). So if the bootstrap says values above θ̂ are more likely, the true θ is probably below θ̂.' This reflection handles bias but not varying standard error.
The BCa (bias-corrected and accelerated) method is considered the gold standard for bootstrap confidence intervals. It addresses two key issues: bias in the bootstrap distribution (its median need not coincide with $\hat{\theta}$) and a standard error that changes with the parameter value (acceleration).
The BCa method adjusts the percentiles used for the interval based on two correction factors.
The BCa Interval:
$$CI_{\text{BCa}} = \left[\hat{\theta}^*_{(\alpha_1)}, \hat{\theta}^*_{(\alpha_2)}\right]$$
where the adjusted percentile levels are:
$$\alpha_1 = \Phi\left(\hat{z}_0 + \frac{\hat{z}_0 + z_{\alpha/2}}{1 - \hat{a}\left(\hat{z}_0 + z_{\alpha/2}\right)}\right)$$
$$\alpha_2 = \Phi\left(\hat{z}_0 + \frac{\hat{z}_0 + z_{1-\alpha/2}}{1 - \hat{a}\left(\hat{z}_0 + z_{1-\alpha/2}\right)}\right)$$
Here $\Phi$ is the standard normal CDF, $z_{\alpha/2}$ and $z_{1-\alpha/2}$ are the corresponding standard normal quantiles, $\hat{z}_0$ is the bias-correction factor, and $\hat{a}$ is the acceleration factor. The bias-correction factor $\hat{z}_0$ measures the median bias of the bootstrap distribution: how far the proportion of bootstrap statistics falling below the original estimate deviates from one half. The acceleration factor $\hat{a}$ measures how the standard error of the statistic changes with the parameter value, a form of 'curvature' in the problem.
Computing the Correction Factors:
Bias-correction factor $\hat{z}_0$: $$\hat{z}_0 = \Phi^{-1}\left(\frac{\#\{\hat{\theta}^*_b < \hat{\theta}\}}{B}\right)$$
This is the inverse-normal of the proportion of bootstrap statistics below the original estimate. If the bootstrap is unbiased, this proportion is 0.5 and $\hat{z}_0 = 0$.
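For example, with hypothetical numbers: if 4,000 of $B = 10{,}000$ bootstrap statistics fall below $\hat{\theta}$, then $\hat{z}_0 = \Phi^{-1}(0.40) \approx -0.25$. The bootstrap distribution sits above $\hat{\theta}$ more often than below it, and with $\hat{a} = 0$ the adjusted levels become $\alpha_1 = \Phi(2\hat{z}_0 + z_{\alpha/2}) \approx 0.007$ and $\alpha_2 = \Phi(2\hat{z}_0 + z_{1-\alpha/2}) \approx 0.93$, so both endpoints are pulled from lower percentiles of the bootstrap distribution to compensate.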
Acceleration factor $\hat{a}$:
The acceleration factor is estimated using the jackknife: $$\hat{a} = \frac{\sum_{i=1}^{n}\left(\hat{\theta}_{(\cdot)} - \hat{\theta}_{(i)}\right)^3}{6\left[\sum_{i=1}^{n}\left(\hat{\theta}_{(\cdot)} - \hat{\theta}_{(i)}\right)^2\right]^{3/2}}$$
where $\hat{\theta}_{(i)}$ is the statistic computed with the $i$-th observation removed and $\hat{\theta}_{(\cdot)} = \frac{1}{n}\sum_{i=1}^{n}\hat{\theta}_{(i)}$ is the mean of the jackknife values.
```python
import numpy as np
from scipy import stats
from typing import Callable, Tuple, Dict


def bca_interval(data: np.ndarray,
                 statistic_fn: Callable[[np.ndarray], float],
                 confidence_level: float = 0.95,
                 n_bootstrap: int = 10000,
                 random_state: int = 42) -> Dict:
    """
    Compute the BCa (bias-corrected and accelerated) bootstrap interval.

    The BCa method adjusts percentiles based on:
    - z0: bias correction (median bias of bootstrap distribution)
    - a: acceleration (how SE changes with parameter value)

    Parameters
    ----------
    data : np.ndarray
        Original sample
    statistic_fn : Callable
        Function computing statistic
    confidence_level : float
        Confidence level
    n_bootstrap : int
        Number of bootstrap samples
    random_state : int
        Random seed

    Returns
    -------
    Dict
        BCa interval, percentile interval (for comparison), and diagnostics
    """
    rng = np.random.RandomState(random_state)
    n = len(data)
    alpha = 1 - confidence_level

    # =========================================
    # Step 1: Point estimate
    # =========================================
    theta_hat = statistic_fn(data)

    # =========================================
    # Step 2: Bootstrap distribution
    # =========================================
    bootstrap_stats = np.zeros(n_bootstrap)
    for b in range(n_bootstrap):
        sample = data[rng.choice(n, size=n, replace=True)]
        bootstrap_stats[b] = statistic_fn(sample)

    # =========================================
    # Step 3: Bias-correction factor (z0)
    # =========================================
    # Proportion of bootstrap stats below original estimate
    proportion_below = np.mean(bootstrap_stats < theta_hat)

    # Handle edge cases
    if proportion_below == 0:
        proportion_below = 1 / (2 * n_bootstrap)
    elif proportion_below == 1:
        proportion_below = 1 - 1 / (2 * n_bootstrap)

    z0 = stats.norm.ppf(proportion_below)

    # =========================================
    # Step 4: Acceleration factor (a) via jackknife
    # =========================================
    jackknife_stats = np.zeros(n)
    for i in range(n):
        # Leave one out
        jackknife_sample = np.delete(data, i)
        jackknife_stats[i] = statistic_fn(jackknife_sample)

    jackknife_mean = np.mean(jackknife_stats)

    # Compute acceleration using skewness of jackknife influence
    numerator = np.sum((jackknife_mean - jackknife_stats)**3)
    denominator = 6 * (np.sum((jackknife_mean - jackknife_stats)**2)**(1.5))

    if denominator == 0:
        a = 0  # No acceleration (e.g., constant statistic)
    else:
        a = numerator / denominator

    # =========================================
    # Step 5: Compute adjusted percentiles
    # =========================================
    z_alpha_low = stats.norm.ppf(alpha / 2)
    z_alpha_high = stats.norm.ppf(1 - alpha / 2)

    # Adjusted percentile levels
    def adjusted_alpha(z_alpha):
        numerator = z0 + z_alpha
        denominator = 1 - a * (z0 + z_alpha)
        if denominator <= 0:
            # Edge case: return extreme percentile
            return 0.999 if z_alpha > 0 else 0.001
        return stats.norm.cdf(z0 + numerator / denominator)

    alpha_1 = adjusted_alpha(z_alpha_low)
    alpha_2 = adjusted_alpha(z_alpha_high)

    # Ensure valid percentiles
    alpha_1 = np.clip(alpha_1, 0.001, 0.999)
    alpha_2 = np.clip(alpha_2, 0.001, 0.999)

    # =========================================
    # Step 6: Extract BCa interval
    # =========================================
    bca_lower = np.percentile(bootstrap_stats, 100 * alpha_1)
    bca_upper = np.percentile(bootstrap_stats, 100 * alpha_2)

    # Percentile interval for comparison
    pct_lower = np.percentile(bootstrap_stats, 100 * alpha / 2)
    pct_upper = np.percentile(bootstrap_stats, 100 * (1 - alpha / 2))

    return {
        'bca': (bca_lower, bca_upper),
        'percentile': (pct_lower, pct_upper),
        'point_estimate': theta_hat,
        'z0': z0,
        'acceleration': a,
        'alpha_adjusted': (alpha_1, alpha_2),
        'alpha_original': (alpha / 2, 1 - alpha / 2),
        'bootstrap_mean': np.mean(bootstrap_stats),
        'bootstrap_sd': np.std(bootstrap_stats),
    }


def demonstrate_bca():
    """
    Demonstrate BCa on a skewed statistic where correction matters.
    """
    np.random.seed(42)

    # Draw from an exponential - the sample variance is a skewed statistic
    data = np.random.exponential(scale=2.0, size=50)

    # Statistic: sample variance
    result = bca_interval(data, np.var, n_bootstrap=10000)

    print("BCa Bootstrap Confidence Interval for Sample Variance")
    print("=" * 60)
    print(f"Point estimate: {result['point_estimate']:.4f}")
    print(f"Bootstrap mean: {result['bootstrap_mean']:.4f}")
    print()
    print("Correction factors:")
    print(f"  z0 (bias):        {result['z0']:.4f}")
    print(f"  a (acceleration): {result['acceleration']:.4f}")
    print()
    print("Adjusted percentiles:")
    print(f"  Original: ({result['alpha_original'][0]:.4f}, {result['alpha_original'][1]:.4f})")
    print(f"  Adjusted: ({result['alpha_adjusted'][0]:.4f}, {result['alpha_adjusted'][1]:.4f})")
    print()
    print("Confidence Intervals:")
    print(f"  Percentile: [{result['percentile'][0]:.4f}, {result['percentile'][1]:.4f}]")
    print(f"  BCa:        [{result['bca'][0]:.4f}, {result['bca'][1]:.4f}]")

    return result


if __name__ == "__main__":
    demonstrate_bca()
```

For symmetric statistics like the sample mean from symmetric populations, BCa reduces to the percentile method (z0 ≈ 0, a ≈ 0). For skewed statistics (variance, correlation, regression coefficients), BCa can provide substantially better coverage than simpler methods.
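If you prefer not to maintain a hand-rolled implementation, SciPy (1.7 and later) ships a bootstrap routine that supports the percentile, basic, and BCa methods. A minimal sketch, reusing the exponential-sample setup above; argument names may differ slightly across SciPy versions:

```python
import numpy as np
from scipy.stats import bootstrap

rng = np.random.default_rng(42)
data = rng.exponential(scale=2.0, size=50)

# scipy.stats.bootstrap expects the data as a sequence of samples
res = bootstrap(
    (data,),                # one-sample statistic
    np.var,                 # statistic to bootstrap
    confidence_level=0.95,
    n_resamples=10_000,
    method='BCa',           # also accepts 'percentile' and 'basic'
    random_state=42,
)
print(res.confidence_interval.low, res.confidence_interval.high)
```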
The bootstrap-t (or studentized bootstrap) method constructs intervals using a bootstrap analog of the t-statistic. It's particularly effective when the standard error of the statistic is well-estimated.
The Key Idea:
Instead of bootstrapping $\hat{\theta}$ directly, we bootstrap the t-statistic: $$t^* = \frac{\hat{\theta}^* - \hat{\theta}}{\widehat{SE}^*}$$
where $\widehat{SE}^*$ is the estimated standard error from the bootstrap sample (often requiring a nested bootstrap or analytical formula).
The Bootstrap-t Interval:
$$CI_{\text{boot-t}} = \left[\hat{\theta} - t^*_{(1-\alpha/2)} \cdot \widehat{SE},\; \hat{\theta} - t^*_{(\alpha/2)} \cdot \widehat{SE}\right]$$
Note the reversal of quantiles (similar to the basic interval).
```python
import numpy as np
from typing import Callable, Dict, Optional


def bootstrap_t_interval(data: np.ndarray,
                         statistic_fn: Callable[[np.ndarray], float],
                         se_fn: Optional[Callable[[np.ndarray], float]] = None,
                         confidence_level: float = 0.95,
                         n_bootstrap: int = 1000,
                         n_inner_bootstrap: int = 50,
                         random_state: int = 42) -> Dict:
    """
    Compute the bootstrap-t (studentized) confidence interval.

    This method bootstraps the t-statistic (θ̂* - θ̂)/SE* rather than θ̂
    directly. It requires estimating SE for each bootstrap sample.

    Parameters
    ----------
    data : np.ndarray
        Original sample
    statistic_fn : Callable
        Function computing statistic
    se_fn : Callable, optional
        Function computing SE analytically. If None, uses nested bootstrap.
    confidence_level : float
        Confidence level
    n_bootstrap : int
        Outer bootstrap samples
    n_inner_bootstrap : int
        Inner bootstrap for SE estimation (if se_fn is None)
    random_state : int
        Random seed

    Returns
    -------
    Dict
        Bootstrap-t interval and diagnostics
    """
    rng = np.random.RandomState(random_state)
    n = len(data)
    alpha = 1 - confidence_level

    # Original estimate and SE
    theta_hat = statistic_fn(data)

    if se_fn is not None:
        se_original = se_fn(data)
    else:
        # Estimate SE via bootstrap
        inner_stats = []
        for _ in range(200):
            inner_sample = data[rng.choice(n, size=n, replace=True)]
            inner_stats.append(statistic_fn(inner_sample))
        se_original = np.std(inner_stats, ddof=1)

    # Compute t-statistics for each bootstrap sample
    t_stats = []
    for b in range(n_bootstrap):
        # Outer bootstrap sample
        boot_indices = rng.choice(n, size=n, replace=True)
        boot_sample = data[boot_indices]
        theta_star = statistic_fn(boot_sample)

        # SE for this bootstrap sample
        if se_fn is not None:
            se_star = se_fn(boot_sample)
        else:
            # Nested bootstrap for SE
            nested_stats = []
            for _ in range(n_inner_bootstrap):
                nested_sample = boot_sample[rng.choice(n, size=n, replace=True)]
                nested_stats.append(statistic_fn(nested_sample))
            se_star = np.std(nested_stats, ddof=1)

        # Avoid division by zero
        if se_star > 0:
            t_star = (theta_star - theta_hat) / se_star
            t_stats.append(t_star)

    t_stats = np.array(t_stats)

    # Bootstrap-t interval
    # CI = [θ̂ - t*_{1-α/2} · SE, θ̂ - t*_{α/2} · SE]
    t_lower = np.percentile(t_stats, 100 * (1 - alpha / 2))
    t_upper = np.percentile(t_stats, 100 * alpha / 2)  # Note the reversal!

    ci_lower = theta_hat - t_lower * se_original
    ci_upper = theta_hat - t_upper * se_original

    return {
        'bootstrap_t': (ci_lower, ci_upper),
        'point_estimate': theta_hat,
        'se': se_original,
        't_quantiles': (t_lower, t_upper),
        't_mean': np.mean(t_stats),
        't_std': np.std(t_stats),
        'n_t_stats': len(t_stats),
    }


def analytical_se_mean(data: np.ndarray) -> float:
    """Standard error of the mean (analytical formula)."""
    return np.std(data, ddof=1) / np.sqrt(len(data))


def demonstrate_bootstrap_t():
    """
    Demonstrate bootstrap-t interval for the sample mean.

    For the mean, we can use the analytical SE formula.
    """
    np.random.seed(42)

    # Sample from heavy-tailed distribution (t with 3 df)
    from scipy.stats import t as t_dist
    data = t_dist.rvs(df=3, size=30)

    result = bootstrap_t_interval(
        data, np.mean, se_fn=analytical_se_mean, n_bootstrap=2000
    )

    print("Bootstrap-t Confidence Interval")
    print("=" * 50)
    print(f"Sample mean: {result['point_estimate']:.4f}")
    print(f"SE(mean):    {result['se']:.4f}")
    print()
    print(f"t-quantiles: ({result['t_quantiles'][0]:.4f}, {result['t_quantiles'][1]:.4f})")
    print(f"t-mean: {result['t_mean']:.4f}")
    print(f"t-std:  {result['t_std']:.4f}")
    print()
    print(f"Bootstrap-t CI: [{result['bootstrap_t'][0]:.4f}, {result['bootstrap_t'][1]:.4f}]")

    return result


if __name__ == "__main__":
    demonstrate_bootstrap_t()
```

Bootstrap-t can be computationally expensive when SE must be estimated via nested bootstrap. For B=1000 outer iterations and b=50 inner iterations, you need 50,000 model fits. Use analytical SE formulas when available (e.g., for the mean), or consider BCa as an alternative.
A confidence interval's coverage probability is the frequency with which it contains the true parameter value. For a 95% CI, we expect 95% coverage. When actual coverage differs from nominal, the interval is miscalibrated.
Types of Coverage Error: an interval can under-cover (actual coverage falls below the nominal level, typically because intervals are too narrow or mis-centered) or over-cover (actual coverage exceeds the nominal level, producing wider intervals than necessary). Under-coverage is usually the more serious failure because it overstates our certainty.
Assessing Coverage:
In practice, we can't directly measure coverage on real data (we don't know the true parameter). However, we can run simulation studies on synthetic data where the true parameter is known, as in the code below, and we can compare intervals from several methods on the same data: large disagreement is a warning sign of miscalibration.
| Method | Nominal 95% | Actual Coverage (typical) | Best Case | Worst Case |
|---|---|---|---|---|
| Percentile | 95% | 88-94% | Symmetric statistics | Biased/skewed statistics |
| Basic | 95% | 90-95% | Pivotal statistics | Non-pivotal, skewed |
| BCa | 95% | 93-96% | Broad range | Extreme skew, small n |
| Bootstrap-t | 95% | 94-96% | Pivotal w/ good SE | Poor SE estimation |
```python
import numpy as np
from typing import Callable, Dict


def coverage_simulation(true_theta: float,
                        data_generator: Callable[[int], np.ndarray],
                        statistic_fn: Callable[[np.ndarray], float],
                        interval_fn: Callable,
                        n_samples: int = 50,
                        n_simulations: int = 1000,
                        confidence_level: float = 0.95) -> Dict:
    """
    Estimate coverage probability of a confidence interval method.

    Parameters
    ----------
    true_theta : float
        True parameter value
    data_generator : Callable
        Function that generates samples: data_generator(n) -> np.ndarray
    statistic_fn : Callable
        Statistic function
    interval_fn : Callable
        Function computing CI: interval_fn(data, statistic_fn) -> (lower, upper)
    n_samples : int
        Sample size for each simulation
    n_simulations : int
        Number of simulation repetitions
    confidence_level : float
        Nominal confidence level

    Returns
    -------
    Dict
        Coverage probability and diagnostics
    """
    coverage_count = 0
    interval_widths = []
    lower_bounds = []
    upper_bounds = []

    for sim in range(n_simulations):
        # Generate data
        np.random.seed(sim)
        data = data_generator(n_samples)

        # Compute interval
        lower, upper = interval_fn(data, statistic_fn)

        # Check coverage
        if lower <= true_theta <= upper:
            coverage_count += 1

        interval_widths.append(upper - lower)
        lower_bounds.append(lower)
        upper_bounds.append(upper)

    coverage_prob = coverage_count / n_simulations

    # Standard error of coverage estimate
    se_coverage = np.sqrt(coverage_prob * (1 - coverage_prob) / n_simulations)

    return {
        'coverage_probability': coverage_prob,
        'se_coverage': se_coverage,
        'nominal_coverage': confidence_level,
        'coverage_error': coverage_prob - confidence_level,
        'mean_width': np.mean(interval_widths),
        'std_width': np.std(interval_widths),
        'mean_lower': np.mean(lower_bounds),
        'mean_upper': np.mean(upper_bounds),
        'n_simulations': n_simulations,
        'n_samples': n_samples,
    }


def compare_ci_methods_coverage():
    """
    Compare coverage of different bootstrap CI methods.
    """
    # True parameter: population mean = 2 (exponential scale)
    true_mean = 2.0

    # Data generator: exponential (skewed)
    def gen_exp(n):
        return np.random.exponential(scale=true_mean, size=n)

    # CI methods using simplified implementations
    def percentile_ci(data, stat_fn, n_boot=500):
        stats = []
        for _ in range(n_boot):
            sample = data[np.random.choice(len(data), len(data), replace=True)]
            stats.append(stat_fn(sample))
        return np.percentile(stats, 2.5), np.percentile(stats, 97.5)

    def basic_ci(data, stat_fn, n_boot=500):
        theta = stat_fn(data)
        stats = []
        for _ in range(n_boot):
            sample = data[np.random.choice(len(data), len(data), replace=True)]
            stats.append(stat_fn(sample))
        q_low = np.percentile(stats, 97.5)
        q_high = np.percentile(stats, 2.5)
        return 2 * theta - q_low, 2 * theta - q_high

    print("Coverage Simulation: Exponential Data, Sample Mean")
    print("=" * 60)
    print(f"True parameter (exponential mean): {true_mean}")
    print()

    for name, ci_fn in [("Percentile", percentile_ci), ("Basic", basic_ci)]:
        result = coverage_simulation(
            true_theta=true_mean,
            data_generator=gen_exp,
            statistic_fn=np.mean,
            interval_fn=ci_fn,
            n_samples=30,
            n_simulations=500
        )
        print(f"{name} Method:")
        print(f"  Coverage: {result['coverage_probability']:.3f} "
              f"(±{2 * result['se_coverage']:.3f})")
        print(f"  Coverage error: {result['coverage_error']:+.3f}")
        print(f"  Mean width: {result['mean_width']:.4f}")
        print()


if __name__ == "__main__":
    compare_ci_methods_coverage()
```

For most applications, BCa intervals provide the best balance of coverage accuracy and computational cost. Use at least B=2000 bootstrap samples for reliable interval estimation. When in doubt, report both the point estimate and interval width so readers can assess precision.
Different bootstrap CI methods excel in different situations. Here's a practical decision framework.
Decision Flowchart:
Do you need a quick, rough interval?
├── YES → Use PERCENTILE
└── NO → Is an analytical SE formula available?
├── YES → Use BOOTSTRAP-T
└── NO → Use BCa
              └── If only B < 1000 resamples are feasible, use PERCENTILE and note the limitation
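Purely as an illustration, the flowchart above could be encoded as a small helper; the function name and arguments here are hypothetical, not part of any library:

```python
def choose_ci_method(quick_and_rough: bool,
                     analytical_se_available: bool,
                     n_bootstrap_feasible: int) -> str:
    """Suggest a bootstrap CI method following the decision flowchart above.

    All argument names are illustrative; adapt the thresholds to your setting.
    """
    if quick_and_rough:
        return "percentile"
    if analytical_se_available:
        return "bootstrap-t"
    if n_bootstrap_feasible < 1000:
        # BCa percentile adjustments are unstable with few resamples
        return "percentile (report the limitation)"
    return "BCa"


# Example: no SE formula, 5000 resamples affordable -> "BCa"
print(choose_ci_method(False, False, 5000))
```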
Special Cases:
| Statistic | Recommended Method | Notes |
|---|---|---|
| Sample mean | Bootstrap-t or Normal theory | SE formula known |
| Sample median | BCa | SE depends on density at median |
| Correlation coefficient | BCa | Bounded, often skewed |
| Regression coefficients | Bootstrap-t or BCa | Depends on assumptions |
| AUC/Accuracy | BCa | Bounded in [0,1], values often near the upper boundary |
| Quantiles | BCa | SE depends on density |
No bootstrap method works well when: (1) the sample size is very small (n < 15); (2) the parameter is near a boundary (e.g., correlation near ±1); (3) the statistic is highly discrete; (4) the data distribution has no finite variance (extremely heavy tails). In these cases, consider exact methods, transformations, or reporting ranges instead of formal intervals.
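As one concrete instance of the transformation route for a bounded, skewed statistic, here is a sketch (not from the original text) that bootstraps a correlation on Fisher's z scale and back-transforms the percentile interval; the helper name is illustrative:

```python
import numpy as np

def fisher_z_percentile_ci(x, y, n_bootstrap=5000, confidence_level=0.95, seed=42):
    """Percentile CI for a correlation, built on Fisher's z scale.

    arctanh stabilizes the variance and removes the ±1 boundary; the interval
    is mapped back with tanh. A sketch for illustration, not a library routine.
    """
    rng = np.random.RandomState(seed)
    n = len(x)
    z_stats = np.empty(n_bootstrap)
    for b in range(n_bootstrap):
        idx = rng.choice(n, size=n, replace=True)
        r = np.corrcoef(x[idx], y[idx])[0, 1]
        z_stats[b] = np.arctanh(np.clip(r, -0.9999, 0.9999))
    alpha = 1 - confidence_level
    lo, hi = np.percentile(z_stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return np.tanh(lo), np.tanh(hi)
```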
In machine learning, we often want confidence intervals for performance metrics: accuracy, AUC, F1 score, RMSE, etc. Bootstrap provides a natural framework, but there are important considerations.
```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score, roc_auc_score
from scipy import stats
from typing import Dict, Callable


def bootstrap_metric_ci(X: np.ndarray,
                        y: np.ndarray,
                        model,
                        metric_fn: Callable,
                        n_bootstrap: int = 1000,
                        confidence_level: float = 0.95,
                        method: str = 'percentile',
                        random_state: int = 42) -> Dict:
    """
    Compute bootstrap CI for a model performance metric.

    Two approaches are possible:
    1. Resample predictions: Faster, assumes model is fixed
    2. Resample and retrain: Captures model variance, slower

    This implementation uses approach 2 (resample + retrain).

    Parameters
    ----------
    X : np.ndarray
        Features
    y : np.ndarray
        Labels
    model : sklearn estimator
        Model to evaluate
    metric_fn : Callable
        Metric function(y_true, y_pred) -> float
    n_bootstrap : int
        Number of bootstrap samples
    confidence_level : float
        Confidence level
    method : str
        'percentile', 'basic', or 'bca'
    random_state : int
        Random seed

    Returns
    -------
    Dict
        CI and diagnostics
    """
    rng = np.random.RandomState(random_state)
    n = len(y)
    alpha = 1 - confidence_level

    # Train on full data for point estimate
    model_full = clone(model)
    model_full.fit(X, y)
    y_pred_full = model_full.predict(X)
    point_estimate = metric_fn(y, y_pred_full)

    # Bootstrap
    bootstrap_metrics = []
    for b in range(n_bootstrap):
        # Resample
        indices = rng.choice(n, size=n, replace=True)
        X_boot, y_boot = X[indices], y[indices]

        # Out-of-bag for evaluation (optional but recommended)
        oob_mask = np.ones(n, dtype=bool)
        oob_mask[np.unique(indices)] = False

        if np.sum(oob_mask) < 5:
            # Not enough OOB samples, use in-bag
            X_test, y_test = X_boot, y_boot
        else:
            X_test, y_test = X[oob_mask], y[oob_mask]

        # Train and predict
        model_b = clone(model)
        model_b.fit(X_boot, y_boot)
        y_pred_b = model_b.predict(X_test)

        # Compute metric
        try:
            metric_value = metric_fn(y_test, y_pred_b)
            bootstrap_metrics.append(metric_value)
        except Exception:
            # Some metrics may fail if classes are missing
            pass

    bootstrap_metrics = np.array(bootstrap_metrics)

    # Compute interval based on method
    if method == 'percentile':
        lower = np.percentile(bootstrap_metrics, 100 * alpha / 2)
        upper = np.percentile(bootstrap_metrics, 100 * (1 - alpha / 2))
    elif method == 'basic':
        q_low = np.percentile(bootstrap_metrics, 100 * alpha / 2)
        q_high = np.percentile(bootstrap_metrics, 100 * (1 - alpha / 2))
        lower = 2 * point_estimate - q_high
        upper = 2 * point_estimate - q_low
    elif method == 'bca':
        # Simplified BCa (full implementation would use jackknife)
        z0 = stats.norm.ppf(np.mean(bootstrap_metrics < point_estimate))
        z0 = np.clip(z0, -3, 3)

        # Approximate acceleration (would need jackknife for exact)
        a = 0  # Simplification

        z_low = stats.norm.ppf(alpha / 2)
        z_high = stats.norm.ppf(1 - alpha / 2)

        alpha_1 = stats.norm.cdf(z0 + (z0 + z_low) / (1 - a * (z0 + z_low)))
        alpha_2 = stats.norm.cdf(z0 + (z0 + z_high) / (1 - a * (z0 + z_high)))
        alpha_1 = np.clip(alpha_1, 0.001, 0.999)
        alpha_2 = np.clip(alpha_2, 0.001, 0.999)

        lower = np.percentile(bootstrap_metrics, 100 * alpha_1)
        upper = np.percentile(bootstrap_metrics, 100 * alpha_2)
    else:
        raise ValueError(f"Unknown method: {method}")

    return {
        'point_estimate': point_estimate,
        'ci': (lower, upper),
        'ci_width': upper - lower,
        'method': method,
        'bootstrap_mean': np.mean(bootstrap_metrics),
        'bootstrap_std': np.std(bootstrap_metrics),
        'n_bootstrap': len(bootstrap_metrics),
    }


# Example: Comparing models with CIs
def compare_models_with_ci(X, y, models_dict, metric_fn, n_bootstrap=500):
    """
    Compare multiple models with bootstrap CIs for fair comparison.
    """
    results = {}
    for name, model in models_dict.items():
        result = bootstrap_metric_ci(X, y, model, metric_fn, n_bootstrap)
        results[name] = result
    return results


if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=200, n_features=20, random_state=42)

    models = {
        'Logistic Regression': LogisticRegression(max_iter=1000),
        'Decision Tree': DecisionTreeClassifier(max_depth=5),
        'Random Forest': RandomForestClassifier(n_estimators=50),
    }

    results = compare_models_with_ci(X, y, models, accuracy_score, n_bootstrap=300)

    print("Model Comparison with Bootstrap 95% CIs")
    print("=" * 60)
    for name, result in results.items():
        ci = result['ci']
        print(f"{name}:")
        print(f"  Accuracy: {result['point_estimate']:.4f}")
        print(f"  95% CI:   [{ci[0]:.4f}, {ci[1]:.4f}]")
        print(f"  Width:    {result['ci_width']:.4f}")
        print()
```

When computing CIs for model metrics, decide whether to resample predictions only (faster, ignores training variance) or resample-and-retrain (slower, captures full uncertainty). For model comparisons and publication, resample-and-retrain is preferred. For quick diagnostics, resampling predictions may suffice.
Bootstrap confidence intervals represent the practical culmination of bootstrap methodology—enabling rigorous uncertainty quantification for virtually any statistic without requiring analytical derivations.
Module Complete:
This completes our treatment of bootstrap methods for model evaluation, from basic resampling through the confidence interval constructions covered on this page.
These methods form a powerful toolkit for estimating and characterizing the generalization performance of machine learning models: from basic resampling to sophisticated confidence interval construction, they let you move beyond point estimates to quantify uncertainty with well-calibrated intervals and to make principled model comparisons in any machine learning context.