When estimating a model's generalization error using the bootstrap, we face a fundamental tension between two biased estimators:
The Apparent Error (Training Error): Evaluating a model on its training data yields an optimistically biased estimate. The model has been fitted to minimize error on this exact data, so it performs better on training data than on new data. For flexible models that can interpolate, the apparent error can approach zero even when true generalization error is substantial.
The Out-of-Bootstrap (OOB) Error: Evaluating each observation only when it's excluded from the bootstrap sample yields a pessimistically biased estimate. Each bootstrap sample contains only about 63.2% of unique observations (on average), so models are trained on effectively smaller samples than the full dataset. This upward bias in error can be substantial for complex models or small datasets.
The .632 bootstrap elegantly resolves this tension through a weighted combination that cancels the biases.
By the end of this page, you will understand: (1) The mathematical derivation of the .632 weighting; (2) Why this specific combination reduces bias; (3) The assumptions under which .632 is optimal; (4) Detailed implementation for machine learning model evaluation; (5) When .632 works well and when it fails.
To understand the .632 bootstrap, we must first carefully characterize the biases in our constituent estimators.
Notation: $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ denotes the observed training data of size $n$, $\hat{f}^D$ the model fit to $D$, $L(y, \hat{y})$ the loss function, and $\text{Err}$ the true generalization error of $\hat{f}^D$ on new observations drawn from the same distribution.
The Apparent Error:
The apparent error is defined as: $$\bar{\text{err}} = \frac{1}{n} \sum_{i=1}^{n} L(y_i, \hat{f}^D(x_i))$$
where $L$ is the loss function. This evaluates on the same data used for training, yielding: $$E[\bar{\text{err}}] < \text{Err}$$
The downward bias is called the optimism. Efron showed that: $$\text{optimism} = \text{Err} - E[\bar{\text{err}}] = \frac{2}{n} \sum_{i=1}^{n} \text{Cov}(y_i, \hat{f}^D(x_i))$$
This covariance is positive because the model's predictions at $x_i$ are influenced by the response $y_i$—particularly for flexible models.
The optimism is directly related to the effective degrees of freedom (model complexity). For linear regression with p predictors, the optimism equals 2p/n times the noise variance. More complex models have higher optimism—they fit the noise more aggressively.
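To make the linear-regression case concrete, here is a short sketch under the standard homoskedastic-noise assumption, writing $H = X(X^\top X)^{-1}X^\top$ for the hat matrix (so $\hat{y} = Hy$; this notation is introduced only for this calculation):
$$\sum_{i=1}^{n} \text{Cov}(y_i, \hat{f}^D(x_i)) = \sum_{i=1}^{n} H_{ii}\,\sigma^2 = \sigma^2 \operatorname{tr}(H) = p\,\sigma^2, \qquad \text{so} \qquad \text{optimism} = \frac{2\,p\,\sigma^2}{n}.$$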
The Out-of-Bootstrap Error:
The leave-one-out bootstrap error is: $$\widehat{\text{Err}}^{(1)} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{|C^{-i}|} \sum_{b \in C^{-i}} L(y_i, \hat{f}^{*b}(x_i))$$
where $C^{-i}$ is the set of bootstrap samples that do not contain observation $i$, and $\hat{f}^{*b}$ is the model trained on bootstrap sample $b$.
This estimator is nearly unbiased for the error rate of a model trained on a sample of size $n(1 - e^{-1}) \approx 0.632n$—not size $n$. Since learning curves typically decrease with training set size, this gives: $$E[\widehat{\text{Err}}^{(1)}] > \text{Err}$$
The upward bias occurs because we're effectively estimating error for a smaller training set than we actually have.
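This effect is easy to see directly. The sketch below is illustrative only (the dataset, model, and the 63.2% subsampling fraction are chosen for demonstration, assuming scikit-learn is available): it trains the same model on a full training split and on random 63.2% subsamples, then compares holdout error.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A model trained on ~63.2% of the data (the expected unique fraction of a
# bootstrap sample) typically has higher test error than the same model
# trained on all of it; this is the source of the OOB estimator's pessimism.
X, y = make_classification(n_samples=400, n_features=20, n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

model_full = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)
err_full = 1 - model_full.score(X_te, y_te)

rng = np.random.RandomState(0)
sub_errs = []
for _ in range(50):
    idx = rng.choice(len(y_tr), size=int(0.632 * len(y_tr)), replace=False)
    model_sub = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr[idx], y_tr[idx])
    sub_errs.append(1 - model_sub.score(X_te, y_te))

print(f"Test error, full training set : {err_full:.3f}")
print(f"Test error, 63.2% subsamples  : {np.mean(sub_errs):.3f} (typically higher)")
```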
| Estimator | Bias Direction | Bias Magnitude | Cause of Bias |
|---|---|---|---|
| Apparent Error (Training) | Optimistic (too low) | Large for flexible models | Same data for train + evaluate |
| OOB Error | Pessimistic (too high) | Moderate, depends on learning curve | Effective training size is ~0.632n |
| Leave-One-Out CV | Nearly unbiased | Small | Training size is n-1 ≈ n |
| .632 Bootstrap | Small, can be either | Much reduced compared to constituents | Optimal weighting cancels biases |
The .632 bootstrap estimator combines the apparent error and OOB error:
$$\widehat{\text{Err}}^{(.632)} = 0.368 \times \bar{\text{err}} + 0.632 \times \widehat{\text{Err}}^{(1)}$$
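As a quick illustration with made-up numbers: if a classifier has apparent error $\bar{\text{err}} = 0.05$ and OOB error $\widehat{\text{Err}}^{(1)} = 0.25$, then $\widehat{\text{Err}}^{(.632)} = 0.368 \times 0.05 + 0.632 \times 0.25 \approx 0.018 + 0.158 = 0.176$, a value between the two constituent estimates but much closer to the OOB error.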
Why these specific weights?
The weighting comes from the expected proportion of unique observations in a bootstrap sample. Recall that the probability an observation is not selected in any of $n$ draws is: $$P(\text{not in bootstrap}) = \left(1 - \frac{1}{n}\right)^n \to e^{-1} \approx 0.368$$
Thus, the expected proportion of unique observations in a bootstrap sample is: $$1 - e^{-1} \approx 0.632$$
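A quick empirical check of this limit (a minimal sketch assuming NumPy; the sample size and replication count are arbitrary):

```python
import numpy as np

# Empirical check of the 0.632 rule: the average fraction of unique
# observations in a bootstrap sample approaches 1 - 1/e ≈ 0.632.
rng = np.random.RandomState(0)
n = 200
fractions = [
    len(np.unique(rng.choice(n, size=n, replace=True))) / n
    for _ in range(5000)
]
print(f"Empirical unique fraction:    {np.mean(fractions):.4f}")
print(f"Theoretical 1 - (1 - 1/n)^n:  {1 - (1 - 1/n)**n:.4f}")
print(f"Limit 1 - 1/e:                {1 - np.exp(-1):.4f}")
```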
The .632 weights arise from viewing the estimator as an interpolation between two regimes: the apparent error, computed when the test point is part of the training data, and the OOB error, computed when it is not.
Think of it this way: across bootstrap samples, each observation appears in the training set about 63.2% of the time and is left out about 36.8% of the time. The OOB component is therefore based on models trained on roughly 63.2% of the data and is pessimistic, while the apparent error is optimistic. Weighting the OOB error by 0.632 and the apparent error by 0.368 pulls the pessimistic OOB estimate back toward the training error by an amount calibrated to this sampling fraction, approximately cancelling the two biases.
Formal Derivation (Efron's Approach):
Let $\omega = 0.632$ be the expected proportion of unique observations. A bootstrap training set includes a specific point $(x_i, y_i)$ with probability $\omega \approx 0.632$ and excludes it with probability $1 - \omega \approx 0.368$.
Under a linear approximation to the learning curve, if $\text{Err}(m)$ is the error of a model trained on $m$ observations:
$$\widehat{\text{Err}}^{(.632)} \approx \text{Err}(n) + \text{lower order terms}$$
The key insight is that the optimistic bias of the apparent error (from training-test overlap) is approximately offset by the pessimistic bias of the OOB error (from reduced training size) when combined with the 0.368/0.632 weights.
Mathematical Justification:
For a linear learning curve model $\text{Err}(m) = \text{Err}(\infty) + \gamma/m$, write the expected values of the two constituents as
$$E[\bar{\text{err}}] \approx \text{Err}(n) - \frac{\gamma_1}{n}, \qquad E[\widehat{\text{Err}}^{(1)}] \approx \text{Err}(0.632\,n) \approx \text{Err}(n) + \frac{\gamma_2}{n},$$
so that
$$E\big[\widehat{\text{Err}}^{(.632)}\big] = 0.368\,E[\bar{\text{err}}] + 0.632\,E[\widehat{\text{Err}}^{(1)}] \approx \text{Err}(n)$$
when $\gamma_1$ and $\gamma_2$ are chosen appropriately (that is, when $0.368\,\gamma_1 \approx 0.632\,\gamma_2$, so the two bias terms cancel).
Let's implement the .632 bootstrap estimator with full detail and diagnostics.
```python
import numpy as np
from collections import Counter
from sklearn.base import clone
from typing import Callable, Dict, Optional


def point_632_bootstrap(X: np.ndarray,
                        y: np.ndarray,
                        model,
                        n_bootstrap: int = 200,
                        loss_fn: Optional[Callable] = None,
                        random_state: int = 42) -> Dict:
    """
    Compute the .632 bootstrap error estimate.

    The .632 estimator is a weighted combination:
        Err_632 = 0.368 * apparent_error + 0.632 * OOB_error

    Parameters
    ----------
    X : np.ndarray
        Feature matrix of shape (n_samples, n_features)
    y : np.ndarray
        Target vector of shape (n_samples,)
    model : sklearn estimator
        Model with fit() and predict() methods
    n_bootstrap : int
        Number of bootstrap iterations
    loss_fn : Callable, optional
        Loss function(y_true, y_pred) -> float.
        Default: MSE for regression, 0-1 loss for classification
    random_state : int
        Random seed for reproducibility

    Returns
    -------
    Dict
        Comprehensive results including:
        - point_632_error: The .632 bootstrap estimate
        - apparent_error: Training error
        - oob_error: Out-of-bootstrap error
        - standard_error: Bootstrap SE of the .632 estimate
        - diagnostics: Additional debugging information
    """
    rng = np.random.RandomState(random_state)
    n_samples = len(y)

    # Determine if classification or regression
    is_classification = len(np.unique(y)) <= 20  # Heuristic

    if loss_fn is None:
        if is_classification:
            loss_fn = lambda y_true, y_pred: np.mean(y_true != y_pred)
        else:
            loss_fn = lambda y_true, y_pred: np.mean((y_true - y_pred) ** 2)

    # =========================================
    # Step 1: Compute Apparent Error
    # =========================================
    model_full = clone(model)
    model_full.fit(X, y)
    y_pred_train = model_full.predict(X)
    apparent_error = loss_fn(y, y_pred_train)

    # =========================================
    # Step 2: Compute OOB Error
    # =========================================
    # Track OOB predictions for each observation
    oob_predictions = {i: [] for i in range(n_samples)}
    bootstrap_oob_errors = []
    bootstrap_apparent_errors = []
    oob_counts = np.zeros(n_samples)

    for b in range(n_bootstrap):
        # Generate bootstrap indices (sample with replacement)
        bootstrap_indices = rng.choice(n_samples, size=n_samples, replace=True)

        # Identify in-bag and out-of-bag samples
        in_bag_set = set(bootstrap_indices)
        out_of_bag = [i for i in range(n_samples) if i not in in_bag_set]

        if len(out_of_bag) == 0:
            # Rare: all samples selected (skip this iteration)
            continue

        # Train on bootstrap sample
        X_boot = X[bootstrap_indices]
        y_boot = y[bootstrap_indices]
        model_b = clone(model)
        model_b.fit(X_boot, y_boot)

        # Apparent error for this bootstrap sample
        y_pred_boot = model_b.predict(X_boot)
        bootstrap_apparent_errors.append(loss_fn(y_boot, y_pred_boot))

        # OOB predictions
        X_oob = X[out_of_bag]
        y_oob_true = y[out_of_bag]
        y_oob_pred = model_b.predict(X_oob)

        # Store OOB predictions for each observation
        for i, obs_idx in enumerate(out_of_bag):
            oob_predictions[obs_idx].append(y_oob_pred[i])
            oob_counts[obs_idx] += 1

        # OOB error for this iteration
        bootstrap_oob_errors.append(loss_fn(y_oob_true, y_oob_pred))

    # =========================================
    # Step 3: Aggregate OOB Error
    # =========================================
    # For each observation, we have predictions from bootstrap samples where
    # it was excluded. Aggregate them (vote or average), then compute error.
    final_oob_predictions = []
    final_oob_true = []

    for i in range(n_samples):
        if len(oob_predictions[i]) > 0:
            if is_classification:
                # Majority vote
                votes = Counter(oob_predictions[i])
                final_oob_predictions.append(votes.most_common(1)[0][0])
            else:
                # Average predictions
                final_oob_predictions.append(np.mean(oob_predictions[i]))
            final_oob_true.append(y[i])

    if len(final_oob_true) == 0:
        raise ValueError("No OOB predictions available. Increase n_bootstrap.")

    oob_error = loss_fn(np.array(final_oob_true), np.array(final_oob_predictions))

    # =========================================
    # Step 4: Compute .632 Estimate
    # =========================================
    WEIGHT_OOB = 0.632
    WEIGHT_APPARENT = 0.368

    point_632_error = WEIGHT_APPARENT * apparent_error + WEIGHT_OOB * oob_error

    # =========================================
    # Step 5: Estimate Standard Error
    # =========================================
    # Spread of the per-replicate .632 errors across bootstrap iterations
    # (a rough measure of the estimate's variability)
    bootstrap_632_errors = [
        WEIGHT_APPARENT * app + WEIGHT_OOB * oob
        for app, oob in zip(bootstrap_apparent_errors, bootstrap_oob_errors)
    ]
    standard_error = np.std(bootstrap_632_errors, ddof=1)

    # =========================================
    # Diagnostics
    # =========================================
    mean_oob_count = np.mean(oob_counts)
    coverage = np.sum(oob_counts > 0) / n_samples
    theoretical_oob_fraction = (1 - 1 / n_samples) ** n_samples
    empirical_oob_fraction = np.mean([
        len([j for j in range(n_samples) if j not in set(
            rng.choice(n_samples, size=n_samples, replace=True)
        )]) / n_samples
        for _ in range(100)
    ])

    return {
        'point_632_error': point_632_error,
        'apparent_error': apparent_error,
        'oob_error': oob_error,
        'standard_error': standard_error,
        'diagnostics': {
            'n_bootstrap': n_bootstrap,
            'n_samples': n_samples,
            'observations_with_oob': int(np.sum(oob_counts > 0)),
            'coverage_fraction': coverage,
            'mean_oob_predictions_per_obs': mean_oob_count,
            'theoretical_oob_fraction': theoretical_oob_fraction,
            'empirical_oob_fraction': empirical_oob_fraction,
            'weight_apparent': WEIGHT_APPARENT,
            'weight_oob': WEIGHT_OOB,
        }
    }


# Demonstration
if __name__ == "__main__":
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score

    # Generate sample data
    X, y = make_classification(
        n_samples=200, n_features=20, n_informative=10,
        n_redundant=5, random_state=42
    )

    # Model: Decision Tree (flexible, prone to overfitting)
    model = DecisionTreeClassifier(max_depth=10, random_state=42)

    # .632 Bootstrap Error
    results = point_632_bootstrap(X, y, model, n_bootstrap=500)

    # Compare with cross-validation
    cv_scores = 1 - cross_val_score(model, X, y, cv=10, scoring='accuracy')
    cv_error = np.mean(cv_scores)
    cv_se = np.std(cv_scores) / np.sqrt(len(cv_scores))

    print("=" * 60)
    print(".632 Bootstrap Error Estimation")
    print("=" * 60)
    print(f"\n.632 Bootstrap Error: {results['point_632_error']:.4f} "
          f"(±{results['standard_error']:.4f})")
    print(f"Apparent Error: {results['apparent_error']:.4f}")
    print(f"OOB Error: {results['oob_error']:.4f}")
    print(f"\n10-Fold CV Error: {cv_error:.4f} (±{cv_se:.4f})")
    print("\nDiagnostics:")
    for key, value in results['diagnostics'].items():
        print(f"  {key}: {value}")
```

How does the .632 bootstrap compare to cross-validation, the most widely used alternative for error estimation?
Leave-One-Out Cross-Validation (LOOCV):
LOOCV trains on $n-1$ observations and tests on 1, repeating for each observation. It's nearly unbiased (training size is $n-1 \approx n$) but has high variance because the training sets overlap heavily—the $n$ different training sets share $n-2$ observations.
K-Fold Cross-Validation:
K-fold CV reduces variance by using larger test folds, but introduces slight pessimistic bias (training on $(k-1)/k$ of the data). The bias-variance tradeoff depends on the choice of $k$.
The .632 Bootstrap:
The .632 bootstrap has characteristics similar to 2-fold CV in terms of effective training set size (using ~63.2% of data in each bootstrap sample), but achieves lower variance through the combination of many bootstrap samples and the weighted estimator.
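For a concrete feel for these trade-offs, the short sketch below (illustrative only, assuming scikit-learn; the dataset and model are arbitrary) computes LOOCV and 10-fold CV error estimates on the same data, which can then be compared against the output of `point_632_bootstrap()` defined earlier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# LOOCV vs 10-fold CV error estimates for the same model and data.
X, y = make_classification(n_samples=150, n_features=20, n_informative=10, random_state=0)
model = DecisionTreeClassifier(max_depth=10, random_state=0)

loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut(), scoring='accuracy')
kfold_scores = cross_val_score(model, X, y, cv=10, scoring='accuracy')

print(f"LOOCV error:      {1 - loo_scores.mean():.4f}")
print(f"10-fold CV error: {1 - kfold_scores.mean():.4f}")
# The .632 estimate from point_632_bootstrap(X, y, model) could be added
# here for a three-way comparison.
```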
| Method | Bias | Variance | Computational Cost | Best For |
|---|---|---|---|---|
| LOOCV | Very low | High | O(n × training) | Small n, need nearly unbiased |
| 5-Fold CV | Moderate (pessimistic) | Moderate | O(5 × training) | General purpose |
| 10-Fold CV | Low | Low-Moderate | O(10 × training) | Standard choice |
| .632 Bootstrap | Low | Low | O(B × training) | Small n, need low variance |
| OOB Error (RF) | Low (pessimistic) | Low | Free (built-in) | Random Forests |
When .632 Bootstrap Excels:
Small sample sizes: With limited data (n < 100), the .632 bootstrap often provides more stable estimates than k-fold CV because it uses many bootstrap samples rather than a fixed number of folds.
High-dimensional settings: When p >> n (more features than observations), the .632 can provide more reliable estimates.
Complex error landscapes: When the error surface has high curvature (error changes rapidly with training set size), the bias correction in .632 is particularly valuable.
When .632 Bootstrap Struggles:
Overfitting models: For models that can dramatically overfit (memorize training data), the apparent error approaches zero. The weighted combination then suffers because one component is uninformative.
Very flexible models: Neural networks, deep ensembles, and other high-capacity models may achieve near-zero training error, making the .632 estimator overly optimistic.
This limitation motivated the development of the .632+ bootstrap, which we'll cover in the next page.
For a model that achieves apparent_error ≈ 0, the .632 estimator becomes: Err_632 ≈ 0.632 × OOB_error. This ignores all information from the training error component and can still be optimistically biased because the 0.632 weight was derived assuming a non-zero apparent error.
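A minimal illustration of this failure mode (a sketch with illustrative choices of dataset and model, assuming scikit-learn): a 1-nearest-neighbor classifier memorizes its training data, so its apparent error is exactly zero and the .632 estimate collapses to $0.632 \times$ OOB error.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# 1-NN memorizes its training set: apparent error is 0, so the .632 estimate
# reduces to 0.632 * OOB error and can remain optimistic.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=1)
knn = KNeighborsClassifier(n_neighbors=1)

apparent_error = 1 - knn.fit(X, y).score(X, y)  # exactly 0.0

# Crude OOB error from bootstrap resamples
rng = np.random.RandomState(1)
oob_errors = []
for _ in range(100):
    idx = rng.choice(len(y), size=len(y), replace=True)
    oob = np.setdiff1d(np.arange(len(y)), idx)
    model_b = KNeighborsClassifier(n_neighbors=1).fit(X[idx], y[idx])
    oob_errors.append(1 - model_b.score(X[oob], y[oob]))
oob_error = np.mean(oob_errors)

err_632 = 0.368 * apparent_error + 0.632 * oob_error
cv_error = 1 - cross_val_score(knn, X, y, cv=10).mean()

print(f"Apparent error: {apparent_error:.3f}")
print(f"OOB error:      {oob_error:.3f}")
print(f".632 estimate:  {err_632:.3f}   (vs 10-fold CV: {cv_error:.3f})")
```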
The .632 estimator is optimal under certain assumptions. Understanding these helps us recognize when the estimator may break down.
Key Assumptions:
Linear Learning Curve: The expected error decreases linearly with the reciprocal of training set size: $\text{Err}(m) = \alpha + \beta/m$. This is a reasonable approximation for many models in the moderate-$n$ regime (a quick empirical check is sketched after this list).
Stable Optimism: The optimism (difference between true error and apparent error) is approximately constant across bootstrap samples.
i.i.d. Data: Observations are independent and identically distributed. Violations (time series, clustered data) require modified bootstrap procedures.
Moderate Overfitting: The apparent error is bounded away from zero. Extreme overfitting violates this assumption.
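As referenced under the first assumption, here is a minimal empirical check of learning-curve linearity (illustrative only; the dataset, model, and subsample sizes are arbitrary): estimate holdout error at several training sizes $m$ and fit $\text{Err}(m) = a + b/m$ by least squares.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Check whether holdout error is roughly linear in 1/m over a range of
# training sizes m, as the .632 derivation assumes.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10, random_state=0)
X_pool, X_te, y_pool, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

rng = np.random.RandomState(0)
sizes = [50, 100, 200, 400, 800]
errors = []
for m in sizes:
    errs_m = []
    for _ in range(20):  # average over random subsamples of size m
        idx = rng.choice(len(y_pool), size=m, replace=False)
        clf = LogisticRegression(max_iter=1000).fit(X_pool[idx], y_pool[idx])
        errs_m.append(1 - clf.score(X_te, y_te))
    errors.append(np.mean(errs_m))

# Least-squares fit of Err(m) = a + b/m
A = np.column_stack([np.ones(len(sizes)), 1 / np.array(sizes, dtype=float)])
(a, b), *_ = np.linalg.lstsq(A, np.array(errors), rcond=None)
fitted = A @ np.array([a, b])
print("m, observed error, fitted a + b/m:")
for m, e, f in zip(sizes, errors, fitted):
    print(f"  {m:4d}  {e:.4f}  {f:.4f}")
```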
Asymptotic Properties:
Under regularity conditions, the .632 estimator is consistent: $$\widehat{\text{Err}}^{(.632)} \xrightarrow{P} \text{Err} \text{ as } n \to \infty$$
The convergence rate depends on the model complexity and the sample size. For parametric models with fixed complexity, the .632 estimator achieves the optimal rate of $O(n^{-1/2})$ for the estimation error.
Variance Reduction:
Compared to leave-one-out CV, the .632 bootstrap typically has lower variance because: (1) the bootstrap training sets are less similar to one another than the nearly identical LOOCV training sets, so the individual error evaluations are less strongly correlated; (2) each observation's OOB prediction is aggregated over many bootstrap models, which smooths out individual noisy predictions; (3) the apparent-error component of the weighted combination itself has very low variance.
However, the variance reduction comes at the cost of potential bias, particularly for overfit models.
When applying the .632 bootstrap in practice, several considerations affect the quality of estimates.
```python
import numpy as np


def diagnose_632_reliability(results: dict) -> dict:
    """
    Diagnose potential issues with a .632 bootstrap estimate.

    Parameters
    ----------
    results : dict
        Output from point_632_bootstrap()

    Returns
    -------
    dict
        Diagnostic assessment with warnings and recommendations
    """
    diagnostics = []
    warnings = []
    is_reliable = True

    apparent = results['apparent_error']
    oob = results['oob_error']
    point_632 = results['point_632_error']

    # Check 1: Apparent error near zero
    if apparent < 0.01:
        warnings.append(
            "WARNING: Apparent error near zero. "
            "Model may be overfitting. Consider .632+ estimator."
        )
        is_reliable = False

    # Check 2: Large gap between apparent and OOB
    gap = oob - apparent
    relative_gap = gap / (oob + 1e-10)
    if relative_gap > 0.5:
        warnings.append(
            f"WARNING: Large gap between apparent ({apparent:.4f}) and "
            f"OOB ({oob:.4f}) error. High model variance or overfitting."
        )

    # Check 3: OOB error much higher than .632
    if oob > 1.5 * point_632:
        diagnostics.append(
            "NOTE: OOB error is significantly higher than .632 estimate. "
            "Learning curve may be steep."
        )

    # Check 4: Coverage
    coverage = results['diagnostics'].get('coverage_fraction', 1.0)
    if coverage < 0.95:
        warnings.append(
            f"WARNING: Only {coverage*100:.1f}% of observations have OOB predictions. "
            f"Increase n_bootstrap or check for data issues."
        )
        is_reliable = False

    # Check 5: Mean OOB predictions per observation
    mean_oob_preds = results['diagnostics'].get('mean_oob_predictions_per_obs', 0)
    if mean_oob_preds < 20:
        diagnostics.append(
            f"NOTE: Mean {mean_oob_preds:.1f} OOB predictions per observation. "
            f"Consider increasing n_bootstrap for more stable aggregation."
        )

    # Reliability assessment
    reliability_score = 1.0
    if apparent < 0.01:
        reliability_score -= 0.3
    if relative_gap > 0.5:
        reliability_score -= 0.2
    if coverage < 0.95:
        reliability_score -= 0.3

    if reliability_score >= 0.8:
        recommendation = "RECOMMEND: .632 estimate appears reliable."
    elif reliability_score >= 0.5:
        recommendation = "RECOMMEND: Use .632 with caution. Consider .632+ or CV."
    else:
        recommendation = "RECOMMEND: .632 may be unreliable. Use .632+ or CV instead."

    return {
        'is_reliable': is_reliable,
        'reliability_score': reliability_score,
        'warnings': warnings,
        'diagnostics': diagnostics,
        'recommendation': recommendation,
        'raw_metrics': {
            'apparent_error': apparent,
            'oob_error': oob,
            'gap': gap,
            'relative_gap': relative_gap,
            'coverage': coverage,
        }
    }


# Example usage
if __name__ == "__main__":
    # Simulate results for an overfit model
    overfit_results = {
        'point_632_error': 0.15,
        'apparent_error': 0.001,  # Near-zero training error
        'oob_error': 0.24,
        'standard_error': 0.02,
        'diagnostics': {
            'coverage_fraction': 0.98,
            'mean_oob_predictions_per_obs': 35.2,
        }
    }

    diagnosis = diagnose_632_reliability(overfit_results)

    print("Reliability Diagnosis")
    print("=" * 50)
    print(f"Reliability Score: {diagnosis['reliability_score']:.2f}")
    print(f"Is Reliable: {diagnosis['is_reliable']}")
    print(f"\n{diagnosis['recommendation']}")

    if diagnosis['warnings']:
        print("\nWarnings:")
        for w in diagnosis['warnings']:
            print(f"  - {w}")
```

Random Forests provide out-of-bag error estimates 'for free' as a byproduct of the bagging procedure. This OOB error is closely related to—but not identical to—the bootstrap OOB error we've discussed.
How Random Forest OOB Works: each tree in the forest is trained on its own bootstrap sample of the data. For every observation, the trees whose bootstrap samples excluded it (roughly 36.8% of the trees) make predictions; these are aggregated by majority vote (classification) or averaging (regression), and the OOB error is the loss of these aggregated predictions against the true responses.
Key Difference from .632 Bootstrap:
Random Forest OOB uses aggregated predictions from multiple trees, each trained on different bootstrap samples. This aggregation reduces variance compared to using a single model per bootstrap (as in standard .632 bootstrap).
For Random Forests specifically, the OOB error is essentially free (it reuses the trees already grown during bagging), is based on predictions aggregated over the roughly one third of trees that excluded each observation, and in practice tracks the forest's test error closely, with at most a mild pessimistic bias.
For Random Forests and gradient boosting with bagging (like some XGBoost configurations), use the built-in OOB error—it's free and well-calibrated. For single models (logistic regression, SVM, single decision tree, neural networks), use .632 or .632+ bootstrap when you need stable error estimates from limited data.
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score


def compare_rf_oob_with_632(X: np.ndarray, y: np.ndarray,
                            n_estimators: int = 100) -> dict:
    """
    Compare Random Forest OOB error with .632 bootstrap.

    Demonstrates that RF OOB is similar to .632 OOB but benefits
    from tree aggregation.
    """
    # Random Forest with OOB scoring enabled
    rf = RandomForestClassifier(
        n_estimators=n_estimators,
        oob_score=True,  # Enable OOB scoring
        random_state=42,
        n_jobs=-1
    )
    rf.fit(X, y)

    # RF OOB error
    rf_oob_error = 1 - rf.oob_score_

    # Training error
    train_error = 1 - rf.score(X, y)

    # Cross-validation for comparison
    cv_scores = cross_val_score(
        RandomForestClassifier(n_estimators=n_estimators, random_state=42),
        X, y, cv=10, scoring='accuracy'
    )
    cv_error = 1 - np.mean(cv_scores)

    # Theoretical .632 (using RF OOB as the OOB component)
    point_632_approx = 0.368 * train_error + 0.632 * rf_oob_error

    return {
        'rf_oob_error': rf_oob_error,
        'rf_train_error': train_error,
        'cv_10fold_error': cv_error,
        'point_632_using_rf_oob': point_632_approx,
        'note': (
            "RF OOB is typically preferred for Random Forests. "
            "The .632 combination with RF OOB isn't standard because "
            "RF OOB already has low bias."
        )
    }


if __name__ == "__main__":
    # Generate sample data
    X, y = make_classification(
        n_samples=500, n_features=20, n_informative=10, random_state=42
    )

    results = compare_rf_oob_with_632(X, y)

    print("Random Forest Error Estimation Comparison")
    print("=" * 50)
    print(f"RF OOB Error: {results['rf_oob_error']:.4f}")
    print(f"RF Training Error: {results['rf_train_error']:.4f}")
    print(f"10-Fold CV Error: {results['cv_10fold_error']:.4f}")
    print(f".632 (using RF OOB): {results['point_632_using_rf_oob']:.4f}")
```

The .632 bootstrap provides a principled approach to combining biased estimators into a less biased composite. By understanding the sources of bias in apparent and OOB errors, and using weights derived from the fundamental properties of bootstrap sampling, we achieve more accurate generalization error estimates.
What's Next:
The .632 bootstrap has a significant limitation: it can be overly optimistic for models that overfit dramatically (apparent error ≈ 0). In the next page, we'll study the .632+ bootstrap, which addresses this limitation by incorporating a 'no-information' error rate that accounts for the amount of overfitting.
You now understand the .632 bootstrap—its derivation, implementation, strengths, and limitations. This weighted estimator represents a cornerstone of bootstrap error estimation, providing the foundation for understanding the more sophisticated .632+ variant.