When estimating a model's generalization error using the bootstrap, we face a fundamental tension between two biased estimators:
The Apparent Error (Training Error): Evaluating a model on its training data yields an optimistically biased estimate. The model has been fitted to minimize error on this exact data, so it performs better on training data than on new data. For flexible models that can interpolate, the apparent error can approach zero even when true generalization error is substantial.
The Out-of-Bootstrap (OOB) Error: Evaluating each observation only when it's excluded from the bootstrap sample yields a pessimistically biased estimate. Each bootstrap sample contains only about 63.2% of unique observations (on average), so models are trained on effectively smaller samples than the full dataset. This upward bias in error can be substantial for complex models or small datasets.
The .632 bootstrap elegantly resolves this tension through a weighted combination that cancels the biases.
By the end of this page, you will understand: (1) The mathematical derivation of the .632 weighting; (2) Why this specific combination reduces bias; (3) The assumptions under which .632 is optimal; (4) Detailed implementation for machine learning model evaluation; (5) When .632 works well and when it fails.
To understand the .632 bootstrap, we must first carefully characterize the biases in our constituent estimators.
Notation: $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ denotes the observed training data of size $n$, $\hat{f}^D$ the model fit to $D$, $L(y, \hat{y})$ the loss function, and $\text{Err}$ the true generalization error of $\hat{f}^D$ on new observations drawn from the same distribution.
The Apparent Error:
The apparent error is defined as: $$\bar{\text{err}} = \frac{1}{n} \sum_{i=1}^{n} L(y_i, \hat{f}^D(x_i))$$
where $L$ is the loss function. This evaluates on the same data used for training, yielding: $$E[\bar{\text{err}}] < \text{Err}$$
The downward bias is called the optimism. Efron showed that: $$\text{optimism} = \text{Err} - E[\bar{\text{err}}] = \frac{2}{n} \sum_{i=1}^{n} \text{Cov}(y_i, \hat{f}^D(x_i))$$
This covariance is positive because the model's predictions at $x_i$ are influenced by the response $y_i$—particularly for flexible models.
The optimism is directly related to the effective degrees of freedom (model complexity). For linear regression with p predictors, the optimism equals 2p/n times the noise variance. More complex models have higher optimism—they fit the noise more aggressively.
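To make the linear-regression case concrete, here is a short sketch under the standard homoskedastic-noise assumption, writing $H = X(X^\top X)^{-1}X^\top$ for the hat matrix (so $\hat{y} = Hy$; this notation is introduced only for this calculation):
$$\sum_{i=1}^{n} \text{Cov}(y_i, \hat{f}^D(x_i)) = \sum_{i=1}^{n} H_{ii}\,\sigma^2 = \sigma^2 \operatorname{tr}(H) = p\,\sigma^2, \qquad \text{so} \qquad \text{optimism} = \frac{2\,p\,\sigma^2}{n}.$$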
The Out-of-Bootstrap Error:
The leave-one-out bootstrap error is: $$\widehat{\text{Err}}^{(1)} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{|C^{-i}|} \sum_{b \in C^{-i}} L(y_i, \hat{f}^{*b}(x_i))$$
where $C^{-i}$ is the set of bootstrap samples that do not contain observation $i$, and $\hat{f}^{*b}$ is the model trained on bootstrap sample $b$.
This estimator is nearly unbiased for the error rate of a model trained on a sample of size $n(1 - e^{-1}) \approx 0.632n$—not size $n$. Since learning curves typically decrease with training set size, this gives: $$E[\widehat{\text{Err}}^{(1)}] > \text{Err}$$
The upward bias occurs because we're effectively estimating error for a smaller training set than we actually have.
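This effect is easy to see directly. The sketch below is illustrative only (the dataset, model, and the 63.2% subsampling fraction are chosen for demonstration, assuming scikit-learn is available): it trains the same model on a full training split and on random 63.2% subsamples, then compares holdout error.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A model trained on ~63.2% of the data (the expected unique fraction of a
# bootstrap sample) typically has higher test error than the same model
# trained on all of it; this is the source of the OOB estimator's pessimism.
X, y = make_classification(n_samples=400, n_features=20, n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

model_full = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)
err_full = 1 - model_full.score(X_te, y_te)

rng = np.random.RandomState(0)
sub_errs = []
for _ in range(50):
    idx = rng.choice(len(y_tr), size=int(0.632 * len(y_tr)), replace=False)
    model_sub = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr[idx], y_tr[idx])
    sub_errs.append(1 - model_sub.score(X_te, y_te))

print(f"Test error, full training set : {err_full:.3f}")
print(f"Test error, 63.2% subsamples  : {np.mean(sub_errs):.3f} (typically higher)")
```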
| Estimator | Bias Direction | Bias Magnitude | Cause of Bias |
|---|---|---|---|
| Apparent Error (Training) | Optimistic (too low) | Large for flexible models | Same data for train + evaluate |
| OOB Error | Pessimistic (too high) | Moderate, depends on learning curve | Effective training size is ~0.632n |
| Leave-One-Out CV | Nearly unbiased | Small | Training size is n-1 ≈ n |
| .632 Bootstrap | Small, can be either | Much reduced compared to constituents | Optimal weighting cancels biases |
The .632 bootstrap estimator combines the apparent error and OOB error:
$$\widehat{\text{Err}}^{(.632)} = 0.368 \times \bar{\text{err}} + 0.632 \times \widehat{\text{Err}}^{(1)}$$
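As a quick illustration with made-up numbers: if a classifier has apparent error $\bar{\text{err}} = 0.05$ and OOB error $\widehat{\text{Err}}^{(1)} = 0.25$, then $\widehat{\text{Err}}^{(.632)} = 0.368 \times 0.05 + 0.632 \times 0.25 \approx 0.018 + 0.158 = 0.176$, a value between the two constituent estimates but much closer to the OOB error.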
Why these specific weights?
The weighting comes from the expected proportion of unique observations in a bootstrap sample. Recall that the probability an observation is not selected in any of $n$ draws is: $$P(\text{not in bootstrap}) = \left(1 - \frac{1}{n}\right)^n \to e^{-1} \approx 0.368$$
Thus, the expected proportion of unique observations in a bootstrap sample is: $$1 - e^{-1} \approx 0.632$$
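A quick empirical check of this limit (a minimal sketch assuming NumPy; the sample size and replication count are arbitrary):

```python
import numpy as np

# Empirical check of the 0.632 rule: the average fraction of unique
# observations in a bootstrap sample approaches 1 - 1/e ≈ 0.632.
rng = np.random.RandomState(0)
n = 200
fractions = [
    len(np.unique(rng.choice(n, size=n, replace=True))) / n
    for _ in range(5000)
]
print(f"Empirical unique fraction:    {np.mean(fractions):.4f}")
print(f"Theoretical 1 - (1 - 1/n)^n:  {1 - (1 - 1/n)**n:.4f}")
print(f"Limit 1 - 1/e:                {1 - np.exp(-1):.4f}")
```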
The .632 weights arise from viewing the estimator as an interpolation between two regimes: the apparent error, computed when the test point is part of the training data, and the OOB error, computed when it is not.
Think of it this way: across bootstrap samples, each observation appears in the training set about 63.2% of the time and is left out about 36.8% of the time. The OOB component is therefore based on models trained on roughly 63.2% of the data and is pessimistic, while the apparent error is optimistic. Weighting the OOB error by 0.632 and the apparent error by 0.368 pulls the pessimistic OOB estimate back toward the training error by an amount calibrated to this sampling fraction, approximately cancelling the two biases.
Formal Derivation (Efron's Approach):
Let $\omega = 0.632$ be the expected proportion of unique observations. A bootstrap training set includes a specific point $(x_i, y_i)$ with probability $\omega \approx 0.632$ and excludes it with probability $1 - \omega \approx 0.368$.
Under a linear approximation to the learning curve, if $\text{Err}(m)$ is the error of a model trained on $m$ observations:
$$\widehat{\text{Err}}^{(.632)} \approx \text{Err}(n) + \text{lower order terms}$$
The key insight is that the optimistic bias of the apparent error (from training-test overlap) is approximately offset by the pessimistic bias of the OOB error (from reduced training size) when combined with the 0.368/0.632 weights.
Mathematical Justification:
For a linear learning curve model $\text{Err}(m) = \text{Err}(\infty) + \gamma/m$, write the expected values of the two constituents as
$$E[\bar{\text{err}}] \approx \text{Err}(n) - \frac{\gamma_1}{n}, \qquad E[\widehat{\text{Err}}^{(1)}] \approx \text{Err}(0.632\,n) \approx \text{Err}(n) + \frac{\gamma_2}{n},$$
so that
$$E\big[\widehat{\text{Err}}^{(.632)}\big] = 0.368\,E[\bar{\text{err}}] + 0.632\,E[\widehat{\text{Err}}^{(1)}] \approx \text{Err}(n)$$
when $\gamma_1$ and $\gamma_2$ are chosen appropriately (that is, when $0.368\,\gamma_1 \approx 0.632\,\gamma_2$, so the two bias terms cancel).
Let's implement the .632 bootstrap estimator with full detail and diagnostics.
```python
import numpy as np
from collections import Counter
from sklearn.base import clone
from typing import Callable, Dict, Optional


def point_632_bootstrap(X: np.ndarray,
                        y: np.ndarray,
                        model,
                        n_bootstrap: int = 200,
                        loss_fn: Optional[Callable] = None,
                        random_state: int = 42) -> Dict:
    """
    Compute the .632 bootstrap error estimate.

    The .632 estimator is a weighted combination:
        Err_632 = 0.368 * apparent_error + 0.632 * OOB_error

    Parameters
    ----------
    X : np.ndarray
        Feature matrix of shape (n_samples, n_features)
    y : np.ndarray
        Target vector of shape (n_samples,)
    model : sklearn estimator
        Model with fit() and predict() methods
    n_bootstrap : int
        Number of bootstrap iterations
    loss_fn : Callable, optional
        Loss function(y_true, y_pred) -> float.
        Default: MSE for regression, 0-1 loss for classification
    random_state : int
        Random seed for reproducibility

    Returns
    -------
    Dict
        Comprehensive results including:
        - point_632_error: The .632 bootstrap estimate
        - apparent_error: Training error
        - oob_error: Out-of-bootstrap error
        - standard_error: Bootstrap SE of the .632 estimate
        - diagnostics: Additional debugging information
    """
    rng = np.random.RandomState(random_state)
    n_samples = len(y)

    # Determine if classification or regression
    is_classification = len(np.unique(y)) <= 20  # Heuristic

    if loss_fn is None:
        if is_classification:
            loss_fn = lambda y_true, y_pred: np.mean(y_true != y_pred)
        else:
            loss_fn = lambda y_true, y_pred: np.mean((y_true - y_pred) ** 2)

    # =========================================
    # Step 1: Compute Apparent Error
    # =========================================
    model_full = clone(model)
    model_full.fit(X, y)
    y_pred_train = model_full.predict(X)
    apparent_error = loss_fn(y, y_pred_train)

    # =========================================
    # Step 2: Compute OOB Error
    # =========================================
    # Track OOB predictions for each observation
    oob_predictions = {i: [] for i in range(n_samples)}
    bootstrap_oob_errors = []
    bootstrap_apparent_errors = []
    oob_counts = np.zeros(n_samples)

    for b in range(n_bootstrap):
        # Generate bootstrap indices (sample with replacement)
        bootstrap_indices = rng.choice(n_samples, size=n_samples, replace=True)

        # Identify in-bag and out-of-bag samples
        in_bag_set = set(bootstrap_indices)
        out_of_bag = [i for i in range(n_samples) if i not in in_bag_set]

        if len(out_of_bag) == 0:
            # Rare: all samples selected (skip this iteration)
            continue

        # Train on bootstrap sample
        X_boot = X[bootstrap_indices]
        y_boot = y[bootstrap_indices]
        model_b = clone(model)
        model_b.fit(X_boot, y_boot)

        # Apparent error for this bootstrap sample
        y_pred_boot = model_b.predict(X_boot)
        bootstrap_apparent_errors.append(loss_fn(y_boot, y_pred_boot))

        # OOB predictions
        X_oob = X[out_of_bag]
        y_oob_true = y[out_of_bag]
        y_oob_pred = model_b.predict(X_oob)

        # Store OOB predictions for each observation
        for i, obs_idx in enumerate(out_of_bag):
            oob_predictions[obs_idx].append(y_oob_pred[i])
            oob_counts[obs_idx] += 1

        # OOB error for this iteration
        bootstrap_oob_errors.append(loss_fn(y_oob_true, y_oob_pred))

    # =========================================
    # Step 3: Aggregate OOB Error
    # =========================================
    # For each observation, we have predictions from bootstrap samples where
    # it was excluded. Aggregate them (vote or average), then compute error.
    final_oob_predictions = []
    final_oob_true = []

    for i in range(n_samples):
        if len(oob_predictions[i]) > 0:
            if is_classification:
                # Majority vote
                votes = Counter(oob_predictions[i])
                final_oob_predictions.append(votes.most_common(1)[0][0])
            else:
                # Average predictions
                final_oob_predictions.append(np.mean(oob_predictions[i]))
            final_oob_true.append(y[i])

    if len(final_oob_true) == 0:
        raise ValueError("No OOB predictions available. Increase n_bootstrap.")

    oob_error = loss_fn(np.array(final_oob_true), np.array(final_oob_predictions))

    # =========================================
    # Step 4: Compute .632 Estimate
    # =========================================
    WEIGHT_OOB = 0.632
    WEIGHT_APPARENT = 0.368

    point_632_error = WEIGHT_APPARENT * apparent_error + WEIGHT_OOB * oob_error

    # =========================================
    # Step 5: Estimate Standard Error
    # =========================================
    # Spread of the per-replicate .632 errors across bootstrap iterations
    # (a rough measure of the estimate's variability)
    bootstrap_632_errors = [
        WEIGHT_APPARENT * app + WEIGHT_OOB * oob
        for app, oob in zip(bootstrap_apparent_errors, bootstrap_oob_errors)
    ]
    standard_error = np.std(bootstrap_632_errors, ddof=1)

    # =========================================
    # Diagnostics
    # =========================================
    mean_oob_count = np.mean(oob_counts)
    coverage = np.sum(oob_counts > 0) / n_samples
    theoretical_oob_fraction = (1 - 1 / n_samples) ** n_samples
    empirical_oob_fraction = np.mean([
        len([j for j in range(n_samples) if j not in set(
            rng.choice(n_samples, size=n_samples, replace=True)
        )]) / n_samples
        for _ in range(100)
    ])

    return {
        'point_632_error': point_632_error,
        'apparent_error': apparent_error,
        'oob_error': oob_error,
        'standard_error': standard_error,
        'diagnostics': {
            'n_bootstrap': n_bootstrap,
            'n_samples': n_samples,
            'observations_with_oob': int(np.sum(oob_counts > 0)),
            'coverage_fraction': coverage,
            'mean_oob_predictions_per_obs': mean_oob_count,
            'theoretical_oob_fraction': theoretical_oob_fraction,
            'empirical_oob_fraction': empirical_oob_fraction,
            'weight_apparent': WEIGHT_APPARENT,
            'weight_oob': WEIGHT_OOB,
        }
    }


# Demonstration
if __name__ == "__main__":
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score

    # Generate sample data
    X, y = make_classification(
        n_samples=200, n_features=20, n_informative=10,
        n_redundant=5, random_state=42
    )

    # Model: Decision Tree (flexible, prone to overfitting)
    model = DecisionTreeClassifier(max_depth=10, random_state=42)

    # .632 Bootstrap Error
    results = point_632_bootstrap(X, y, model, n_bootstrap=500)

    # Compare with cross-validation
    cv_scores = 1 - cross_val_score(model, X, y, cv=10, scoring='accuracy')
    cv_error = np.mean(cv_scores)
    cv_se = np.std(cv_scores) / np.sqrt(len(cv_scores))

    print("=" * 60)
    print(".632 Bootstrap Error Estimation")
    print("=" * 60)
    print(f"\n.632 Bootstrap Error: {results['point_632_error']:.4f} "
          f"(±{results['standard_error']:.4f})")
    print(f"Apparent Error: {results['apparent_error']:.4f}")
    print(f"OOB Error: {results['oob_error']:.4f}")
    print(f"\n10-Fold CV Error: {cv_error:.4f} (±{cv_se:.4f})")
    print("\nDiagnostics:")
    for key, value in results['diagnostics'].items():
        print(f"  {key}: {value}")
```

How does the .632 bootstrap compare to cross-validation, the most widely used alternative for error estimation?
Leave-One-Out Cross-Validation (LOOCV):
LOOCV trains on $n-1$ observations and tests on 1, repeating for each observation. It's nearly unbiased (training size is $n-1 \approx n$) but has high variance because the training sets overlap heavily—the $n$ different training sets share $n-2$ observations.
K-Fold Cross-Validation:
K-fold CV reduces variance by using larger test folds, but introduces slight pessimistic bias (training on $(k-1)/k$ of the data). The bias-variance tradeoff depends on the choice of $k$.
The .632 Bootstrap:
The .632 bootstrap has characteristics similar to 2-fold CV in terms of effective training set size (using ~63.2% of data in each bootstrap sample), but achieves lower variance through the combination of many bootstrap samples and the weighted estimator.
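For a concrete feel for these trade-offs, the short sketch below (illustrative only, assuming scikit-learn; the dataset and model are arbitrary) computes LOOCV and 10-fold CV error estimates on the same data, which can then be compared against the output of `point_632_bootstrap()` defined earlier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# LOOCV vs 10-fold CV error estimates for the same model and data.
X, y = make_classification(n_samples=150, n_features=20, n_informative=10, random_state=0)
model = DecisionTreeClassifier(max_depth=10, random_state=0)

loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut(), scoring='accuracy')
kfold_scores = cross_val_score(model, X, y, cv=10, scoring='accuracy')

print(f"LOOCV error:      {1 - loo_scores.mean():.4f}")
print(f"10-fold CV error: {1 - kfold_scores.mean():.4f}")
# The .632 estimate from point_632_bootstrap(X, y, model) could be added
# here for a three-way comparison.
```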
| Method | Bias | Variance | Computational Cost | Best For |
|---|---|---|---|---|
| LOOCV | Very low | High | O(n × training) | Small n, need nearly unbiased |
| 5-Fold CV | Moderate (pessimistic) | Moderate | O(5 × training) | General purpose |
| 10-Fold CV | Low | Low-Moderate | O(10 × training) | Standard choice |
| .632 Bootstrap | Low | Low | O(B × training) | Small n, need low variance |
| OOB Error (RF) | Low (pessimistic) | Low | Free (built-in) | Random Forests |
When .632 Bootstrap Excels:
Small sample sizes: With limited data (n < 100), the .632 bootstrap often provides more stable estimates than k-fold CV because it uses many bootstrap samples rather than a fixed number of folds.
High-dimensional settings: When p >> n (more features than observations), the .632 can provide more reliable estimates.
Complex error landscapes: When the error surface has high curvature (error changes rapidly with training set size), the bias correction in .632 is particularly valuable.
When .632 Bootstrap Struggles:
Overfitting models: For models that can dramatically overfit (memorize training data), the apparent error approaches zero. The weighted combination then suffers because one component is uninformative.
Very flexible models: Neural networks, deep ensembles, and other high-capacity models may achieve near-zero training error, making the .632 estimator overly optimistic.
This limitation motivated the development of the .632+ bootstrap, which we'll cover in the next page.
For a model that achieves apparent_error ≈ 0, the .632 estimator becomes: Err_632 ≈ 0.632 × OOB_error. This ignores all information from the training error component and can still be optimistically biased because the 0.632 weight was derived assuming a non-zero apparent error.
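A minimal illustration of this failure mode (a sketch with illustrative choices of dataset and model, assuming scikit-learn): a 1-nearest-neighbor classifier memorizes its training data, so its apparent error is exactly zero and the .632 estimate collapses to $0.632 \times$ OOB error.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# 1-NN memorizes its training set: apparent error is 0, so the .632 estimate
# reduces to 0.632 * OOB error and can remain optimistic.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=1)
knn = KNeighborsClassifier(n_neighbors=1)

apparent_error = 1 - knn.fit(X, y).score(X, y)  # exactly 0.0

# Crude OOB error from bootstrap resamples
rng = np.random.RandomState(1)
oob_errors = []
for _ in range(100):
    idx = rng.choice(len(y), size=len(y), replace=True)
    oob = np.setdiff1d(np.arange(len(y)), idx)
    model_b = KNeighborsClassifier(n_neighbors=1).fit(X[idx], y[idx])
    oob_errors.append(1 - model_b.score(X[oob], y[oob]))
oob_error = np.mean(oob_errors)

err_632 = 0.368 * apparent_error + 0.632 * oob_error
cv_error = 1 - cross_val_score(knn, X, y, cv=10).mean()

print(f"Apparent error: {apparent_error:.3f}")
print(f"OOB error:      {oob_error:.3f}")
print(f".632 estimate:  {err_632:.3f}   (vs 10-fold CV: {cv_error:.3f})")
```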
The .632 estimator is optimal under certain assumptions. Understanding these helps us recognize when the estimator may break down.
Key Assumptions:
Linear Learning Curve: The expected error decreases linearly with the reciprocal of training set size: $\text{Err}(m) = \alpha + \beta/m$. This is a reasonable approximation for many models in the moderate-$n$ regime (a quick empirical check is sketched after this list).
Stable Optimism: The optimism (difference between true error and apparent error) is approximately constant across bootstrap samples.
i.i.d. Data: Observations are independent and identically distributed. Violations (time series, clustered data) require modified bootstrap procedures.
Moderate Overfitting: The apparent error is bounded away from zero. Extreme overfitting violates this assumption.
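As referenced under the first assumption, here is a minimal empirical check of learning-curve linearity (illustrative only; the dataset, model, and subsample sizes are arbitrary): estimate holdout error at several training sizes $m$ and fit $\text{Err}(m) = a + b/m$ by least squares.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Check whether holdout error is roughly linear in 1/m over a range of
# training sizes m, as the .632 derivation assumes.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10, random_state=0)
X_pool, X_te, y_pool, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

rng = np.random.RandomState(0)
sizes = [50, 100, 200, 400, 800]
errors = []
for m in sizes:
    errs_m = []
    for _ in range(20):  # average over random subsamples of size m
        idx = rng.choice(len(y_pool), size=m, replace=False)
        clf = LogisticRegression(max_iter=1000).fit(X_pool[idx], y_pool[idx])
        errs_m.append(1 - clf.score(X_te, y_te))
    errors.append(np.mean(errs_m))

# Least-squares fit of Err(m) = a + b/m
A = np.column_stack([np.ones(len(sizes)), 1 / np.array(sizes, dtype=float)])
(a, b), *_ = np.linalg.lstsq(A, np.array(errors), rcond=None)
fitted = A @ np.array([a, b])
print("m, observed error, fitted a + b/m:")
for m, e, f in zip(sizes, errors, fitted):
    print(f"  {m:4d}  {e:.4f}  {f:.4f}")
```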
Asymptotic Properties:
Under regularity conditions, the .632 estimator is consistent: $$\widehat{\text{Err}}^{(.632)} \xrightarrow{P} \text{Err} \text{ as } n \to \infty$$
The convergence rate depends on the model complexity and the sample size. For parametric models with fixed complexity, the .632 estimator achieves the optimal rate of $O(n^{-1/2})$ for the estimation error.
Variance Reduction:
Compared to leave-one-out CV, the .632 bootstrap typically has lower variance because: (1) the bootstrap training sets are less similar to one another than the nearly identical LOOCV training sets, so the individual error evaluations are less strongly correlated; (2) each observation's OOB prediction is aggregated over many bootstrap models, which smooths out individual noisy predictions; (3) the apparent-error component of the weighted combination itself has very low variance.
However, the variance reduction comes at the cost of potential bias, particularly for overfit models.
When applying the .632 bootstrap in practice, several considerations affect the quality of estimates.
```python
import numpy as np


def diagnose_632_reliability(results: dict) -> dict:
    """
    Diagnose potential issues with a .632 bootstrap estimate.

    Parameters
    ----------
    results : dict
        Output from point_632_bootstrap()

    Returns
    -------
    dict
        Diagnostic assessment with warnings and recommendations
    """
    diagnostics = []
    warnings = []
    is_reliable = True

    apparent = results['apparent_error']
    oob = results['oob_error']
    point_632 = results['point_632_error']

    # Check 1: Apparent error near zero
    if apparent < 0.01:
        warnings.append(
            "WARNING: Apparent error near zero. "
            "Model may be overfitting. Consider .632+ estimator."
        )
        is_reliable = False

    # Check 2: Large gap between apparent and OOB
    gap = oob - apparent
    relative_gap = gap / (oob + 1e-10)
    if relative_gap > 0.5:
        warnings.append(
            f"WARNING: Large gap between apparent ({apparent:.4f}) and "
            f"OOB ({oob:.4f}) error. High model variance or overfitting."
        )

    # Check 3: OOB error much higher than .632
    if oob > 1.5 * point_632:
        diagnostics.append(
            "NOTE: OOB error is significantly higher than .632 estimate. "
            "Learning curve may be steep."
        )

    # Check 4: Coverage
    coverage = results['diagnostics'].get('coverage_fraction', 1.0)
    if coverage < 0.95:
        warnings.append(
            f"WARNING: Only {coverage*100:.1f}% of observations have OOB predictions. "
            f"Increase n_bootstrap or check for data issues."
        )
        is_reliable = False

    # Check 5: Mean OOB predictions per observation
    mean_oob_preds = results['diagnostics'].get('mean_oob_predictions_per_obs', 0)
    if mean_oob_preds < 20:
        diagnostics.append(
            f"NOTE: Mean {mean_oob_preds:.1f} OOB predictions per observation. "
            f"Consider increasing n_bootstrap for more stable aggregation."
        )

    # Reliability assessment
    reliability_score = 1.0
    if apparent < 0.01:
        reliability_score -= 0.3
    if relative_gap > 0.5:
        reliability_score -= 0.2
    if coverage < 0.95:
        reliability_score -= 0.3

    if reliability_score >= 0.8:
        recommendation = "RECOMMEND: .632 estimate appears reliable."
    elif reliability_score >= 0.5:
        recommendation = "RECOMMEND: Use .632 with caution. Consider .632+ or CV."
    else:
        recommendation = "RECOMMEND: .632 may be unreliable. Use .632+ or CV instead."

    return {
        'is_reliable': is_reliable,
        'reliability_score': reliability_score,
        'warnings': warnings,
        'diagnostics': diagnostics,
        'recommendation': recommendation,
        'raw_metrics': {
            'apparent_error': apparent,
            'oob_error': oob,
            'gap': gap,
            'relative_gap': relative_gap,
            'coverage': coverage,
        }
    }


# Example usage
if __name__ == "__main__":
    # Simulate results for an overfit model
    overfit_results = {
        'point_632_error': 0.15,
        'apparent_error': 0.001,  # Near-zero training error
        'oob_error': 0.24,
        'standard_error': 0.02,
        'diagnostics': {
            'coverage_fraction': 0.98,
            'mean_oob_predictions_per_obs': 35.2,
        }
    }

    diagnosis = diagnose_632_reliability(overfit_results)

    print("Reliability Diagnosis")
    print("=" * 50)
    print(f"Reliability Score: {diagnosis['reliability_score']:.2f}")
    print(f"Is Reliable: {diagnosis['is_reliable']}")
    print(f"\n{diagnosis['recommendation']}")

    if diagnosis['warnings']:
        print("\nWarnings:")
        for w in diagnosis['warnings']:
            print(f"  - {w}")
```

Random Forests provide out-of-bag error estimates 'for free' as a byproduct of the bagging procedure. This OOB error is closely related to—but not identical to—the bootstrap OOB error we've discussed.
How Random Forest OOB Works: each tree in the forest is trained on its own bootstrap sample of the data. For every observation, the trees whose bootstrap samples excluded it (roughly 36.8% of the trees) make predictions; these are aggregated by majority vote (classification) or averaging (regression), and the OOB error is the loss of these aggregated predictions against the true responses.
Key Difference from .632 Bootstrap:
Random Forest OOB uses aggregated predictions from multiple trees, each trained on different bootstrap samples. This aggregation reduces variance compared to using a single model per bootstrap (as in standard .632 bootstrap).
For Random Forests specifically, the OOB error is essentially free (it reuses the trees already grown during bagging), is based on predictions aggregated over the roughly one third of trees that excluded each observation, and in practice tracks the forest's test error closely, with at most a mild pessimistic bias.
For Random Forests and gradient boosting with bagging (like some XGBoost configurations), use the built-in OOB error—it's free and well-calibrated. For single models (logistic regression, SVM, single decision tree, neural networks), use .632 or .632+ bootstrap when you need stable error estimates from limited data.
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score


def compare_rf_oob_with_632(X: np.ndarray, y: np.ndarray,
                            n_estimators: int = 100) -> dict:
    """
    Compare Random Forest OOB error with .632 bootstrap.

    Demonstrates that RF OOB is similar to .632 OOB but benefits
    from tree aggregation.
    """
    # Random Forest with OOB scoring enabled
    rf = RandomForestClassifier(
        n_estimators=n_estimators,
        oob_score=True,  # Enable OOB scoring
        random_state=42,
        n_jobs=-1
    )
    rf.fit(X, y)

    # RF OOB error
    rf_oob_error = 1 - rf.oob_score_

    # Training error
    train_error = 1 - rf.score(X, y)

    # Cross-validation for comparison
    cv_scores = cross_val_score(
        RandomForestClassifier(n_estimators=n_estimators, random_state=42),
        X, y, cv=10, scoring='accuracy'
    )
    cv_error = 1 - np.mean(cv_scores)

    # Theoretical .632 (using RF OOB as the OOB component)
    point_632_approx = 0.368 * train_error + 0.632 * rf_oob_error

    return {
        'rf_oob_error': rf_oob_error,
        'rf_train_error': train_error,
        'cv_10fold_error': cv_error,
        'point_632_using_rf_oob': point_632_approx,
        'note': (
            "RF OOB is typically preferred for Random Forests. "
            "The .632 combination with RF OOB isn't standard because "
            "RF OOB already has low bias."
        )
    }


if __name__ == "__main__":
    # Generate sample data
    X, y = make_classification(
        n_samples=500, n_features=20, n_informative=10, random_state=42
    )

    results = compare_rf_oob_with_632(X, y)

    print("Random Forest Error Estimation Comparison")
    print("=" * 50)
    print(f"RF OOB Error: {results['rf_oob_error']:.4f}")
    print(f"RF Training Error: {results['rf_train_error']:.4f}")
    print(f"10-Fold CV Error: {results['cv_10fold_error']:.4f}")
    print(f".632 (using RF OOB): {results['point_632_using_rf_oob']:.4f}")
```

The .632 bootstrap provides a principled approach to combining biased estimators into a less biased composite. By understanding the sources of bias in apparent and OOB errors, and using weights derived from the fundamental properties of bootstrap sampling, we achieve more accurate generalization error estimates.
What's Next:
The .632 bootstrap has a significant limitation: it can be overly optimistic for models that overfit dramatically (apparent error ≈ 0). In the next page, we'll study the .632+ bootstrap, which addresses this limitation by incorporating a 'no-information' error rate that accounts for the amount of overfitting.
You now understand the .632 bootstrap—its derivation, implementation, strengths, and limitations. This weighted estimator represents a cornerstone of bootstrap error estimation, providing the foundation for understanding the more sophisticated .632+ variant.