Every time we draw a bootstrap sample, we inadvertently create a natural validation set: the observations that weren't selected. These out-of-bootstrap (OOB) samples—comprising roughly 36.8% of the data on average—provide 'free' test cases that were never seen during training.
This property, initially a curiosity of bootstrap sampling, turned out to be remarkably useful. In ensemble methods like Random Forests, OOB error has become the de facto method for estimating generalization performance, eliminating the need for separate cross-validation.
In this page, we'll thoroughly examine the OOB error estimator: its mathematical properties, relationship to other methods, implementation details, and role in modern machine learning.
By the end of this page, you will understand: (1) The precise definition and computation of OOB error; (2) Its bias characteristics and relationship to training set size; (3) How OOB error relates to leave-one-out cross-validation; (4) OOB error in Random Forests—why it 'comes for free'; (5) Variance analysis and confidence interval construction; (6) Practical guidelines for when to use OOB error vs. cross-validation.
The out-of-bootstrap error is computed by evaluating each observation using only models trained on bootstrap samples that exclude that observation.
Formal Definition:
Let $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ be our dataset, and let $D^*_1, D^*_2, \ldots, D^*_B$ be $B$ bootstrap samples drawn from $D$. For each bootstrap sample $D^*_b$, let $\hat{f}_b$ be the model trained on $D^*_b$.
For each observation $i \in \{1, \ldots, n\}$, define: $$C^{-i} = \{b : (x_i, y_i) \notin D^*_b\}$$
This is the set of bootstrap samples that exclude observation $i$.
The OOB prediction for observation $i$ is: $$\hat{y}_i^{\text{OOB}} = \text{aggregate}\{\hat{f}_b(x_i) : b \in C^{-i}\}$$
where 'aggregate' is majority vote for classification or mean for regression.
The OOB error is then: $$\widehat{\text{Err}}^{\text{OOB}} = \frac{1}{n} \sum_{i=1}^{n} L(y_i, \hat{y}_i^{\text{OOB}})$$
where $L$ is the loss function.
Each observation is used as a test case multiple times—once for each bootstrap sample that excludes it. On average, each observation is excluded from about B × (1/e) ≈ 0.368B bootstrap samples. This averaging over many models reduces variance compared to simple holdout.
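The 36.8% figure is easy to verify by simulation. The sketch below is purely illustrative (the choices of $n$ and $B$ are arbitrary, not taken from the text): it draws bootstrap samples and counts how often each observation is left out.

```python
import numpy as np

# Minimal sketch: verify that each observation is out-of-bootstrap
# for roughly a 1/e ≈ 36.8% share of the bootstrap samples.
rng = np.random.default_rng(0)
n, B = 1000, 2000  # arbitrary sizes chosen for illustration

excluded_counts = np.zeros(n)
for _ in range(B):
    in_bag = np.unique(rng.integers(0, n, size=n))  # indices drawn with replacement
    mask = np.ones(n, dtype=bool)
    mask[in_bag] = False                            # True where the observation was excluded
    excluded_counts += mask

print(f"Average exclusion rate: {excluded_counts.mean() / B:.3f}")  # ≈ 0.368
print(f"Theoretical value 1/e:  {1 / np.e:.3f}")
```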
Two Flavors of OOB Error:
There are subtle variations in how OOB error is computed:
Per-Observation OOB (Aggregated): For each observation, aggregate predictions from all models that excluded it, then compute loss. This is the standard approach for Random Forests.
Per-Bootstrap OOB (Averaged): For each bootstrap sample, compute the error on its OOB observations, then average across bootstraps. This scores individual models rather than the smoothed ensemble prediction, so it tends to run slightly higher and noisier.
Mathematically:
Aggregated: $\widehat{\text{Err}}^{\text{OOB}}_\text{agg} = \frac{1}{n} \sum_i L(y_i, \text{agg}_b \hat{f}_b(x_i))$
Averaged: $\widehat{\text{Err}}^{\text{OOB}}_{\text{avg}} = \frac{1}{B} \sum_{b=1}^{B} \frac{1}{|OOB_b|} \sum_{i \in OOB_b} L(y_i, \hat{f}_b(x_i))$
The aggregated version typically has lower variance because individual predictions are smoothed before loss computation.
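The difference between the two flavors is easiest to see in code. The following sketch is a toy setup (a shallow decision tree on a synthetic dataset, neither of which comes from the text) that computes both versions under 0-1 loss: the aggregated flavor scores the majority vote per observation, while the averaged flavor scores each individual model.

```python
import numpy as np
from collections import Counter
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

# Toy data and base learner (illustrative choices only)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
base_model = DecisionTreeClassifier(max_depth=3, random_state=0)

rng = np.random.RandomState(0)
n, B = len(y), 200
per_obs_preds = {i: [] for i in range(n)}   # for the aggregated flavor
per_bootstrap_errors = []                   # for the averaged flavor

for _ in range(B):
    idx = rng.choice(n, size=n, replace=True)
    oob = np.setdiff1d(np.arange(n), idx)   # observations excluded from this sample
    if oob.size == 0:
        continue
    m = clone(base_model).fit(X[idx], y[idx])
    preds = m.predict(X[oob])
    per_bootstrap_errors.append(np.mean(preds != y[oob]))
    for i, p in zip(oob, preds):
        per_obs_preds[i].append(p)

# Aggregated: majority vote per observation, then one loss over all observations
votes = {i: Counter(p).most_common(1)[0][0] for i, p in per_obs_preds.items() if p}
err_agg = np.mean([votes[i] != y[i] for i in votes])

# Averaged: per-bootstrap error, then mean over bootstraps
err_avg = np.mean(per_bootstrap_errors)

print(f"Aggregated OOB error: {err_agg:.4f}")
print(f"Averaged OOB error:   {err_avg:.4f}")
```

On unstable learners like this shallow tree, the two numbers typically differ noticeably, with the aggregated version lower because the vote smooths out individual trees' mistakes.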
The OOB error estimator has a well-characterized bias that arises from the reduced effective training set size.
The Source of Bias:
Each bootstrap sample contains approximately $n(1 - 1/e) \approx 0.632n$ unique observations. Therefore, each model $\hat{f}_b$ is trained on effectively fewer observations than the full dataset.
If generalization error decreases with training set size (the typical case), then: $$E[\widehat{\text{Err}}^{\text{OOB}}] \approx \text{Err}(0.632n) > \text{Err}(n)$$
The OOB error estimates the performance of a model trained on ~63.2% of the data, not the full dataset. This creates pessimistic (upward) bias.
| n (samples) | Effective Training (0.632n) | Typical Bias Magnitude | Impact |
|---|---|---|---|
| 50 | ~32 | Moderate-High | Noticeably pessimistic |
| 100 | ~63 | Moderate | Measurable pessimism |
| 500 | ~316 | Small | Slight pessimism |
| 1000 | ~632 | Small | Minor effect |
| 10000 | ~6320 | Very Small | Nearly negligible |
Quantifying the Bias:
Under a linear learning curve model $\text{Err}(m) = \alpha + \beta/m$, the OOB bias is:
$$\text{Bias} = E[\widehat{\text{Err}}^{\text{OOB}}] - \text{Err}(n) = \frac{\beta}{0.632n} - \frac{\beta}{n} = \frac{\beta(1 - 0.632)}{0.632n} = \frac{0.368\beta}{0.632n}$$
This bias decreases as $O(1/n)$, meaning OOB error becomes less biased for larger datasets.
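To make the $O(1/n)$ decay concrete, the short sketch below plugs made-up learning-curve parameters into this formula for several sample sizes (the values of $\alpha$ and $\beta$ are purely illustrative, not fitted to any real model):

```python
# Numeric illustration of OOB bias under an assumed linear learning
# curve Err(m) = alpha + beta / m. Parameter values are hypothetical.
alpha, beta = 0.10, 5.0
for n in [50, 100, 500, 1000, 10000]:
    err_full = alpha + beta / n            # error of a model trained on all n points
    err_oob = alpha + beta / (0.632 * n)   # what OOB effectively estimates
    print(f"n={n:>6}: Err(n)={err_full:.4f}  E[OOB]≈{err_oob:.4f}  "
          f"bias≈{err_oob - err_full:.4f}")
```

The printed bias shrinks roughly tenfold as $n$ grows from 100 to 1000, mirroring the table above.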
Why OOB Bias is Usually Acceptable:
The bias is pessimistic rather than optimistic (it errs on the conservative side), it shrinks as $O(1/n)$, and it usually affects competing models similarly, so relative comparisons survive. It matters most when: (1) the sample size is small (n < 200); (2) the learning curve is steep (complex models benefit greatly from more data); (3) you need accurate absolute error estimates, not just relative comparisons. In these cases, consider the .632 or .632+ corrections sketched below.
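As a point of reference, the .632 correction blends the optimistic resubstitution (training) error with the pessimistic OOB error. A minimal sketch, assuming you have already computed `train_error` and `oob_error` as floats (the numbers in the example are made up):

```python
def err_632(train_error: float, oob_error: float) -> float:
    """Efron's .632 estimator: weight the optimistic training error by 0.368
    and the pessimistic OOB error by 0.632 to offset their opposite biases."""
    return 0.368 * train_error + 0.632 * oob_error

# Hypothetical values, purely for illustration
print(f"{err_632(train_error=0.02, oob_error=0.15):.3f}")  # 0.368*0.02 + 0.632*0.15 ≈ 0.102
```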
OOB error has an interesting relationship to leave-one-out cross-validation (LOOCV). Understanding this connection helps clarify the properties of both methods.
Leave-One-Out CV:
$$\widehat{\text{Err}}^{\text{LOOCV}} = \frac{1}{n} \sum_{i=1}^{n} L(y_i, \hat{f}^{-i}(x_i))$$
where $\hat{f}^{-i}$ is trained on all observations except $i$.
Key Similarity: Both methods evaluate each observation using models that haven't seen it. In LOOCV, this is exactly one model trained on $n-1$ observations. In OOB, it's an ensemble of ~0.368B models, each trained on ~0.632n observations.
Key Differences:
| Property | LOOCV | OOB Error |
|---|---|---|
| Training set size per model | n - 1 (nearly full) | ~0.632n (reduced) |
| Bias | Nearly unbiased | Slightly pessimistic |
| Variance | High (correlated training sets) | Lower (ensemble averaging) |
| Computational cost | O(n × train cost) | O(B × train cost) |
| Number of models | Exactly n | User-specified B |
| Aggregation | No (single prediction each) | Yes (multiple predictions each) |
The Variance Advantage of OOB:
LOOCV has notoriously high variance because the $n$ training sets overlap almost completely—they share $n-2$ observations. This correlation means the $n$ error terms are highly correlated, inflating variance.
OOB error benefits from two variance-reducing mechanisms:
Ensemble averaging: Each observation's OOB prediction is an average (or vote) over multiple models, smoothing out model-specific variability.
Diverse training sets: Bootstrap samples overlap less than LOOCV training sets—two independent bootstrap samples share an expected ~0.632² × n ≈ 0.40n observations, since each observation appears in a given bootstrap sample with probability ≈ 0.632, independently across samples (the simulation below confirms this).
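A quick simulation (arbitrary $n$ and trial count, purely illustrative) of the ~0.40n expected overlap between two independent bootstrap samples:

```python
import numpy as np

# Minimal sketch: expected fraction of observations appearing in BOTH of
# two independent bootstrap samples is ≈ 0.632**2 ≈ 0.40.
rng = np.random.default_rng(0)
n, trials = 1000, 500
overlaps = []
for _ in range(trials):
    s1 = set(rng.integers(0, n, size=n).tolist())
    s2 = set(rng.integers(0, n, size=n).tolist())
    overlaps.append(len(s1 & s2) / n)

print(f"Simulated shared fraction: {np.mean(overlaps):.3f}")  # ≈ 0.40
print(f"Theoretical 0.632**2:      {0.632**2:.3f}")
```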
When OOB Approximates LOOCV:
For stable learners where error doesn't depend strongly on training set size, OOB and LOOCV give similar results. For unstable learners (decision trees, neural networks), they can differ substantially.
If you need nearly unbiased estimation and low variance simultaneously, neither LOOCV nor basic OOB is ideal. LOOCV is nearly unbiased but high variance. OOB has lower variance but pessimistic bias. The .632 bootstrap strikes a middle ground by combining both.
Random Forests have made OOB error famous because it provides error estimation for free—as a byproduct of the bagging procedure, with no additional computational cost beyond training the ensemble.
How Random Forest OOB Works:
A Random Forest consists of $T$ trees, each trained on a bootstrap sample of the data. During training, each tree records which observations were left out of its bootstrap sample; after training, every observation is predicted by aggregating only the trees for which it was out-of-bag, and comparing these aggregated predictions with the true labels yields the OOB error. The example below demonstrates this using scikit-learn's built-in `oob_score` support.
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import cross_val_score


def demonstrate_rf_oob():
    """
    Demonstrate Random Forest OOB error and compare with CV.

    Key insight: RF OOB is computed during training with no additional cost.
    The oob_score_ attribute gives accuracy, so OOB error = 1 - oob_score_.
    """
    # Classification example
    X, y = make_classification(n_samples=500, n_features=20,
                               n_informative=10, random_state=42)

    # Random Forest with OOB scoring enabled
    rf = RandomForestClassifier(
        n_estimators=100,
        oob_score=True,  # MUST enable OOB scoring
        random_state=42,
        n_jobs=-1
    )
    rf.fit(X, y)

    # OOB predictions and error
    oob_predictions = rf.oob_decision_function_  # Shape: (n_samples, n_classes)
    oob_error = 1 - rf.oob_score_

    # Compare with 10-fold CV
    cv_scores = cross_val_score(
        RandomForestClassifier(n_estimators=100, random_state=42),
        X, y, cv=10, scoring='accuracy'
    )
    cv_error = 1 - np.mean(cv_scores)
    cv_std = np.std(cv_scores)

    print("Random Forest OOB vs Cross-Validation")
    print("=" * 50)
    print(f"OOB Error:        {oob_error:.4f}")
    print(f"10-fold CV Error: {cv_error:.4f} (±{cv_std:.4f})")
    print(f"Training Error:   {1 - rf.score(X, y):.4f}")
    print()

    # Per-observation OOB details (rows of oob_decision_function_ may be NaN
    # for observations that were never out-of-bag)
    n_obs_with_oob = np.sum(~np.isnan(oob_predictions[:, 0]))
    print("OOB Statistics:")
    print(f"  Mean trees per OOB prediction: ~{rf.n_estimators * 0.368:.0f}")
    print(f"  Observations with OOB predictions: {n_obs_with_oob}")

    return {
        'oob_error': oob_error,
        'cv_error': cv_error,
        'cv_std': cv_std,
        'train_error': 1 - rf.score(X, y),
    }


def analyze_oob_convergence(X, y, n_trees_range=(10, 500, 10)):
    """
    Analyze how OOB error stabilizes as more trees are added.

    As n_estimators increases:
    - Each observation has more OOB predictions to aggregate
    - OOB error estimate becomes more stable
    - Variance of OOB estimate decreases
    """
    start, stop, step = n_trees_range
    n_trees_list = range(start, stop + 1, step)
    oob_errors = []

    for n_trees in n_trees_list:
        rf = RandomForestClassifier(
            n_estimators=n_trees,
            oob_score=True,
            random_state=42,
            n_jobs=-1
        )
        rf.fit(X, y)
        oob_errors.append(1 - rf.oob_score_)

    # Find stabilization point (first tree count within 0.005 of the final error)
    errors_array = np.array(oob_errors)
    final_error = errors_array[-1]
    stabilization_idx = np.argmax(np.abs(errors_array - final_error) < 0.005)
    stabilization_trees = list(n_trees_list)[stabilization_idx]

    return {
        'n_trees_list': list(n_trees_list),
        'oob_errors': oob_errors,
        'final_oob_error': final_error,
        'stabilization_trees': stabilization_trees,
    }


if __name__ == "__main__":
    # Basic demonstration
    results = demonstrate_rf_oob()
    print("\nOOB Error Analysis:")
    print(f"  Difference (OOB - CV): {results['oob_error'] - results['cv_error']:.4f}")
    print("  (Positive = OOB is more pessimistic, as expected)")
```

For Random Forests specifically, OOB error benefits from the ensemble structure: each OOB prediction is a vote/average over ~0.368T trees. This aggregation reduces variance substantially compared to using a single model's OOB predictions. The OOB error is thus both free to compute AND well-calibrated for RF.
Understanding the variance of OOB error estimates is crucial for constructing confidence intervals and making model comparison decisions.
Sources of Variance:
The OOB estimate fluctuates for three reasons: (1) Monte Carlo noise from using a finite number of bootstrap replicates $B$; (2) instability of the base learner across resamples; (3) the fact that the dataset itself is a single finite draw from the population.
Estimating OOB Error Variance:
For bootstrap-based OOB (with explicit bootstrap iterations), we can estimate variance from the per-bootstrap error terms:
$$\widehat{\text{Var}}(\widehat{\text{Err}}^{\text{OOB}}) = \frac{1}{B-1} \sum_{b=1}^{B} \left(\widehat{\text{Err}}_b^{\text{OOB}} - \bar{\text{Err}}^{\text{OOB}}\right)^2$$
For Random Forest OOB, the variance is more complex because errors are not independent across observations (same trees contribute to multiple OOB predictions).
```python
import numpy as np
from collections import Counter
from typing import Dict

from scipy import stats
from sklearn.base import clone


def oob_error_with_ci(X: np.ndarray, y: np.ndarray, model,
                      n_bootstrap: int = 500,
                      confidence_level: float = 0.95,
                      random_state: int = 42) -> Dict:
    """
    Compute OOB error with confidence intervals.

    Uses the per-bootstrap OOB errors to construct bootstrap
    confidence intervals for the OOB error estimate.

    Parameters
    ----------
    X : np.ndarray
        Features
    y : np.ndarray
        Labels
    model : sklearn estimator
        Model to evaluate
    n_bootstrap : int
        Number of bootstrap iterations
    confidence_level : float
        Confidence level for interval (default 0.95)
    random_state : int
        Random seed

    Returns
    -------
    Dict
        OOB error, SE, and confidence interval
    """
    rng = np.random.RandomState(random_state)
    n = len(y)

    # Determine if classification (heuristic: few unique target values)
    is_classification = len(np.unique(y)) <= 20
    if is_classification:
        loss_fn = lambda yt, yp: np.mean(yt != yp)
    else:
        loss_fn = lambda yt, yp: np.mean((yt - yp) ** 2)

    # Track per-bootstrap OOB errors and per-observation predictions
    bootstrap_oob_errors = []
    oob_predictions = {i: [] for i in range(n)}

    for b in range(n_bootstrap):
        # Bootstrap sample
        indices = rng.choice(n, size=n, replace=True)
        in_bag = set(indices)
        out_of_bag = [i for i in range(n) if i not in in_bag]

        if len(out_of_bag) == 0:
            continue

        # Train and predict
        m = clone(model)
        m.fit(X[indices], y[indices])
        y_oob_pred = m.predict(X[out_of_bag])
        y_oob_true = y[out_of_bag]

        # Per-bootstrap error
        bootstrap_oob_errors.append(loss_fn(y_oob_true, y_oob_pred))

        # Store predictions
        for i, idx in enumerate(out_of_bag):
            oob_predictions[idx].append(y_oob_pred[i])

    # Aggregate OOB predictions (majority vote or mean)
    final_pred = []
    final_true = []
    for i in range(n):
        if oob_predictions[i]:
            if is_classification:
                final_pred.append(Counter(oob_predictions[i]).most_common(1)[0][0])
            else:
                final_pred.append(np.mean(oob_predictions[i]))
            final_true.append(y[i])

    # Point estimate
    oob_error = loss_fn(np.array(final_true), np.array(final_pred))

    # Bootstrap SE (spread of the per-bootstrap errors)
    se_bootstrap = np.std(bootstrap_oob_errors, ddof=1)

    # Bootstrap percentile CI
    alpha = 1 - confidence_level
    ci_lower = np.percentile(bootstrap_oob_errors, 100 * alpha / 2)
    ci_upper = np.percentile(bootstrap_oob_errors, 100 * (1 - alpha / 2))

    # Normal approximation CI
    z = stats.norm.ppf(1 - alpha / 2)
    ci_normal_lower = oob_error - z * se_bootstrap
    ci_normal_upper = oob_error + z * se_bootstrap

    return {
        'oob_error': oob_error,
        'se_bootstrap': se_bootstrap,
        'ci_percentile': (ci_lower, ci_upper),
        'ci_normal': (ci_normal_lower, ci_normal_upper),
        'n_bootstrap': n_bootstrap,
        'confidence_level': confidence_level,
        'coverage': len(final_true) / n,
    }


def rf_oob_with_jackknife_variance(X: np.ndarray, y: np.ndarray,
                                   n_estimators: int = 500,
                                   random_state: int = 42) -> Dict:
    """
    Estimate RF OOB error with jackknife variance estimation.

    The jackknife (delete-one) approach estimates variance by looking
    at how the OOB error changes when each observation is removed
    from the dataset.

    Note: This is computationally expensive as it requires refitting
    the RF n times.
    """
    from sklearn.ensemble import RandomForestClassifier

    n = len(y)

    # Full dataset OOB
    rf_full = RandomForestClassifier(n_estimators=n_estimators, oob_score=True,
                                     random_state=random_state)
    rf_full.fit(X, y)
    full_oob_error = 1 - rf_full.oob_score_

    # Jackknife: leave each observation out
    jackknife_errors = []
    for i in range(min(n, 100)):  # Limit for computational reasons
        mask = np.ones(n, dtype=bool)
        mask[i] = False
        rf_i = RandomForestClassifier(n_estimators=n_estimators, oob_score=True,
                                      random_state=random_state)
        rf_i.fit(X[mask], y[mask])
        jackknife_errors.append(1 - rf_i.oob_score_)

    # Jackknife variance estimate
    mean_jackknife = np.mean(jackknife_errors)
    n_jack = len(jackknife_errors)
    jackknife_variance = ((n_jack - 1) / n_jack) * np.sum(
        (np.array(jackknife_errors) - mean_jackknife) ** 2
    )
    jackknife_se = np.sqrt(jackknife_variance)

    return {
        'oob_error': full_oob_error,
        'jackknife_se': jackknife_se,
        'jackknife_variance': jackknife_variance,
        'n_jackknife_samples': n_jack,
    }


if __name__ == "__main__":
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=200, n_features=10, random_state=42)
    model = DecisionTreeClassifier(max_depth=5, random_state=42)

    results = oob_error_with_ci(X, y, model, n_bootstrap=1000)

    print("OOB Error with Confidence Intervals")
    print("=" * 50)
    print(f"OOB Error: {results['oob_error']:.4f}")
    print(f"Bootstrap SE: {results['se_bootstrap']:.4f}")
    print(f"95% CI (percentile): ({results['ci_percentile'][0]:.4f}, "
          f"{results['ci_percentile'][1]:.4f})")
    print(f"95% CI (normal): ({results['ci_normal'][0]:.4f}, "
          f"{results['ci_normal'][1]:.4f})")
    print(f"Coverage: {results['coverage']:.1%}")
```

The per-bootstrap variance estimate can underestimate true variance because bootstrap samples are not independent—they're all drawn from the same finite dataset. For more accurate variance estimates, consider the infinitesimal jackknife or grouped bootstrap approaches.
Correct implementation of OOB error requires attention to several subtle details.
```python
import numpy as np
from sklearn.base import clone
from collections import defaultdict, Counter
from typing import Dict, Optional, Callable


class RobustOOBEstimator:
    """
    Robust Out-of-Bootstrap error estimator with comprehensive
    diagnostics and edge case handling.
    """

    def __init__(self, n_bootstrap: int = 200,
                 min_oob_percentage: float = 0.95,
                 probability_aggregation: bool = True,
                 random_state: int = 42):
        """
        Parameters
        ----------
        n_bootstrap : int
            Number of bootstrap iterations
        min_oob_percentage : float
            Minimum percentage of observations that must have OOB predictions
        probability_aggregation : bool
            If True (and model has predict_proba), aggregate probabilities
            instead of hard predictions
        random_state : int
            Random seed
        """
        self.n_bootstrap = n_bootstrap
        self.min_oob_percentage = min_oob_percentage
        self.probability_aggregation = probability_aggregation
        self.random_state = random_state

    def compute(self, X: np.ndarray, y: np.ndarray, model,
                loss_fn: Optional[Callable] = None) -> Dict:
        """
        Compute robust OOB error estimate.
        """
        rng = np.random.RandomState(self.random_state)
        n = len(y)

        # Determine task type
        unique_classes = np.unique(y)
        is_classification = len(unique_classes) <= 20
        has_predict_proba = (hasattr(model, 'predict_proba')
                             and is_classification
                             and self.probability_aggregation)

        # Default loss function
        if loss_fn is None:
            if is_classification:
                loss_fn = lambda yt, yp: np.mean(yt != yp)
            else:
                loss_fn = lambda yt, yp: np.mean((yt - yp) ** 2)

        # Initialize storage
        if has_predict_proba:
            # Store probability sums and counts for averaging
            n_classes = len(unique_classes)
            oob_proba_sum = np.zeros((n, n_classes))
            oob_counts = np.zeros(n)
        else:
            oob_predictions = defaultdict(list)

        bootstrap_errors = []
        oob_sizes = []

        for b in range(self.n_bootstrap):
            # Bootstrap sample
            indices = rng.choice(n, size=n, replace=True)
            in_bag = set(indices)
            out_of_bag = np.array([i for i in range(n) if i not in in_bag])

            if len(out_of_bag) == 0:
                continue
            oob_sizes.append(len(out_of_bag))

            # Train model
            m = clone(model)
            m.fit(X[indices], y[indices])

            # OOB predictions
            X_oob = X[out_of_bag]
            y_oob_true = y[out_of_bag]

            if has_predict_proba:
                proba = m.predict_proba(X_oob)
                for i, idx in enumerate(out_of_bag):
                    oob_proba_sum[idx] += proba[i]
                    oob_counts[idx] += 1
                y_oob_pred = unique_classes[np.argmax(proba, axis=1)]
            else:
                y_oob_pred = m.predict(X_oob)
                for i, idx in enumerate(out_of_bag):
                    oob_predictions[idx].append(y_oob_pred[i])

            bootstrap_errors.append(loss_fn(y_oob_true, y_oob_pred))

        # Aggregate predictions
        final_predictions = np.full(n, np.nan)
        predictions_available = np.zeros(n, dtype=bool)

        for i in range(n):
            if has_predict_proba:
                if oob_counts[i] > 0:
                    avg_proba = oob_proba_sum[i] / oob_counts[i]
                    final_predictions[i] = unique_classes[np.argmax(avg_proba)]
                    predictions_available[i] = True
            else:
                if i in oob_predictions and len(oob_predictions[i]) > 0:
                    if is_classification:
                        final_predictions[i] = Counter(oob_predictions[i]).most_common(1)[0][0]
                    else:
                        final_predictions[i] = np.mean(oob_predictions[i])
                    predictions_available[i] = True

        # Check coverage
        coverage = np.sum(predictions_available) / n
        if coverage < self.min_oob_percentage:
            import warnings
            warnings.warn(
                f"Only {coverage:.1%} of observations have OOB predictions. "
                f"Consider increasing n_bootstrap from {self.n_bootstrap}."
            )

        # Compute final OOB error
        mask = predictions_available
        oob_error = loss_fn(y[mask], final_predictions[mask].astype(y.dtype))

        # Statistics
        return {
            'oob_error': oob_error,
            'se_bootstrap': np.std(bootstrap_errors, ddof=1),
            'mean_bootstrap_error': np.mean(bootstrap_errors),
            'coverage': coverage,
            'n_observations_with_oob': int(np.sum(predictions_available)),
            'mean_oob_size': np.mean(oob_sizes),
            'mean_oob_fraction': np.mean(oob_sizes) / n,
            'n_bootstrap_used': len(bootstrap_errors),
            'is_classification': is_classification,
            'used_probability_aggregation': has_predict_proba,
        }


if __name__ == "__main__":
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=300, n_features=15, random_state=42)

    estimator = RobustOOBEstimator(n_bootstrap=500, probability_aggregation=True)
    results = estimator.compute(X, y, GradientBoostingClassifier(n_estimators=50))

    print("Robust OOB Estimation")
    print("=" * 50)
    for k, v in results.items():
        if isinstance(v, float):
            print(f"{k}: {v:.4f}")
        else:
            print(f"{k}: {v}")
```

Both OOB error and cross-validation estimate generalization performance, but they have different strengths. Understanding when to use each is crucial for efficient and accurate model evaluation.
Detailed Comparison:
| Scenario | Recommendation | Rationale |
|---|---|---|
| Random Forest training | Use built-in OOB | Free; well-calibrated for RF |
| Comparing RF to other models | Use CV for fair comparison | OOB isn't available for non-bagging models |
| Single decision tree | Use CV or bootstrap OOB | OOB requires multiple models |
| Neural network | Use CV | OOB not standard; CV with early stopping common |
| Hyperparameter tuning | Use CV | OOB can leak info across hyperparameter settings |
| Final model evaluation | Either; report method used | Consistency matters more than choice |
| Small sample (n < 100) | Consider .632 bootstrap | Both OOB and CV have issues with tiny n |
For Random Forests, trust OOB error—it's been extensively validated. For other models, k-fold CV (k=5 or 10) is the safe default. When results matter and computation permits, use repeated 10-fold CV for lowest variance. The .632 and .632+ bootstrap are most useful when you specifically need bootstrap methodology (e.g., for ensemble methods or when you want bootstrap confidence intervals).
The out-of-bootstrap error exploits a fundamental property of bootstrap sampling—that roughly 36.8% of observations are excluded from each sample—to provide 'free' test cases for model evaluation. This property has made OOB error a cornerstone of ensemble methods, particularly Random Forests.
What's Next:
We've now covered OOB error as a standalone estimator and its role in the .632/.632+ formulas. In the next page, we'll explore bootstrap confidence intervals—using bootstrap resampling to construct intervals for any quantity of interest, including model performance metrics.
You now understand out-of-bootstrap error in depth—its computation, bias-variance characteristics, relationship to LOOCV, role in Random Forests, and when to use it versus cross-validation. This knowledge enables you to make informed choices about model evaluation strategies.