Every time we draw a bootstrap sample, we inadvertently create a natural validation set: the observations that weren't selected. These out-of-bootstrap (OOB) samples—comprising roughly 36.8% of the data on average—provide 'free' test cases that were never seen during training.
This property, initially a curiosity of bootstrap sampling, turned out to be remarkably useful. In ensemble methods like Random Forests, OOB error has become the de facto method for estimating generalization performance, eliminating the need for separate cross-validation.
In this page, we'll thoroughly examine the OOB error estimator: its mathematical properties, relationship to other methods, implementation details, and role in modern machine learning.
By the end of this page, you will understand: (1) The precise definition and computation of OOB error; (2) Its bias characteristics and relationship to training set size; (3) How OOB error relates to leave-one-out cross-validation; (4) OOB error in Random Forests—why it 'comes for free'; (5) Variance analysis and confidence interval construction; (6) Practical guidelines for when to use OOB error vs. cross-validation.
The out-of-bootstrap error is computed by evaluating each observation using only models trained on bootstrap samples that exclude that observation.
Formal Definition:
Let $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ be our dataset, and let $D^*_1, D^*_2, \ldots, D^*_B$ be $B$ bootstrap samples drawn from $D$. For each bootstrap sample $D^*_b$, let $\hat{f}_b$ be the model trained on $D^*_b$.
For each observation $i \in \{1, \ldots, n\}$, define: $$C^{-i} = \{b : (x_i, y_i) \notin D^*_b\}$$
This is the set of bootstrap samples that exclude observation $i$.
The OOB prediction for observation $i$ is: $$\hat{y}_i^{\text{OOB}} = \text{aggregate}\{\hat{f}_b(x_i) : b \in C^{-i}\}$$
where 'aggregate' is majority vote for classification or mean for regression.
The OOB error is then: $$\widehat{\text{Err}}^{\text{OOB}} = \frac{1}{n} \sum_{i=1}^{n} L(y_i, \hat{y}_i^{\text{OOB}})$$
where $L$ is the loss function.
Each observation is used as a test case multiple times—once for each bootstrap sample that excludes it. On average, each observation is excluded from about B × (1/e) ≈ 0.368B bootstrap samples. This averaging over many models reduces variance compared to simple holdout.
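The 36.8% figure is easy to verify by simulation. The sketch below is purely illustrative (the choices of $n$ and $B$ are arbitrary, not taken from the text): it draws bootstrap samples and counts how often each observation is left out.

```python
import numpy as np

# Minimal sketch: verify that each observation is out-of-bootstrap
# for roughly a 1/e ≈ 36.8% share of the bootstrap samples.
rng = np.random.default_rng(0)
n, B = 1000, 2000  # arbitrary sizes chosen for illustration

excluded_counts = np.zeros(n)
for _ in range(B):
    in_bag = np.unique(rng.integers(0, n, size=n))  # indices drawn with replacement
    mask = np.ones(n, dtype=bool)
    mask[in_bag] = False                            # True where the observation was excluded
    excluded_counts += mask

print(f"Average exclusion rate: {excluded_counts.mean() / B:.3f}")  # ≈ 0.368
print(f"Theoretical value 1/e:  {1 / np.e:.3f}")
```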
Two Flavors of OOB Error:
There are subtle variations in how OOB error is computed:
Per-Observation OOB (Aggregated): For each observation, aggregate predictions from all models that excluded it, then compute loss. This is the standard approach for Random Forests.
Per-Bootstrap OOB (Averaged): For each bootstrap sample, compute the error on its OOB observations, then average across bootstraps. This scores individual models rather than the smoothed ensemble prediction, so it tends to run slightly higher and noisier.
Mathematically:
Aggregated: $\widehat{\text{Err}}^{\text{OOB}}_\text{agg} = \frac{1}{n} \sum_i L(y_i, \text{agg}_b \hat{f}_b(x_i))$
Averaged: $\widehat{\text{Err}}^{\text{OOB}}_{\text{avg}} = \frac{1}{B} \sum_{b=1}^{B} \frac{1}{|OOB_b|} \sum_{i \in OOB_b} L(y_i, \hat{f}_b(x_i))$
The aggregated version typically has lower variance because individual predictions are smoothed before loss computation.
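The difference between the two flavors is easiest to see in code. The following sketch is a toy setup (a shallow decision tree on a synthetic dataset, neither of which comes from the text) that computes both versions under 0-1 loss: the aggregated flavor scores the majority vote per observation, while the averaged flavor scores each individual model.

```python
import numpy as np
from collections import Counter
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

# Toy data and base learner (illustrative choices only)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
base_model = DecisionTreeClassifier(max_depth=3, random_state=0)

rng = np.random.RandomState(0)
n, B = len(y), 200
per_obs_preds = {i: [] for i in range(n)}   # for the aggregated flavor
per_bootstrap_errors = []                   # for the averaged flavor

for _ in range(B):
    idx = rng.choice(n, size=n, replace=True)
    oob = np.setdiff1d(np.arange(n), idx)   # observations excluded from this sample
    if oob.size == 0:
        continue
    m = clone(base_model).fit(X[idx], y[idx])
    preds = m.predict(X[oob])
    per_bootstrap_errors.append(np.mean(preds != y[oob]))
    for i, p in zip(oob, preds):
        per_obs_preds[i].append(p)

# Aggregated: majority vote per observation, then one loss over all observations
votes = {i: Counter(p).most_common(1)[0][0] for i, p in per_obs_preds.items() if p}
err_agg = np.mean([votes[i] != y[i] for i in votes])

# Averaged: per-bootstrap error, then mean over bootstraps
err_avg = np.mean(per_bootstrap_errors)

print(f"Aggregated OOB error: {err_agg:.4f}")
print(f"Averaged OOB error:   {err_avg:.4f}")
```

On unstable learners like this shallow tree, the two numbers typically differ noticeably, with the aggregated version lower because the vote smooths out individual trees' mistakes.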
The OOB error estimator has a well-characterized bias that arises from the reduced effective training set size.
The Source of Bias:
Each bootstrap sample contains approximately $n(1 - 1/e) \approx 0.632n$ unique observations. Therefore, each model $\hat{f}_b$ is trained on effectively fewer observations than the full dataset.
If generalization error decreases with training set size (the typical case), then: $$E[\widehat{\text{Err}}^{\text{OOB}}] \approx \text{Err}(0.632n) > \text{Err}(n)$$
The OOB error estimates the performance of a model trained on ~63.2% of the data, not the full dataset. This creates pessimistic (upward) bias.
| n (samples) | Effective Training (0.632n) | Typical Bias Magnitude | Impact |
|---|---|---|---|
| 50 | ~32 | Moderate-High | Noticeably pessimistic |
| 100 | ~63 | Moderate | Measurable pessimism |
| 500 | ~316 | Small | Slight pessimism |
| 1000 | ~632 | Small | Minor effect |
| 10000 | ~6320 | Very Small | Nearly negligible |
Quantifying the Bias:
Under a linear learning curve model $\text{Err}(m) = \alpha + \beta/m$, the OOB bias is:
$$\text{Bias} = E[\widehat{\text{Err}}^{\text{OOB}}] - \text{Err}(n) = \frac{\beta}{0.632n} - \frac{\beta}{n} = \frac{\beta(1 - 0.632)}{0.632n} = \frac{0.368\beta}{0.632n}$$
This bias decreases as $O(1/n)$, meaning OOB error becomes less biased for larger datasets.
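To make the $O(1/n)$ decay concrete, the short sketch below plugs made-up learning-curve parameters into this formula for several sample sizes (the values of $\alpha$ and $\beta$ are purely illustrative, not fitted to any real model):

```python
# Numeric illustration of OOB bias under an assumed linear learning
# curve Err(m) = alpha + beta / m. Parameter values are hypothetical.
alpha, beta = 0.10, 5.0
for n in [50, 100, 500, 1000, 10000]:
    err_full = alpha + beta / n            # error of a model trained on all n points
    err_oob = alpha + beta / (0.632 * n)   # what OOB effectively estimates
    print(f"n={n:>6}: Err(n)={err_full:.4f}  E[OOB]≈{err_oob:.4f}  "
          f"bias≈{err_oob - err_full:.4f}")
```

The printed bias shrinks roughly tenfold as $n$ grows from 100 to 1000, mirroring the table above.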
Why OOB Bias is Usually Acceptable:
The bias is pessimistic rather than optimistic (it errs on the conservative side), it shrinks as $O(1/n)$, and it usually affects competing models similarly, so relative comparisons survive. It matters most when: (1) the sample size is small (n < 200); (2) the learning curve is steep (complex models benefit greatly from more data); (3) you need accurate absolute error estimates, not just relative comparisons. In these cases, consider the .632 or .632+ corrections sketched below.
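As a point of reference, the .632 correction blends the optimistic resubstitution (training) error with the pessimistic OOB error. A minimal sketch, assuming you have already computed `train_error` and `oob_error` as floats (the numbers in the example are made up):

```python
def err_632(train_error: float, oob_error: float) -> float:
    """Efron's .632 estimator: weight the optimistic training error by 0.368
    and the pessimistic OOB error by 0.632 to offset their opposite biases."""
    return 0.368 * train_error + 0.632 * oob_error

# Hypothetical values, purely for illustration
print(f"{err_632(train_error=0.02, oob_error=0.15):.3f}")  # 0.368*0.02 + 0.632*0.15 ≈ 0.102
```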
OOB error has an interesting relationship to leave-one-out cross-validation (LOOCV). Understanding this connection helps clarify the properties of both methods.
Leave-One-Out CV:
$$\widehat{\text{Err}}^{\text{LOOCV}} = \frac{1}{n} \sum_{i=1}^{n} L(y_i, \hat{f}^{-i}(x_i))$$
where $\hat{f}^{-i}$ is trained on all observations except $i$.
Key Similarity: Both methods evaluate each observation using models that haven't seen it. In LOOCV, this is exactly one model trained on $n-1$ observations. In OOB, it's an ensemble of ~0.368B models, each trained on ~0.632n observations.
Key Differences:
| Property | LOOCV | OOB Error |
|---|---|---|
| Training set size per model | n - 1 (nearly full) | ~0.632n (reduced) |
| Bias | Nearly unbiased | Slightly pessimistic |
| Variance | High (correlated training sets) | Lower (ensemble averaging) |
| Computational cost | O(n × train cost) | O(B × train cost) |
| Number of models | Exactly n | User-specified B |
| Aggregation | No (single prediction each) | Yes (multiple predictions each) |
The Variance Advantage of OOB:
LOOCV has notoriously high variance because the $n$ training sets overlap almost completely—they share $n-2$ observations. This correlation means the $n$ error terms are highly correlated, inflating variance.
OOB error benefits from two variance-reducing mechanisms:
Ensemble averaging: Each observation's OOB prediction is an average (or vote) over multiple models, smoothing out model-specific variability.
Diverse training sets: Bootstrap samples overlap less than LOOCV training sets—two independent bootstrap samples share an expected ~0.632² × n ≈ 0.40n observations, since each observation appears in a given bootstrap sample with probability ≈ 0.632, independently across samples (the simulation below confirms this).
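A quick simulation (arbitrary $n$ and trial count, purely illustrative) of the ~0.40n expected overlap between two independent bootstrap samples:

```python
import numpy as np

# Minimal sketch: expected fraction of observations appearing in BOTH of
# two independent bootstrap samples is ≈ 0.632**2 ≈ 0.40.
rng = np.random.default_rng(0)
n, trials = 1000, 500
overlaps = []
for _ in range(trials):
    s1 = set(rng.integers(0, n, size=n).tolist())
    s2 = set(rng.integers(0, n, size=n).tolist())
    overlaps.append(len(s1 & s2) / n)

print(f"Simulated shared fraction: {np.mean(overlaps):.3f}")  # ≈ 0.40
print(f"Theoretical 0.632**2:      {0.632**2:.3f}")
```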
When OOB Approximates LOOCV:
For stable learners where error doesn't depend strongly on training set size, OOB and LOOCV give similar results. For unstable learners (decision trees, neural networks), they can differ substantially.
If you need nearly unbiased estimation and low variance simultaneously, neither LOOCV nor basic OOB is ideal. LOOCV is nearly unbiased but high variance. OOB has lower variance but pessimistic bias. The .632 bootstrap strikes a middle ground by combining both.
Random Forests have made OOB error famous because it provides error estimation for free—as a byproduct of the bagging procedure, with no additional computational cost beyond training the ensemble.
How Random Forest OOB Works:
A Random Forest consists of $T$ trees, each trained on a bootstrap sample of the data. During training, each tree records which observations were left out of its bootstrap sample; after training, every observation is predicted by aggregating only the trees for which it was out-of-bag, and comparing these aggregated predictions with the true labels yields the OOB error. The example below demonstrates this using scikit-learn's built-in `oob_score` support.
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import cross_val_score


def demonstrate_rf_oob():
    """
    Demonstrate Random Forest OOB error and compare with CV.

    Key insight: RF OOB is computed during training with no additional cost.
    The oob_score_ attribute gives accuracy, so OOB error = 1 - oob_score_.
    """
    # Classification example
    X, y = make_classification(n_samples=500, n_features=20,
                               n_informative=10, random_state=42)

    # Random Forest with OOB scoring enabled
    rf = RandomForestClassifier(
        n_estimators=100,
        oob_score=True,  # MUST enable OOB scoring
        random_state=42,
        n_jobs=-1
    )
    rf.fit(X, y)

    # OOB predictions and error
    oob_predictions = rf.oob_decision_function_  # Shape: (n_samples, n_classes)
    oob_error = 1 - rf.oob_score_

    # Compare with 10-fold CV
    cv_scores = cross_val_score(
        RandomForestClassifier(n_estimators=100, random_state=42),
        X, y, cv=10, scoring='accuracy'
    )
    cv_error = 1 - np.mean(cv_scores)
    cv_std = np.std(cv_scores)

    print("Random Forest OOB vs Cross-Validation")
    print("=" * 50)
    print(f"OOB Error:        {oob_error:.4f}")
    print(f"10-fold CV Error: {cv_error:.4f} (±{cv_std:.4f})")
    print(f"Training Error:   {1 - rf.score(X, y):.4f}")
    print()

    # Per-observation OOB details (rows of oob_decision_function_ may be NaN
    # for observations that were never out-of-bag)
    n_obs_with_oob = np.sum(~np.isnan(oob_predictions[:, 0]))
    print("OOB Statistics:")
    print(f"  Mean trees per OOB prediction: ~{rf.n_estimators * 0.368:.0f}")
    print(f"  Observations with OOB predictions: {n_obs_with_oob}")

    return {
        'oob_error': oob_error,
        'cv_error': cv_error,
        'cv_std': cv_std,
        'train_error': 1 - rf.score(X, y),
    }


def analyze_oob_convergence(X, y, n_trees_range=(10, 500, 10)):
    """
    Analyze how OOB error stabilizes as more trees are added.

    As n_estimators increases:
    - Each observation has more OOB predictions to aggregate
    - OOB error estimate becomes more stable
    - Variance of OOB estimate decreases
    """
    start, stop, step = n_trees_range
    n_trees_list = range(start, stop + 1, step)
    oob_errors = []

    for n_trees in n_trees_list:
        rf = RandomForestClassifier(
            n_estimators=n_trees,
            oob_score=True,
            random_state=42,
            n_jobs=-1
        )
        rf.fit(X, y)
        oob_errors.append(1 - rf.oob_score_)

    # Find stabilization point (first tree count within 0.005 of the final error)
    errors_array = np.array(oob_errors)
    final_error = errors_array[-1]
    stabilization_idx = np.argmax(np.abs(errors_array - final_error) < 0.005)
    stabilization_trees = list(n_trees_list)[stabilization_idx]

    return {
        'n_trees_list': list(n_trees_list),
        'oob_errors': oob_errors,
        'final_oob_error': final_error,
        'stabilization_trees': stabilization_trees,
    }


if __name__ == "__main__":
    # Basic demonstration
    results = demonstrate_rf_oob()
    print("\nOOB Error Analysis:")
    print(f"  Difference (OOB - CV): {results['oob_error'] - results['cv_error']:.4f}")
    print("  (Positive = OOB is more pessimistic, as expected)")
```

For Random Forests specifically, OOB error benefits from the ensemble structure: each OOB prediction is a vote/average over ~0.368T trees. This aggregation reduces variance substantially compared to using a single model's OOB predictions. The OOB error is thus both free to compute AND well-calibrated for RF.
Understanding the variance of OOB error estimates is crucial for constructing confidence intervals and making model comparison decisions.
Sources of Variance:
The OOB estimate fluctuates for three reasons: (1) Monte Carlo noise from using a finite number of bootstrap replicates $B$; (2) instability of the base learner across resamples; (3) the fact that the dataset itself is a single finite draw from the population.
Estimating OOB Error Variance:
For bootstrap-based OOB (with explicit bootstrap iterations), we can estimate variance from the per-bootstrap error terms:
$$\widehat{\text{Var}}(\widehat{\text{Err}}^{\text{OOB}}) = \frac{1}{B-1} \sum_{b=1}^{B} \left(\widehat{\text{Err}}_b^{\text{OOB}} - \bar{\text{Err}}^{\text{OOB}}\right)^2$$
For Random Forest OOB, the variance is more complex because errors are not independent across observations (same trees contribute to multiple OOB predictions).
```python
import numpy as np
from collections import Counter
from typing import Dict

from scipy import stats
from sklearn.base import clone


def oob_error_with_ci(X: np.ndarray, y: np.ndarray, model,
                      n_bootstrap: int = 500,
                      confidence_level: float = 0.95,
                      random_state: int = 42) -> Dict:
    """
    Compute OOB error with confidence intervals.

    Uses the per-bootstrap OOB errors to construct bootstrap
    confidence intervals for the OOB error estimate.

    Parameters
    ----------
    X : np.ndarray
        Features
    y : np.ndarray
        Labels
    model : sklearn estimator
        Model to evaluate
    n_bootstrap : int
        Number of bootstrap iterations
    confidence_level : float
        Confidence level for interval (default 0.95)
    random_state : int
        Random seed

    Returns
    -------
    Dict
        OOB error, SE, and confidence interval
    """
    rng = np.random.RandomState(random_state)
    n = len(y)

    # Determine if classification (heuristic: few unique target values)
    is_classification = len(np.unique(y)) <= 20
    if is_classification:
        loss_fn = lambda yt, yp: np.mean(yt != yp)
    else:
        loss_fn = lambda yt, yp: np.mean((yt - yp) ** 2)

    # Track per-bootstrap OOB errors and per-observation predictions
    bootstrap_oob_errors = []
    oob_predictions = {i: [] for i in range(n)}

    for b in range(n_bootstrap):
        # Bootstrap sample
        indices = rng.choice(n, size=n, replace=True)
        in_bag = set(indices)
        out_of_bag = [i for i in range(n) if i not in in_bag]

        if len(out_of_bag) == 0:
            continue

        # Train and predict
        m = clone(model)
        m.fit(X[indices], y[indices])
        y_oob_pred = m.predict(X[out_of_bag])
        y_oob_true = y[out_of_bag]

        # Per-bootstrap error
        bootstrap_oob_errors.append(loss_fn(y_oob_true, y_oob_pred))

        # Store predictions
        for i, idx in enumerate(out_of_bag):
            oob_predictions[idx].append(y_oob_pred[i])

    # Aggregate OOB predictions (majority vote or mean)
    final_pred = []
    final_true = []
    for i in range(n):
        if oob_predictions[i]:
            if is_classification:
                final_pred.append(Counter(oob_predictions[i]).most_common(1)[0][0])
            else:
                final_pred.append(np.mean(oob_predictions[i]))
            final_true.append(y[i])

    # Point estimate
    oob_error = loss_fn(np.array(final_true), np.array(final_pred))

    # Bootstrap SE (spread of the per-bootstrap errors)
    se_bootstrap = np.std(bootstrap_oob_errors, ddof=1)

    # Bootstrap percentile CI
    alpha = 1 - confidence_level
    ci_lower = np.percentile(bootstrap_oob_errors, 100 * alpha / 2)
    ci_upper = np.percentile(bootstrap_oob_errors, 100 * (1 - alpha / 2))

    # Normal approximation CI
    z = stats.norm.ppf(1 - alpha / 2)
    ci_normal_lower = oob_error - z * se_bootstrap
    ci_normal_upper = oob_error + z * se_bootstrap

    return {
        'oob_error': oob_error,
        'se_bootstrap': se_bootstrap,
        'ci_percentile': (ci_lower, ci_upper),
        'ci_normal': (ci_normal_lower, ci_normal_upper),
        'n_bootstrap': n_bootstrap,
        'confidence_level': confidence_level,
        'coverage': len(final_true) / n,
    }


def rf_oob_with_jackknife_variance(X: np.ndarray, y: np.ndarray,
                                   n_estimators: int = 500,
                                   random_state: int = 42) -> Dict:
    """
    Estimate RF OOB error with jackknife variance estimation.

    The jackknife (delete-one) approach estimates variance by looking
    at how the OOB error changes when each observation is removed
    from the dataset.

    Note: This is computationally expensive as it requires refitting
    the RF n times.
    """
    from sklearn.ensemble import RandomForestClassifier

    n = len(y)

    # Full dataset OOB
    rf_full = RandomForestClassifier(n_estimators=n_estimators, oob_score=True,
                                     random_state=random_state)
    rf_full.fit(X, y)
    full_oob_error = 1 - rf_full.oob_score_

    # Jackknife: leave each observation out
    jackknife_errors = []
    for i in range(min(n, 100)):  # Limit for computational reasons
        mask = np.ones(n, dtype=bool)
        mask[i] = False
        rf_i = RandomForestClassifier(n_estimators=n_estimators, oob_score=True,
                                      random_state=random_state)
        rf_i.fit(X[mask], y[mask])
        jackknife_errors.append(1 - rf_i.oob_score_)

    # Jackknife variance estimate
    mean_jackknife = np.mean(jackknife_errors)
    n_jack = len(jackknife_errors)
    jackknife_variance = ((n_jack - 1) / n_jack) * np.sum(
        (np.array(jackknife_errors) - mean_jackknife) ** 2
    )
    jackknife_se = np.sqrt(jackknife_variance)

    return {
        'oob_error': full_oob_error,
        'jackknife_se': jackknife_se,
        'jackknife_variance': jackknife_variance,
        'n_jackknife_samples': n_jack,
    }


if __name__ == "__main__":
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=200, n_features=10, random_state=42)
    model = DecisionTreeClassifier(max_depth=5, random_state=42)

    results = oob_error_with_ci(X, y, model, n_bootstrap=1000)

    print("OOB Error with Confidence Intervals")
    print("=" * 50)
    print(f"OOB Error: {results['oob_error']:.4f}")
    print(f"Bootstrap SE: {results['se_bootstrap']:.4f}")
    print(f"95% CI (percentile): ({results['ci_percentile'][0]:.4f}, "
          f"{results['ci_percentile'][1]:.4f})")
    print(f"95% CI (normal): ({results['ci_normal'][0]:.4f}, "
          f"{results['ci_normal'][1]:.4f})")
    print(f"Coverage: {results['coverage']:.1%}")
```

The per-bootstrap variance estimate can underestimate true variance because bootstrap samples are not independent—they're all drawn from the same finite dataset. For more accurate variance estimates, consider the infinitesimal jackknife or grouped bootstrap approaches.
Correct implementation of OOB error requires attention to several subtle details.
```python
import numpy as np
from sklearn.base import clone
from collections import defaultdict, Counter
from typing import Dict, Optional, Callable


class RobustOOBEstimator:
    """
    Robust Out-of-Bootstrap error estimator with comprehensive
    diagnostics and edge case handling.
    """

    def __init__(self, n_bootstrap: int = 200,
                 min_oob_percentage: float = 0.95,
                 probability_aggregation: bool = True,
                 random_state: int = 42):
        """
        Parameters
        ----------
        n_bootstrap : int
            Number of bootstrap iterations
        min_oob_percentage : float
            Minimum percentage of observations that must have OOB predictions
        probability_aggregation : bool
            If True (and model has predict_proba), aggregate probabilities
            instead of hard predictions
        random_state : int
            Random seed
        """
        self.n_bootstrap = n_bootstrap
        self.min_oob_percentage = min_oob_percentage
        self.probability_aggregation = probability_aggregation
        self.random_state = random_state

    def compute(self, X: np.ndarray, y: np.ndarray, model,
                loss_fn: Optional[Callable] = None) -> Dict:
        """
        Compute robust OOB error estimate.
        """
        rng = np.random.RandomState(self.random_state)
        n = len(y)

        # Determine task type
        unique_classes = np.unique(y)
        is_classification = len(unique_classes) <= 20
        has_predict_proba = (hasattr(model, 'predict_proba')
                             and is_classification
                             and self.probability_aggregation)

        # Default loss function
        if loss_fn is None:
            if is_classification:
                loss_fn = lambda yt, yp: np.mean(yt != yp)
            else:
                loss_fn = lambda yt, yp: np.mean((yt - yp) ** 2)

        # Initialize storage
        if has_predict_proba:
            # Store probability sums and counts for averaging
            n_classes = len(unique_classes)
            oob_proba_sum = np.zeros((n, n_classes))
            oob_counts = np.zeros(n)
        else:
            oob_predictions = defaultdict(list)

        bootstrap_errors = []
        oob_sizes = []

        for b in range(self.n_bootstrap):
            # Bootstrap sample
            indices = rng.choice(n, size=n, replace=True)
            in_bag = set(indices)
            out_of_bag = np.array([i for i in range(n) if i not in in_bag])

            if len(out_of_bag) == 0:
                continue
            oob_sizes.append(len(out_of_bag))

            # Train model
            m = clone(model)
            m.fit(X[indices], y[indices])

            # OOB predictions
            X_oob = X[out_of_bag]
            y_oob_true = y[out_of_bag]

            if has_predict_proba:
                proba = m.predict_proba(X_oob)
                for i, idx in enumerate(out_of_bag):
                    oob_proba_sum[idx] += proba[i]
                    oob_counts[idx] += 1
                y_oob_pred = unique_classes[np.argmax(proba, axis=1)]
            else:
                y_oob_pred = m.predict(X_oob)
                for i, idx in enumerate(out_of_bag):
                    oob_predictions[idx].append(y_oob_pred[i])

            bootstrap_errors.append(loss_fn(y_oob_true, y_oob_pred))

        # Aggregate predictions
        final_predictions = np.full(n, np.nan)
        predictions_available = np.zeros(n, dtype=bool)

        for i in range(n):
            if has_predict_proba:
                if oob_counts[i] > 0:
                    avg_proba = oob_proba_sum[i] / oob_counts[i]
                    final_predictions[i] = unique_classes[np.argmax(avg_proba)]
                    predictions_available[i] = True
            else:
                if i in oob_predictions and len(oob_predictions[i]) > 0:
                    if is_classification:
                        final_predictions[i] = Counter(oob_predictions[i]).most_common(1)[0][0]
                    else:
                        final_predictions[i] = np.mean(oob_predictions[i])
                    predictions_available[i] = True

        # Check coverage
        coverage = np.sum(predictions_available) / n
        if coverage < self.min_oob_percentage:
            import warnings
            warnings.warn(
                f"Only {coverage:.1%} of observations have OOB predictions. "
                f"Consider increasing n_bootstrap from {self.n_bootstrap}."
            )

        # Compute final OOB error
        mask = predictions_available
        oob_error = loss_fn(y[mask], final_predictions[mask].astype(y.dtype))

        # Statistics
        return {
            'oob_error': oob_error,
            'se_bootstrap': np.std(bootstrap_errors, ddof=1),
            'mean_bootstrap_error': np.mean(bootstrap_errors),
            'coverage': coverage,
            'n_observations_with_oob': int(np.sum(predictions_available)),
            'mean_oob_size': np.mean(oob_sizes),
            'mean_oob_fraction': np.mean(oob_sizes) / n,
            'n_bootstrap_used': len(bootstrap_errors),
            'is_classification': is_classification,
            'used_probability_aggregation': has_predict_proba,
        }


if __name__ == "__main__":
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=300, n_features=15, random_state=42)

    estimator = RobustOOBEstimator(n_bootstrap=500, probability_aggregation=True)
    results = estimator.compute(X, y, GradientBoostingClassifier(n_estimators=50))

    print("Robust OOB Estimation")
    print("=" * 50)
    for k, v in results.items():
        if isinstance(v, float):
            print(f"{k}: {v:.4f}")
        else:
            print(f"{k}: {v}")
```

Both OOB error and cross-validation estimate generalization performance, but they have different strengths. Understanding when to use each is crucial for efficient and accurate model evaluation.
Detailed Comparison:
| Scenario | Recommendation | Rationale |
|---|---|---|
| Random Forest training | Use built-in OOB | Free; well-calibrated for RF |
| Comparing RF to other models | Use CV for fair comparison | OOB isn't available for non-bagging models |
| Single decision tree | Use CV or bootstrap OOB | OOB requires multiple models |
| Neural network | Use CV | OOB not standard; CV with early stopping common |
| Hyperparameter tuning | Use CV | OOB can leak info across hyperparameter settings |
| Final model evaluation | Either; report method used | Consistency matters more than choice |
| Small sample (n < 100) | Consider .632 bootstrap | Both OOB and CV have issues with tiny n |
For Random Forests, trust OOB error—it's been extensively validated. For other models, k-fold CV (k=5 or 10) is the safe default. When results matter and computation permits, use repeated 10-fold CV for lowest variance. The .632 and .632+ bootstrap are most useful when you specifically need bootstrap methodology (e.g., for ensemble methods or when you want bootstrap confidence intervals).
The out-of-bootstrap error exploits a fundamental property of bootstrap sampling—that roughly 36.8% of observations are excluded from each sample—to provide 'free' test cases for model evaluation. This property has made OOB error a cornerstone of ensemble methods, particularly Random Forests.
What's Next:
We've now covered OOB error as a standalone estimator and its role in the .632/.632+ formulas. In the next page, we'll explore bootstrap confidence intervals—using bootstrap resampling to construct intervals for any quantity of interest, including model performance metrics.
You now understand out-of-bootstrap error in depth—its computation, bias-variance characteristics, relationship to LOOCV, role in Random Forests, and when to use it versus cross-validation. This knowledge enables you to make informed choices about model evaluation strategies.