The .632 bootstrap, while elegant, has a critical weakness: it can dramatically underestimate generalization error for models that overfit their training data. When a model achieves near-zero training error (interpolating the data), the apparent error contribution becomes negligible, and the .632 estimator relies entirely on the OOB error with a fixed weight of 0.632.
Consider an extreme case: a 1-nearest-neighbor classifier achieves perfect training accuracy (each point is its own nearest neighbor). The apparent error is zero, so:
$$\widehat{\text{Err}}^{(.632)} = 0.368 \times 0 + 0.632 \times \widehat{\text{Err}}^{(1)} = 0.632 \times \widehat{\text{Err}}^{(1)}$$
But this underweights the OOB error! For a severely overfit model, the true error is closer to the OOB error than to 63.2% of it.
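A minimal numeric sketch makes the problem concrete (the 0.30 OOB error here is an assumed value for illustration, not a measured one):

```python
# Sketch of the 1-NN failure mode; the OOB error is an assumed value.
apparent_error = 0.0   # 1-NN: every training point is its own nearest neighbor
oob_error = 0.30       # assumed leave-one-out bootstrap (OOB) error

err_632 = 0.368 * apparent_error + 0.632 * oob_error
print(f".632 estimate: {err_632:.3f}")   # 0.190

# For an interpolating model the true generalization error is close to
# oob_error (~0.30), so the fixed-weight .632 estimate is far too optimistic.
```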
The .632+ bootstrap addresses this by introducing an adaptive weighting that accounts for the degree of overfitting.
By the end of this page, you will understand: (1) The 'no-information' error rate and its role in detecting overfitting; (2) The relative overfitting rate that drives adaptive weighting; (3) Complete mathematical derivation of the .632+ formula; (4) Implementation with full diagnostics; (5) When .632+ is preferred over .632 and vice versa.
The key innovation in .632+ is the no-information error rate (denoted $\gamma$), which quantifies the error rate when features provide no information about the response. This represents the baseline error achievable by pure chance.
Definition:
The no-information error rate $\gamma$ is the expected error when the relationship between features $X$ and response $Y$ is completely random—when we randomly permute the response labels to destroy any true association.
For classification, $\gamma$ is computed as: $$\gamma = \sum_{k=1}^{K} p_k (1 - q_k)$$
where:
- $p_k$ is the observed proportion of true responses in class $k$
- $q_k$ is the proportion of predictions assigned to class $k$
Intuition: If the model predicts class $k$ with frequency $q_k$, and the true class is $k$ with frequency $p_k$, then under independence, the probability of a correct prediction for class $k$ is $p_k \times q_k$. The error rate is $1 - \sum_k p_k q_k$.
This matches the formula above: because the class proportions sum to one ($\sum_k p_k = 1$), the two forms are algebraically identical:
$$\gamma = \sum_{k=1}^{K} p_k (1 - q_k) = \sum_k p_k - \sum_k p_k q_k = 1 - \sum_{k=1}^{K} p_k q_k$$
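As a quick sanity check with assumed numbers: take a two-class problem with balanced true labels, $p = (0.5, 0.5)$, where the model predicts class 1 for 90% of observations, $q = (0.9, 0.1)$. Then

$$\gamma = 1 - (0.5 \times 0.9 + 0.5 \times 0.1) = 1 - 0.5 = 0.5$$

More generally, for a balanced $K$-class problem with balanced predictions ($p_k = q_k = 1/K$), the formula gives $\gamma = 1 - 1/K$ (e.g., $2/3$ for $K = 3$, the value the demonstration below reports).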
For regression with squared-error loss, the no-information error is the variance of the response: $\gamma = \text{Var}(Y)$. This is the error of predicting the mean response $E[Y]$ for all observations—the best you can do without using features.
Computing the No-Information Error:
In practice, we estimate $\gamma$ from the data as follows:
For Classification:
1. Fit the model to the full dataset and obtain predictions $\hat{y}_i$ for all $n$ observations.
2. Estimate $p_k$ as the observed fraction of true labels in class $k$, and $q_k$ as the fraction of predictions assigned to class $k$.
3. Compute $\hat{\gamma} = 1 - \sum_k \hat{p}_k \hat{q}_k$.
Alternative Permutation Approach: A more general method:
1. Randomly permute the responses $y$ to destroy any $X$–$y$ association.
2. Train the model on $(X, y_{\text{permuted}})$.
3. Evaluate its predictions against the original (unpermuted) $y$.
4. Average the loss over many permutations.
This 'permutation no-information' approach is computationally intensive but works for any model and loss function.
```python
import numpy as np
from collections import Counter
from typing import Callable
from sklearn.base import clone


def compute_no_information_error_classification(y_true: np.ndarray,
                                                y_pred: np.ndarray) -> float:
    """
    Compute the no-information error rate for classification.

    The no-information error γ represents the error rate when features
    provide no information about the response:

        γ = Σ_k p_k (1 - q_k) = 1 - Σ_k p_k q_k

    where:
    - p_k = proportion of true labels in class k
    - q_k = proportion of predictions in class k

    Parameters
    ----------
    y_true : np.ndarray
        True class labels
    y_pred : np.ndarray
        Predicted class labels (from the model trained on the same data)

    Returns
    -------
    float
        No-information error rate γ
    """
    n = len(y_true)

    # All classes appearing in either the labels or the predictions
    classes = np.unique(np.concatenate([y_true, y_pred]))

    # Class proportions in the true labels (p_k)
    true_counts = Counter(y_true)
    p = {k: true_counts.get(k, 0) / n for k in classes}

    # Class proportions in the predictions (q_k)
    pred_counts = Counter(y_pred)
    q = {k: pred_counts.get(k, 0) / n for k in classes}

    # γ = 1 - Σ p_k q_k
    return 1.0 - sum(p[k] * q[k] for k in classes)


def compute_no_information_error_permutation(X: np.ndarray,
                                             y: np.ndarray,
                                             model,
                                             loss_fn: Callable,
                                             n_permutations: int = 50,
                                             random_state: int = 42) -> float:
    """
    Compute the no-information error via permutation.

    A more general approach that works for any model and loss function:
    1. Permute y to destroy the X-y relationship
    2. Train the model on (X, y_permuted)
    3. Evaluate on (X, y_original)
    4. Average over many permutations

    Parameters
    ----------
    X : np.ndarray
        Features
    y : np.ndarray
        Original responses
    model : sklearn estimator
        Model to use
    loss_fn : Callable
        Loss function(y_true, y_pred) -> float
    n_permutations : int
        Number of permutations
    random_state : int
        Random seed

    Returns
    -------
    float
        Permutation-based no-information error
    """
    rng = np.random.RandomState(random_state)
    n = len(y)
    permutation_errors = []

    for _ in range(n_permutations):
        # Permute the response
        y_permuted = y[rng.permutation(n)]

        # Train on the permuted data
        model_perm = clone(model)
        model_perm.fit(X, y_permuted)

        # Predict on the original features
        y_pred = model_perm.predict(X)

        # Evaluate against the original (unpermuted) labels
        permutation_errors.append(loss_fn(y, y_pred))

    return np.mean(permutation_errors)


def compute_no_information_error_regression(y: np.ndarray) -> float:
    """
    Compute the no-information error for regression (MSE).

    For regression with squared error, the no-information error is simply
    the variance of y - the error from predicting ȳ for all observations.

    Parameters
    ----------
    y : np.ndarray
        Response values

    Returns
    -------
    float
        No-information error (sample variance)
    """
    return np.var(y, ddof=1)


# Demonstration
if __name__ == "__main__":
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.datasets import make_classification

    # Generate data
    X, y = make_classification(n_samples=200, n_features=10, n_classes=3,
                               n_informative=5, random_state=42)

    # Train model
    model = DecisionTreeClassifier(max_depth=10, random_state=42)
    model.fit(X, y)
    y_pred = model.predict(X)

    # Compute the no-information error both ways
    gamma_formula = compute_no_information_error_classification(y, y_pred)
    gamma_permutation = compute_no_information_error_permutation(
        X, y,
        DecisionTreeClassifier(max_depth=10, random_state=42),
        loss_fn=lambda yt, yp: np.mean(yt != yp),
        n_permutations=100
    )

    # True class proportions
    unique, counts = np.unique(y, return_counts=True)
    print("Class proportions:", dict(zip(unique, counts / len(y))))
    print(f"\nNo-information error (formula):     γ = {gamma_formula:.4f}")
    print(f"No-information error (permutation): γ = {gamma_permutation:.4f}")
    print("\nNote: For a balanced 3-class problem, theoretical γ ≈ 2/3 = 0.667")
```

The relative overfitting rate (denoted $R$) measures how much the model overfits relative to the maximum possible overfitting. This quantity drives the adaptive weighting in .632+.
Definition:
$$R = \frac{\widehat{\text{Err}}^{(1)} - \bar{\text{err}}}{\gamma - \bar{\text{err}}}$$
where:
- $\widehat{\text{Err}}^{(1)}$ is the leave-one-out bootstrap (OOB) error
- $\bar{\text{err}}$ is the apparent (training) error
- $\gamma$ is the no-information error rate
Interpretation:
Numerator $\widehat{\text{Err}}^{(1)} - \bar{\text{err}}$: The gap between OOB error and training error. This is the 'optimism'—how much better the model performs on training data than on new data.
Denominator $\gamma - \bar{\text{err}}$: The maximum possible optimism. If the model memorized the training data while capturing no true signal, the training error would approach zero ($\bar{\text{err}} \to 0$) and the test error would approach the no-information rate $\gamma$.
$R$: The proportion of maximum possible overfitting that the model exhibits.
R can exceed 1 if the OOB error is worse than the no-information rate (possible due to variance or model pathology). R can be negative if the OOB error is less than the training error (unusual, but it can occur with small samples). In practice, R is clipped to [0, 1] to ensure valid weights.
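A minimal sketch of that clipped computation, with assumed error values:

```python
import numpy as np

def relative_overfitting_rate(apparent_error: float, oob_error: float,
                              gamma: float) -> float:
    """R = (OOB - apparent) / (gamma - apparent), clipped to [0, 1]."""
    if gamma <= apparent_error:
        # Degenerate denominator: treat as no measurable overfitting
        return 0.0
    R = (oob_error - apparent_error) / (gamma - apparent_error)
    return float(np.clip(R, 0.0, 1.0))

# Assumed values: a deep tree with near-zero training error
print(relative_overfitting_rate(0.01, 0.22, 0.50))  # ~0.43
```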
Why R Matters:
The relative overfitting rate tells us how much to trust the apparent error:
When $R \approx 0$ (low overfitting): The model generalizes well. We can trust the apparent error more, and the .632 weights are appropriate.
When $R \approx 1$ (high overfitting): The model has memorized the training data. The apparent error is misleading, and we should weight the OOB error more heavily—potentially giving it full weight.
This insight leads directly to the .632+ weighting scheme.
| R Value | Overfitting Level | Apparent Error | Model Behavior | Implication for Weighting |
|---|---|---|---|---|
| R ≈ 0 | None | Close to OOB | Generalizes well | Standard .632 weights are fine |
| R ≈ 0.3 | Low | Below OOB | Slight memorization | .632 weights still reasonable |
| R ≈ 0.5 | Moderate | Well below OOB | Significant fitting to noise | Weight OOB more heavily |
| R ≈ 0.8 | High | Near zero | Memorizing data | Heavily weight OOB |
| R ≈ 1 | Extreme | Zero | Interpolating completely | Ignore apparent error entirely |
With the relative overfitting rate $R$, we can derive adaptive weights that interpolate between the .632 estimator and pure OOB error based on the degree of overfitting.
The .632+ Estimator:
$$\widehat{\text{Err}}^{(.632+)} = (1 - \hat{w}) \times \bar{\text{err}} + \hat{w} \times \widehat{\text{Err}}^{(1)}$$
where the adaptive weight $\hat{w}$ is:
$$\hat{w} = \frac{0.632}{1 - 0.368 \times R}$$
Key Properties of $\hat{w}$:
- At $R = 0$: $\hat{w} = 0.632 / (1 - 0) = 0.632$, recovering the standard .632 weight.
- At $R = 1$: $\hat{w} = 0.632 / (1 - 0.368) = 0.632 / 0.632 = 1$, placing full weight on the OOB error.
- $\hat{w}$ increases monotonically (and nonlinearly) in $R$ between these extremes.
Thus, $\hat{w} \in [0.632, 1]$, providing adaptive weighting that increases OOB weight as overfitting increases.
The formula ŵ = 0.632/(1 - 0.368×R) is constructed so that: (1) At R=0, we get standard .632 weights; (2) At R=1, we get ŵ=1 (all OOB, no apparent error); (3) The transition is smooth. The 0.368 coefficient is exactly 1 - 0.632, ensuring the boundary conditions work out.
Alternative Formulation:
Some presentations express .632+ differently. An equivalent form is:
$$\widehat{\text{Err}}^{(.632+)} = \widehat{\text{Err}}^{(.632)} + (\widehat{\text{Err}}^{(1)} - \bar{\text{err}}) \times \frac{0.368 \times 0.632 \times R}{1 - 0.368 \times R}$$
This shows .632+ as a correction to .632, where the correction increases with $R$.
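To see the equivalence, note that the difference between the two estimators is $(\hat{w} - 0.632)(\widehat{\text{Err}}^{(1)} - \bar{\text{err}})$, and substituting the weight formula gives

$$\hat{w} - 0.632 = 0.632\left(\frac{1}{1 - 0.368 R} - 1\right) = \frac{0.368 \times 0.632 \times R}{1 - 0.368 R}$$

which is exactly the correction factor above.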
Another Common Formulation:
Define the capped OOB error rate: $$\widehat{\text{Err}}^{(1)*} = \min(\widehat{\text{Err}}^{(1)}, \gamma)$$
Then compute: $$R = \frac{\widehat{\text{Err}}^{(1)} - \bar{\text{err}}}{\gamma - \bar{\text{err}}}$$
clipped to $[0, 1]$.
The .632+ error is: $$\widehat{\text{Err}}^{(.632+)} = \widehat{\text{Err}}^{(.632)} + (\widehat{\text{Err}}^{(1)*} - \bar{\text{err}}) \times \frac{0.368 \times 0.632 \times R}{1 - 0.368 \times R}$$
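A compact sketch of this formulation, assuming the apparent error, OOB error, and $\gamma$ have already been computed as described earlier:

```python
import numpy as np

def err_632_plus(apparent_error: float, oob_error: float,
                 gamma: float) -> float:
    """.632+ via the capped-OOB ('starred') formulation."""
    err_632 = 0.368 * apparent_error + 0.632 * oob_error
    oob_capped = min(oob_error, gamma)  # Err^(1)* = min(Err^(1), gamma)
    denom = gamma - apparent_error
    R = (oob_error - apparent_error) / denom if denom > 0 else 0.0
    R = float(np.clip(R, 0.0, 1.0))
    correction = (oob_capped - apparent_error) * (0.368 * 0.632 * R) / (1 - 0.368 * R)
    return err_632 + correction

# 1-NN-like case with assumed values: apparent = 0, OOB = 0.30, gamma = 0.50
print(f"{err_632_plus(0.00, 0.30, 0.50):.4f}")  # ~0.2433
```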
Let's implement the .632+ bootstrap with full detail, handling all edge cases.
```python
import numpy as np
from collections import Counter
from typing import Callable, Dict, Optional
from sklearn.base import clone


def point_632_plus_bootstrap(X: np.ndarray,
                             y: np.ndarray,
                             model,
                             n_bootstrap: int = 200,
                             loss_fn: Optional[Callable] = None,
                             random_state: int = 42) -> Dict:
    """
    Compute the .632+ bootstrap error estimate.

    The .632+ estimator adaptively adjusts weights based on the relative
    overfitting rate R, providing better estimates for models that overfit.

    Formula:
        w = 0.632 / (1 - 0.368 * R)
        Err_632+ = (1 - w) * apparent_error + w * OOB_error

    where R is the relative overfitting rate:
        R = (OOB_error - apparent_error) / (gamma - apparent_error)

    Parameters
    ----------
    X : np.ndarray
        Feature matrix of shape (n_samples, n_features)
    y : np.ndarray
        Target vector of shape (n_samples,)
    model : sklearn estimator
        Model with fit() and predict() methods
    n_bootstrap : int
        Number of bootstrap iterations
    loss_fn : Callable, optional
        Loss function(y_true, y_pred) -> float
    random_state : int
        Random seed for reproducibility

    Returns
    -------
    Dict
        Comprehensive results including the .632+ error and diagnostics
    """
    rng = np.random.RandomState(random_state)
    n_samples = len(y)

    # Determine if classification or regression
    unique_classes = np.unique(y)
    is_classification = len(unique_classes) <= 20  # Heuristic

    if loss_fn is None:
        if is_classification:
            loss_fn = lambda y_true, y_pred: np.mean(y_true != y_pred)
        else:
            loss_fn = lambda y_true, y_pred: np.mean((y_true - y_pred) ** 2)

    # =========================================
    # Step 1: Compute Apparent Error
    # =========================================
    model_full = clone(model)
    model_full.fit(X, y)
    y_pred_train = model_full.predict(X)
    apparent_error = loss_fn(y, y_pred_train)

    # =========================================
    # Step 2: Compute No-Information Error (γ)
    # =========================================
    if is_classification:
        # γ = 1 - Σ p_k q_k
        # p_k = proportion of true labels in class k
        # q_k = proportion of predictions in class k
        n = len(y)
        true_counts = Counter(y)
        pred_counts = Counter(y_pred_train)
        gamma = 1.0
        for k in unique_classes:
            p_k = true_counts.get(k, 0) / n
            q_k = pred_counts.get(k, 0) / n
            gamma -= p_k * q_k
    else:
        # For regression: γ = Var(y)
        gamma = np.var(y, ddof=1)

    # =========================================
    # Step 3: Compute OOB Error
    # =========================================
    oob_predictions = {i: [] for i in range(n_samples)}
    bootstrap_oob_errors = []

    for b in range(n_bootstrap):
        # Generate bootstrap indices
        bootstrap_indices = rng.choice(n_samples, size=n_samples, replace=True)

        # Identify out-of-bag samples
        in_bag_set = set(bootstrap_indices)
        out_of_bag = [i for i in range(n_samples) if i not in in_bag_set]
        if len(out_of_bag) == 0:
            continue

        # Train on the bootstrap sample
        X_boot = X[bootstrap_indices]
        y_boot = y[bootstrap_indices]
        model_b = clone(model)
        model_b.fit(X_boot, y_boot)

        # OOB predictions
        X_oob = X[out_of_bag]
        y_oob_true = y[out_of_bag]
        y_oob_pred = model_b.predict(X_oob)

        for i, obs_idx in enumerate(out_of_bag):
            oob_predictions[obs_idx].append(y_oob_pred[i])

        bootstrap_oob_errors.append(loss_fn(y_oob_true, y_oob_pred))

    # Aggregate OOB predictions
    final_oob_predictions = []
    final_oob_true = []
    for i in range(n_samples):
        if len(oob_predictions[i]) > 0:
            if is_classification:
                votes = Counter(oob_predictions[i])
                final_oob_predictions.append(votes.most_common(1)[0][0])
            else:
                final_oob_predictions.append(np.mean(oob_predictions[i]))
            final_oob_true.append(y[i])

    oob_error = loss_fn(np.array(final_oob_true),
                        np.array(final_oob_predictions))

    # =========================================
    # Step 4: Compute Relative Overfitting Rate (R)
    # =========================================
    # R = (OOB - apparent) / (gamma - apparent)
    # Handle edge cases
    if gamma <= apparent_error:
        # Unusual: model appears to use features well,
        # or gamma calculation issue
        R = 0.0
    else:
        R = (oob_error - apparent_error) / (gamma - apparent_error)

    # Clip to [0, 1]
    R = np.clip(R, 0.0, 1.0)

    # =========================================
    # Step 5: Compute .632+ Error
    # =========================================
    # w = 0.632 / (1 - 0.368 * R)
    w_632_plus = 0.632 / (1 - 0.368 * R)

    # Ensure w is in the valid range [0.632, 1]
    w_632_plus = np.clip(w_632_plus, 0.632, 1.0)

    # .632+ error
    point_632_plus_error = ((1 - w_632_plus) * apparent_error
                            + w_632_plus * oob_error)

    # Also compute standard .632 for comparison
    point_632_error = 0.368 * apparent_error + 0.632 * oob_error

    # =========================================
    # Step 6: Compute Standard Error
    # =========================================
    # Bootstrap SE of the .632+ estimate
    bootstrap_632_plus = [
        (1 - w_632_plus) * apparent_error + w_632_plus * oob_err
        for oob_err in bootstrap_oob_errors
    ]
    standard_error = np.std(bootstrap_632_plus, ddof=1) if bootstrap_632_plus else 0.0

    return {
        'point_632_plus_error': point_632_plus_error,
        'point_632_error': point_632_error,
        'apparent_error': apparent_error,
        'oob_error': oob_error,
        'no_information_error': gamma,
        'relative_overfitting_rate': R,
        'adaptive_weight': w_632_plus,
        'standard_error': standard_error,
        'diagnostics': {
            'n_bootstrap': n_bootstrap,
            'n_samples': n_samples,
            'is_classification': is_classification,
            'observations_with_oob': len(final_oob_true),
            'weight_difference_from_632': w_632_plus - 0.632,
        }
    }


# Demonstration comparing .632 and .632+ for overfitting models
if __name__ == "__main__":
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score

    # Generate data
    X, y = make_classification(
        n_samples=150, n_features=20, n_informative=5,
        n_redundant=10, random_state=42
    )

    models = [
        ("Logistic Regression", LogisticRegression(max_iter=1000)),
        ("Decision Tree (depth=3)", DecisionTreeClassifier(max_depth=3)),
        ("Decision Tree (depth=20)", DecisionTreeClassifier(max_depth=20)),
        ("1-NN (extreme overfit)", KNeighborsClassifier(n_neighbors=1)),
    ]

    print("Comparing .632 vs .632+ for Different Model Flexibilities")
    print("=" * 75)

    for name, model in models:
        results = point_632_plus_bootstrap(X, y, model, n_bootstrap=300)

        # CV for comparison
        cv_error = 1 - np.mean(cross_val_score(model, X, y, cv=10))

        print(f"\n{name}:")
        print(f"  Apparent Error:  {results['apparent_error']:.4f}")
        print(f"  OOB Error:       {results['oob_error']:.4f}")
        print(f"  γ (no-info):     {results['no_information_error']:.4f}")
        print(f"  R (overfitting): {results['relative_overfitting_rate']:.4f}")
        print(f"  Weight (ŵ):      {results['adaptive_weight']:.4f}")
        print(f"  .632 Error:      {results['point_632_error']:.4f}")
        print(f"  .632+ Error:     {results['point_632_plus_error']:.4f}")
        print(f"  10-Fold CV:      {cv_error:.4f}")
        print(f"  Δ(.632+ - .632): {results['point_632_plus_error'] - results['point_632_error']:.4f}")
```

Let's visualize and analyze how the .632+ weighting adapts to different overfitting scenarios.
```python
import numpy as np


def analyze_632_plus_weights():
    """
    Analyze how .632+ weights change with the overfitting rate R.

    Key insights:
    - At R=0 (no overfitting): w = 0.632, same as .632
    - At R=1 (max overfitting): w = 1.0, ignore training error
    - The relationship is non-linear (hyperbolic)
    """
    # R values from 0 to 1
    R_values = np.linspace(0, 1, 100)

    # Compute the adaptive weight: w = 0.632 / (1 - 0.368 * R)
    weights = 0.632 / (1 - 0.368 * R_values)

    # Standard .632 weight for comparison
    standard_weight = 0.632 * np.ones_like(R_values)

    # Print key values
    print("Adaptive Weight Analysis")
    print("=" * 50)
    print(f"{'R (overfitting)':<20} {'Weight (ŵ)':<15} {'Δ from 0.632':<15}")
    print("-" * 50)
    for R in [0.0, 0.25, 0.50, 0.75, 0.90, 0.95, 1.0]:
        w = 0.632 / (1 - 0.368 * R)
        delta = w - 0.632
        print(f"{R:<20.2f} {w:<15.4f} {delta:<15.4f}")

    # Implications for error estimation
    print("\n" + "=" * 50)
    print("Implications for Error Estimation")
    print("=" * 50)
    print("""When R is high (model overfits):
- .632+ gives MORE weight to OOB error (up to 100%)
- .632+ gives LESS weight to apparent error (down to 0%)
- Result: .632+ produces HIGHER error estimates than .632

Key insight: .632+ is ALWAYS >= .632 when apparent_error < OOB_error,
which is the normal case for any model with a generalization gap.
""")

    return R_values, weights, standard_weight


def demonstrate_correction_magnitude(apparent_error: float,
                                     oob_error: float,
                                     gamma: float) -> dict:
    """
    Demonstrate how the .632+ correction depends on the error landscape.

    Parameters
    ----------
    apparent_error : float
        Training error
    oob_error : float
        Out-of-bootstrap error
    gamma : float
        No-information error rate

    Returns
    -------
    dict
        Detailed breakdown of the calculation
    """
    # Compute R
    if gamma <= apparent_error:
        R = 0.0  # Edge case
    else:
        R = (oob_error - apparent_error) / (gamma - apparent_error)
    R = np.clip(R, 0, 1)

    # Compute weight
    w = 0.632 / (1 - 0.368 * R)
    w = np.clip(w, 0.632, 1.0)

    # Compute errors
    err_632 = 0.368 * apparent_error + 0.632 * oob_error
    err_632_plus = (1 - w) * apparent_error + w * oob_error

    # Correction
    correction = err_632_plus - err_632

    return {
        'apparent_error': apparent_error,
        'oob_error': oob_error,
        'gamma': gamma,
        'R': R,
        'weight': w,
        'err_632': err_632,
        'err_632_plus': err_632_plus,
        'correction': correction,
        'interpretation': _interpret_result(R, correction)
    }


def _interpret_result(R: float, correction: float) -> str:
    """Generate a human-readable interpretation."""
    if R < 0.1:
        overfit_level = "minimal overfitting"
        recommendation = ".632 and .632+ give similar results"
    elif R < 0.4:
        overfit_level = "moderate overfitting"
        recommendation = ".632+ provides a small but useful correction"
    elif R < 0.7:
        overfit_level = "substantial overfitting"
        recommendation = ".632+ significantly corrects .632 optimism"
    else:
        overfit_level = "severe overfitting"
        recommendation = ".632+ essential; apparent error uninformative"

    return f"Model shows {overfit_level} (R={R:.3f}). {recommendation}."


if __name__ == "__main__":
    # Analyze weights
    analyze_632_plus_weights()

    # Demonstrate with concrete examples
    print("\n" + "=" * 60)
    print("Concrete Examples")
    print("=" * 60)

    examples = [
        {"name": "Linear model (low overfit)",
         "apparent": 0.20, "oob": 0.25, "gamma": 0.50},
        {"name": "Tree depth=5 (moderate overfit)",
         "apparent": 0.05, "oob": 0.15, "gamma": 0.50},
        {"name": "Deep tree (high overfit)",
         "apparent": 0.01, "oob": 0.22, "gamma": 0.50},
        {"name": "1-NN (extreme overfit)",
         "apparent": 0.00, "oob": 0.30, "gamma": 0.50},
    ]

    for ex in examples:
        print(f"\n{ex['name']}:")
        result = demonstrate_correction_magnitude(
            ex['apparent'], ex['oob'], ex['gamma']
        )
        print(f"  Apparent:   {result['apparent_error']:.4f}")
        print(f"  OOB:        {result['oob_error']:.4f}")
        print(f"  R:          {result['R']:.4f}")
        print(f"  Weight:     {result['weight']:.4f}")
        print(f"  .632:       {result['err_632']:.4f}")
        print(f"  .632+:      {result['err_632_plus']:.4f}")
        print(f"  Correction: +{result['correction']:.4f}")
        print(f"  >> {result['interpretation']}")
```

The .632+ correction is always upward (increases estimated error) relative to .632 when apparent_error < OOB_error—the typical case. This is exactly what we want: for overfit models, .632 is too optimistic, and .632+ corrects this by weighting OOB error more heavily.
Both .632 and .632+ have their place in the practitioner's toolkit. The choice depends on the model flexibility and the data characteristics.
Model-Specific Guidance:
| Model Family | Overfitting Risk | Recommendation |
|---|---|---|
| Linear/Logistic Regression | Low | .632 usually sufficient |
| Ridge/Lasso | Low-Moderate | .632 unless extreme regularization |
| Decision Trees (shallow) | Moderate | .632 or .632+ |
| Decision Trees (deep) | High | .632+ recommended |
| Random Forests | Moderate | Use built-in OOB; .632/+ for single trees |
| k-NN (k≥5) | Moderate | .632+ recommended |
| k-NN (k=1) | Extreme | .632+ essential |
| SVM | Depends on kernel | .632+ for RBF/complex kernels |
| Neural Networks | High | .632+ essential; CV may be better |
A simple heuristic: if your model's training error is less than half the OOB error (indicating significant overfitting), use .632+. Otherwise, .632 is probably fine. When in doubt, report both—the difference reveals the degree of overfitting.
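That rule of thumb is easy to encode; here is a hypothetical helper (the 0.5 threshold is the heuristic above, not a published constant):

```python
def recommend_estimator(apparent_error: float, oob_error: float) -> str:
    """Heuristic choice between .632 and .632+ (rule of thumb, not a formal test)."""
    if oob_error > 0 and apparent_error < 0.5 * oob_error:
        return ".632+"  # large generalization gap: suspect overfitting
    return ".632"       # modest gap: fixed .632 weights are likely fine

print(recommend_estimator(apparent_error=0.05, oob_error=0.20))  # .632+
```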
While .632+ is a significant improvement over .632, it's not a panacea. Understanding its limitations helps you choose the right tool.
Alternatives to Consider:
Repeated k-Fold Cross-Validation: For most practical purposes, 10×10 repeated CV provides excellent error estimates with lower variance than bootstrap methods.
Nested Cross-Validation: When model selection is involved, nested CV provides less biased estimates than any bootstrap variant.
Holdout with Large Test Set: If data is abundant, a large held-out test set is the gold standard.
Leave-One-Out Bootstrap (pure OOB): Simpler than .632+, slightly pessimistic but often adequate.
Subsampling without replacement: Avoids some bootstrap issues related to duplicate observations.
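As a sketch of the first alternative, 10×10 repeated CV takes only a few lines with scikit-learn (the model and dataset here are illustrative placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=150, n_features=20, random_state=42)
model = LogisticRegression(max_iter=1000)

# 10 folds x 10 repeats = 100 fits; averaging over repeats reduces variance
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"Repeated-CV error: {1 - scores.mean():.4f} (SD {scores.std():.4f})")
```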
For deep neural networks and other extremely flexible models, bootstrap methods (including .632+) may not work well. The models can overfit so dramatically that even .632+ is optimistic. Use a proper train/validation/test split or k-fold CV with careful early stopping instead.
The .632+ bootstrap represents a sophisticated evolution of the .632 estimator, addressing its key weakness: optimistic bias for overfit models. By introducing the no-information error rate and relative overfitting rate, .632+ achieves adaptive weighting that ranges from standard .632 (for well-generalizing models) to pure OOB error (for interpolating models).
What's Next:
We've now covered the two main bias-corrected bootstrap estimators. In the next page, we'll take a deeper dive into the out-of-bootstrap error itself—understanding its properties, when it's a good standalone estimator, and its relationship to cross-validation.
You now understand the .632+ bootstrap—the go-to method for estimating generalization error when dealing with potentially overfit models. This adaptive approach automatically adjusts for overfitting, providing robust error estimates across a wide range of model flexibilities.