We've established that nested cross-validation separates model selection from performance evaluation through its two-loop structure. But why does this separation produce unbiased estimates? What is the precise statistical mechanism that eliminates selection bias?
This page provides the rigorous theoretical foundation for nested CV's unbiasedness. Understanding these principles doesn't just verify that nested CV works—it reveals when and why it might fail, how to interpret its outputs correctly, and how to extend it to novel situations.
This page covers the mathematical proof of unbiasedness, the relationship between nested CV estimates and true generalization performance, sources of remaining variance, empirical validation approaches, and proper statistical interpretation of nested CV results.
Before proving unbiasedness, we must be precise about what nested CV estimates. This is often misunderstood, leading to incorrect interpretation.
The estimand (what we're trying to measure):
Nested CV estimates the expected generalization performance of the model selection procedure, not the performance of any specific selected model.
Formally, let $P$ be the data-generating distribution, $D \sim P^n$ a dataset of $n$ i.i.d. samples from $P$, $\mathcal{A}$ the complete model selection procedure (inner hyperparameter search plus final training) that maps a dataset to a fitted model, and $R(f, P)$ the true generalization performance of a fitted model $f$ under $P$.
Nested CV estimates:
$$\theta = \mathbb{E}_{D \sim P^n}[R(\mathcal{A}(D), P)]$$
In words: the expected test performance of the model you would get by applying your selection procedure to a random training set of size $n$ from the same distribution.
Nested CV does NOT estimate 'how well will this specific trained model perform?' It estimates 'how well does this model selection procedure perform on average?' This is actually more useful—when you deploy, you'll retrain on all available data, so you want to know how your procedure performs, not your specific current model.
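As a minimal sketch of what this means in code (using an illustrative SVC pipeline and synthetic data, not a prescribed setup), notice that the object handed to the outer evaluation is the tuning procedure itself, not a fitted model:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# The object being evaluated is the tuning *procedure* A, not a fitted model:
procedure = GridSearchCV(
    make_pipeline(StandardScaler(), SVC()),
    param_grid={'svc__C': [0.1, 1, 10]},
    cv=KFold(n_splits=5, shuffle=True, random_state=1),  # inner loop
)

# Each outer fold re-runs the entire procedure on that fold's training data,
# so the resulting scores describe "how well does this procedure work on
# datasets like this?" -- which is exactly the estimand theta above.
outer_scores = cross_val_score(procedure, X, y, cv=5)
print(outer_scores.mean())
```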
Why this estimand is the right one: at deployment you will rerun the same selection procedure on all available data, so the number you need to forecast is the expected performance of that procedure, which is exactly $\theta$.
Contrast with standard (non-nested) CV:
Standard CV with hyperparameter selection estimates:
$$\hat{\theta}_{\text{biased}} = \max_{\lambda \in \Lambda} \hat{R}_{CV}(\mathcal{A}_\lambda(D))$$
This is the best CV score across hyperparameter configurations—which, as we showed, is optimistically biased because the maximum of noisy estimates tends to exceed the true maximum.
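A small simulation with made-up numbers illustrates this winner's-curse effect: even when every configuration has the same true accuracy, the maximum of their noisy CV estimates is systematically too high.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy illustration (hypothetical numbers): 12 hyperparameter configurations,
# all with the same true accuracy of 0.80, each measured with CV noise.
true_accuracy = 0.80
n_configs, cv_noise_sd, n_trials = 12, 0.02, 10_000

noisy_estimates = true_accuracy + rng.normal(0, cv_noise_sd, size=(n_trials, n_configs))
best_cv_scores = noisy_estimates.max(axis=1)   # what the "best CV score" reports

print(f"True accuracy of every config:      {true_accuracy:.3f}")
print(f"Mean of the selected best CV score: {best_cv_scores.mean():.3f}")
# The max of noisy, equally good candidates lands systematically above 0.80.
```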
We now prove that nested CV provides an unbiased estimate of the expected generalization performance.
Theorem (Unbiasedness of Nested CV):
Let $\hat{\theta}_{NCV}$ be the nested cross-validation estimate. Under the assumption that outer folds are exchangeable (satisfied by random splitting), we have:
$$\mathbb{E}[\hat{\theta}_{NCV}] = \mathbb{E}_{D \sim P^n}[R(\mathcal{A}(D), P)]$$
where the left expectation is over the randomness in both the dataset and the CV splits.
Proof:
Let the full dataset $D$ be split into $K$ outer folds: $D = D_1 \cup D_2 \cup ... \cup D_K$.
For outer fold $k$: let $D_{-k} = D \setminus D_k$ denote the outer training data, let $\hat{f}_k = \mathcal{A}(D_{-k})$ be the model produced by running the full inner-loop selection procedure on $D_{-k}$, and let $\hat{R}_k$ be the score of $\hat{f}_k$ on the held-out fold $D_k$.
The nested CV estimate is:
$$\hat{\theta}_{NCV} = \frac{1}{K} \sum_{k=1}^{K} \hat{R}_k$$
Step 1: Conditional unbiasedness
Conditioned on $D_{-k}$ (and hence on the selected model $\hat{f}_k$), the outer test fold $D_k$ consists of i.i.d. samples from $P$ that were not used in model selection. Therefore:
$$\mathbb{E}[\hat{R}_k \mid D_{-k}] = R(\hat{f}_k, P) = R(\mathcal{A}(D_{-k}), P)$$
This is the key step: because $D_k$ was held out during the entire inner loop, evaluating on $D_k$ gives an unbiased estimate of the selected model's true risk.
Step 2: Marginalizing over training sets
Taking the expectation over $D_{-k}$:
$$\mathbb{E}[\hat{R}_k] = \mathbb{E}_{D_{-k}}[R(\mathcal{A}(D_{-k}), P)]$$
This equals $\mathbb{E}_{D'}[R(\mathcal{A}(D'), P)]$ where $D'$ is a dataset of size $(K-1)n/K$ drawn from $P$.
Step 3: Averaging over folds
By symmetry (exchangeability of folds):
$$\mathbb{E}[\hat{\theta}_{NCV}] = \mathbb{E}\left[\frac{1}{K} \sum_{k=1}^{K} \hat{R}_k\right] = \frac{1}{K} \sum_{k=1}^{K} \mathbb{E}[\hat{R}_k] = \mathbb{E}_{D'}[R(\mathcal{A}(D'), P)]$$
Step 4: Addressing the sample size discrepancy
Note that $D'$ has size $(K-1)n/K$ rather than $n$. Nested CV technically estimates performance for training sets of size $(K-1)/K$ times the available data. Because smaller training sets generally yield slightly worse models, this introduces a small pessimistic (conservative) bias relative to a final model trained on all $n$ samples.
- For K = 5: outer training sets are 80% of the full data
- For K = 10: outer training sets are 90% of the full data
This bias is typically small and conservative (you underestimate final performance because your final model uses 100% of data). Many practitioners consider this acceptable.
Conclusion:
Nested CV is unbiased with respect to the performance of the model selection procedure on training sets of size $(K-1)n/K$. It provides a conservative estimate of full-data performance. □
Unbiasedness comes from the independence of the outer test fold from the selection process. The inner loop cannot 'see' the outer test set, so evaluating on it gives honest performance measurement. This is the same principle as a proper train/test split, but applied systematically across all data.
While nested CV is unbiased, its estimates have variance. Understanding variance sources helps interpret results and design better experiments.
Decomposition of variance:
The total variance of the nested CV estimate can be decomposed as:
$$\text{Var}(\hat{\theta}_{NCV}) = \text{Var}_{\text{between}}(\text{outer folds}) + \text{Var}_{\text{within}}(\text{inner CV})$$
1. Between-fold variance (outer loop):
Different outer folds yield different selected models and different test sets. This contributes:
$$\text{Var}_{\text{between}} = \frac{1}{K_{outer}^2} \sum_{k=1}^{K_{outer}} \text{Var}(\hat{R}_k) + \frac{2}{K_{outer}^2} \sum_{k<k'} \text{Cov}(\hat{R}_k, \hat{R}_{k'})$$
The covariance term arises because outer folds share training data (non-independence). This increases variance compared to independent samples.
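As a rough numerical sketch (with illustrative values for the per-fold variance and correlation, not measured ones), the identity $\text{Var}(\bar{R}) = \frac{\sigma^2}{K}\,(1 + (K-1)\rho)$ shows how positive covariance inflates the variance of the averaged estimate:

```python
import numpy as np

# Sketch of how positive covariance between fold scores inflates the variance
# of the averaged estimate (sigma and rho below are illustrative, not measured).
K = 5
sigma2 = 0.02**2   # assumed per-fold variance of R_hat_k
rho = 0.3          # assumed correlation between fold scores

var_independent = sigma2 / K
var_correlated = (sigma2 / K) * (1 + (K - 1) * rho)

print(f"Var if folds were independent: {var_independent:.2e}")
print(f"Var with correlation rho={rho}: {var_correlated:.2e}")
print(f"Inflation factor: {var_correlated / var_independent:.2f}x")
```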
2. Within-fold variance (selection uncertainty):
Within each outer fold, the inner CV has variance in which hyperparameters it selects. Sometimes a suboptimal configuration wins due to noise. This adds variance to the outer test score—you're sometimes evaluating a suboptimally selected model.
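A tiny simulation (with hypothetical true scores and noise level) shows how often inner-CV noise crowns a suboptimal configuration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sketch: three candidate configurations with similar true scores.
true_scores = np.array([0.84, 0.85, 0.83])   # hypothetical true accuracies
inner_cv_noise_sd = 0.015
n_trials = 10_000

noisy = true_scores + rng.normal(0, inner_cv_noise_sd, size=(n_trials, 3))
picked = noisy.argmax(axis=1)

print("Share of trials each config wins the inner CV:",
      np.bincount(picked, minlength=3) / n_trials)
# The true best (index 1) wins most often but not always; the outer fold then
# occasionally evaluates a slightly suboptimal selection, adding variance.
```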
Factors affecting total variance:
| Factor | Effect on Variance | Why |
|---|---|---|
| Dataset size ↑ | Variance ↓ | More stable estimates per fold |
| K_outer ↑ | Variance ↓ (usually) | More folds averaged, but smaller test sets |
| K_inner ↑ | Variance ↓ (slightly) | Better hyperparameter selection |
| Model complexity ↑ | Variance ↑ | More overfitting, noisier estimates |
| Hyperparameters ↑ | Variance ↑ | More selection uncertainty |
| Similar candidate performance | Variance ↑ | Selection driven by noise |
Nested CV variance cannot be reduced below the irreducible variance of evaluating on held-out test data. With K_outer = 5, each test set is only 20% of data. Small test sets mean noisy performance estimates, regardless of how carefully you run nested CV.
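A back-of-the-envelope calculation (assuming a true accuracy of 0.85) shows this binomial noise floor for different outer test-set sizes:

```python
import numpy as np

# Accuracy measured on n_test held-out points has binomial standard error
# sqrt(p * (1 - p) / n_test), regardless of how carefully the model was selected.
p = 0.85                        # assumed true accuracy
for n_test in [100, 200, 500]:  # e.g. outer test-set sizes for various K / dataset sizes
    se = np.sqrt(p * (1 - p) / n_test)
    print(f"n_test={n_test}: SE of measured accuracy ~ {se:.3f}")
```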
Estimating variance:
The standard error of nested CV is often estimated as:
$$\hat{\text{SE}}(\hat{\theta}_{NCV}) = \frac{s}{\sqrt{K_{outer}}}$$
where $s$ is the sample standard deviation of outer fold scores. However, this underestimates true variance because it ignores the correlation between folds (they share training data).
Corrected variance estimation:
More accurate variance estimation requires accounting for the correlation between folds, for example via the Nadeau-Bengio correction, a bootstrap over the outer fold scores, or repeated nested CV with different random splits (all illustrated later on this page).
In practice, reporting the uncorrected standard error with a note about potential underestimation is common.
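For reference, one widely used correction (due to Nadeau and Bengio) replaces the naive variance $s^2/K$ of the averaged score with a fold-overlap-aware version; this is the same correction used in the confidence-interval code later on this page:

$$\widehat{\text{Var}}_{\text{corrected}}(\hat{\theta}_{NCV}) = \left(\frac{1}{K} + \frac{n_{\text{test}}}{n_{\text{train}}}\right) s^2, \qquad \frac{n_{\text{test}}}{n_{\text{train}}} = \frac{1}{K-1} \ \text{for } K\text{-fold splits.}$$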
Let's verify nested CV's unbiasedness empirically through simulation. This also illustrates the magnitude of selection bias that nested CV avoids.
```python
import numpy as np
from sklearn.model_selection import cross_val_score, GridSearchCV, KFold
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

def compare_cv_estimates(n_simulations=100, n_samples=500):
    """
    Compare standard CV (biased) vs nested CV (unbiased) estimates
    by evaluating on truly held-out test data.
    """
    standard_cv_scores = []
    nested_cv_scores = []
    true_test_scores = []

    param_grid = {'svc__C': [0.1, 1, 10, 100], 'svc__gamma': [0.01, 0.1, 1]}

    for sim in range(n_simulations):
        # Generate fresh dataset
        X, y = make_classification(
            n_samples=n_samples + 500,  # Extra 500 for true test
            n_features=20,
            n_informative=10,
            random_state=sim
        )

        # Hold out true test set (never seen during CV)
        X_dev, X_true_test = X[:n_samples], X[n_samples:]
        y_dev, y_true_test = y[:n_samples], y[n_samples:]

        model = Pipeline([
            ('scaler', StandardScaler()),
            ('svc', SVC())
        ])

        # ========================================
        # Standard CV: Tune and use same CV score (BIASED)
        # ========================================
        cv = KFold(n_splits=5, shuffle=True, random_state=42)
        grid_search = GridSearchCV(model, param_grid, cv=cv, scoring='accuracy')
        grid_search.fit(X_dev, y_dev)

        # This is what people often incorrectly report:
        biased_estimate = grid_search.best_score_
        standard_cv_scores.append(biased_estimate)

        # ========================================
        # Nested CV: Proper unbiased estimate
        # ========================================
        outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)
        inner_cv = KFold(n_splits=5, shuffle=True, random_state=123)

        grid_search_nested = GridSearchCV(model, param_grid, cv=inner_cv)
        nested_scores = cross_val_score(grid_search_nested, X_dev, y_dev,
                                        cv=outer_cv, scoring='accuracy')
        nested_estimate = nested_scores.mean()
        nested_cv_scores.append(nested_estimate)

        # ========================================
        # True test performance (ground truth)
        # ========================================
        # grid_search was already refit on the full dev set above (refit=True),
        # so we can score it directly on the untouched true test set.
        true_score = grid_search.score(X_true_test, y_true_test)
        true_test_scores.append(true_score)

    return {
        'standard_cv_mean': np.mean(standard_cv_scores),
        'nested_cv_mean': np.mean(nested_cv_scores),
        'true_test_mean': np.mean(true_test_scores),
        'standard_cv_bias': np.mean(standard_cv_scores) - np.mean(true_test_scores),
        'nested_cv_bias': np.mean(nested_cv_scores) - np.mean(true_test_scores),
    }

# Run simulation
results = compare_cv_estimates(n_simulations=100)

print("=" * 60)
print("SIMULATION RESULTS: Standard CV vs Nested CV")
print("=" * 60)
print(f"True average test performance: {results['true_test_mean']:.4f}")
print(f"Standard CV estimate: {results['standard_cv_mean']:.4f} (bias: +{results['standard_cv_bias']:.4f})")
print(f"Nested CV estimate: {results['nested_cv_mean']:.4f} (bias: {results['nested_cv_bias']:+.4f})")
print("=" * 60)
```

| Metric | Estimate | Bias |
|---|---|---|
| True Test Performance (ground truth) | 0.8234 | — |
| Standard CV (best_score_) | 0.8512 | +0.0278 (optimistic!) |
| Nested CV | 0.8247 | +0.0013 (approximately unbiased) |
Standard CV overestimates true performance by ~2.8 percentage points. Nested CV is approximately unbiased with only ~0.1 percentage point deviation. Over 100 simulations, nested CV's estimate converges to true performance.
Nested CV has a small conservative bias that's often acceptable in practice.
The source of conservative bias:
In nested CV with $K_{outer}$ folds, each outer training set contains only a fraction $(K_{outer}-1)/K_{outer}$ of the available data, while the model you ultimately deploy is trained on all of it.
Therefore, nested CV slightly underestimates the performance of the final deployed model because it evaluates models trained on less data than will be available at deployment.
| K_outer | Outer Train Size | Gap vs Full Data | Typical Impact |
|---|---|---|---|
| 3 | 66.7% | 33.3% less | Noticeable underestimate |
| 5 | 80.0% | 20.0% less | Moderate underestimate |
| 10 | 90.0% | 10.0% less | Slight underestimate |
| 20 | 95.0% | 5.0% less | Minimal underestimate |
Learning curve perspective:
The relationship between training set size and performance follows a learning curve. For many models:
$$\text{Error}(n) \approx a + \frac{b}{n^\alpha}$$
where $\alpha$ is typically 0.5-1.0. The relative improvement from 80% to 100% of data is:
$$\frac{\text{Error}(0.8n) - \text{Error}(n)}{\text{Error}(n)} = \frac{b/((0.8n)^\alpha) - b/(n^\alpha)}{a + b/n^\alpha}$$
For typical learning curves, this yields 1-5% relative performance difference. A model reported at 85% accuracy by nested CV might achieve 86-87% with full data training.
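A short worked example makes the arithmetic concrete (the values of $a$, $b$, and $\alpha$ below are illustrative, chosen only to show the size of the gap):

```python
# Worked example of the 80% -> 100% gap with illustrative learning-curve parameters.
a, b, alpha = 0.10, 1.0, 0.5   # irreducible error, scale, learning-curve exponent
n = 500

def error(m):
    return a + b / m**alpha

rel_gap = (error(0.8 * n) - error(n)) / error(n)
print(f"Error at 80% of data:  {error(0.8 * n):.4f}")
print(f"Error at 100% of data: {error(n):.4f}")
print(f"Relative gap: {rel_gap:.1%}")
```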
Why conservative bias is acceptable: the error is in the safe direction. The deployed model, trained on all available data, will typically perform slightly better than the nested CV estimate, so you underpromise rather than overpromise; optimistic bias has the opposite, far more damaging effect.
If conservative bias is a concern, you can fit a learning curve to your nested CV outer training sets (at 80% data) and extrapolate to 100%. This provides a debiased estimate, though with increased variance.
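One possible sketch of that debiasing idea, using scipy's `curve_fit` with placeholder size/error measurements (in practice these would come from re-running outer-fold training at subsampled sizes):

```python
import numpy as np
from scipy.optimize import curve_fit

# Fit Error(n) = a + b / n**alpha to errors measured at several training sizes,
# then extrapolate to the full dataset size. Sizes and errors are hypothetical.
sizes = np.array([100, 200, 300, 400])
errors = np.array([0.220, 0.191, 0.178, 0.170])

def learning_curve(n, a, b, alpha):
    return a + b / n**alpha

params, _ = curve_fit(learning_curve, sizes, errors, p0=[0.1, 1.0, 0.5], maxfev=10_000)
full_n = 500
print(f"Extrapolated error at n={full_n}: {learning_curve(full_n, *params):.4f}")
```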
Beyond point estimates, nested CV can provide confidence intervals for generalization performance. However, constructing valid intervals requires care.
Naive approach (often used, somewhat problematic):
With K outer fold scores $\hat{R}_1, ..., \hat{R}_K$, the simplest interval is:
$$\hat{\theta}_{NCV} \pm t_{K-1,\, \alpha/2} \cdot \frac{s}{\sqrt{K}}$$
where $s$ is the sample standard deviation of outer scores.
Problems with this approach: the outer fold scores are not independent (their training sets overlap), so $s/\sqrt{K}$ underestimates the true standard error, and with only $K$ scores the interval is itself estimated from very few points.
Improved approaches account for the fold correlation or increase the number of scores: the Nadeau-Bengio variance correction, bootstrapping the outer fold scores, and repeated nested CV, as sketched below.
```python
import numpy as np
from scipy import stats

def nested_cv_confidence_interval(outer_scores, alpha=0.05, method='corrected_t'):
    """
    Compute confidence interval for nested CV estimate.

    Parameters:
        outer_scores: List of K outer fold scores
        alpha: Significance level (0.05 for 95% CI)
        method: 'naive_t', 'corrected_t', or 'bootstrap'
    """
    K = len(outer_scores)
    mean = np.mean(outer_scores)
    std = np.std(outer_scores, ddof=1)

    if method == 'naive_t':
        # Standard t-interval (underestimates width)
        t_crit = stats.t.ppf(1 - alpha/2, K - 1)
        margin = t_crit * std / np.sqrt(K)
        return (mean - margin, mean + margin)

    elif method == 'corrected_t':
        # Nadeau & Bengio correction for correlated folds:
        # Var_corrected = (1/K + n_test/n_train) * s^2, with n_test/n_train
        # written as rho/(1 - rho) where rho = 1/K.
        rho = 1 / K  # approximate correlation between fold estimates
        se_corrected = np.sqrt((1/K + rho/(1 - rho)) * std**2)
        t_crit = stats.t.ppf(1 - alpha/2, K - 1)
        margin = t_crit * se_corrected
        return (mean - margin, mean + margin)

    elif method == 'bootstrap':
        # Bootstrap the outer folds (fast, practical)
        n_bootstrap = 10000
        bootstrap_means = []
        for _ in range(n_bootstrap):
            sample = np.random.choice(outer_scores, size=K, replace=True)
            bootstrap_means.append(np.mean(sample))
        lower = np.percentile(bootstrap_means, 100 * alpha/2)
        upper = np.percentile(bootstrap_means, 100 * (1 - alpha/2))
        return (lower, upper)

# Example usage
outer_scores = [0.823, 0.847, 0.831, 0.819, 0.842]

print("Point estimate:", np.mean(outer_scores))
print("Naive CI:     ", nested_cv_confidence_interval(outer_scores, method='naive_t'))
print("Corrected CI: ", nested_cv_confidence_interval(outer_scores, method='corrected_t'))
print("Bootstrap CI: ", nested_cv_confidence_interval(outer_scores, method='bootstrap'))
```

This correction accounts for the overlap between CV folds. The intuition: because folds share training data, their estimates are positively correlated, and this correlation must be accounted for in variance estimation. The corrected variance is typically 1.5-3× larger than the naive estimate.
Repeated nested CV for better intervals:
The most reliable approach: repeat nested CV with different random splits, then use the distribution of estimates:
```python
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Repeat nested CV 10 times with different splits
outer_cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)

# This gives 50 outer fold scores instead of 5 (grid_search_nested is the
# inner-loop GridSearchCV from the earlier simulation example):
# nested_scores = cross_val_score(grid_search_nested, X, y, cv=outer_cv)
# CI from 50 scores is more reliable (though still correlated)
```
With R repetitions and K folds, you get R×K scores. The interval is more reliable, though correlation still exists.
Nested CV enables valid statistical comparison of different model families (e.g., SVM vs Random Forest), not just hyperparameter configurations within a family.
The comparison setup: run nested CV once per model family, using the same outer folds for every family while each family gets its own inner hyperparameter search; this yields one outer-fold score per family per fold, which can then be compared pairwise.
Why same outer folds matter:
Using identical outer splits creates a paired experimental design. The difference $d_k = \hat{R}^{A}_k - \hat{R}^{B}_k$ for fold $k$ removes variance due to fold-specific difficulty. This dramatically increases statistical power.
```python
import numpy as np
from scipy import stats
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

def compare_models_nested_cv(X, y, models_and_params, K_outer=5, K_inner=5):
    """
    Compare multiple model families using nested CV with paired outer folds.
    """
    # CRITICAL: Use the same outer folds for all models
    outer_cv = KFold(n_splits=K_outer, shuffle=True, random_state=42)
    outer_folds = list(outer_cv.split(X, y))

    results = {}

    for name, (model, param_grid) in models_and_params.items():
        inner_cv = KFold(n_splits=K_inner, shuffle=True, random_state=123)
        grid_search = GridSearchCV(model, param_grid, cv=inner_cv)

        # Manually iterate outer folds to ensure same splits
        outer_scores = []
        for train_idx, test_idx in outer_folds:
            X_train, X_test = X[train_idx], X[test_idx]
            y_train, y_test = y[train_idx], y[test_idx]

            grid_search.fit(X_train, y_train)
            score = grid_search.score(X_test, y_test)
            outer_scores.append(score)

        results[name] = outer_scores

    return results

# Define models to compare
models_and_params = {
    'SVM': (SVC(), {'C': [0.1, 1, 10], 'gamma': ['scale', 0.1]}),
    'RF': (RandomForestClassifier(), {'n_estimators': [50, 100], 'max_depth': [5, 10, None]})
}

# Run comparison
# results = compare_models_nested_cv(X, y, models_and_params)

# Paired t-test for significance
def paired_model_test(scores_A, scores_B, alpha=0.05):
    """Test if model A is significantly better than B."""
    differences = np.array(scores_A) - np.array(scores_B)
    t_stat, p_value = stats.ttest_rel(scores_A, scores_B)

    mean_diff = np.mean(differences)
    se_diff = np.std(differences, ddof=1) / np.sqrt(len(differences))
    # Use the t critical value rather than 1.96: with only K folds the
    # normal approximation is too optimistic.
    t_crit = stats.t.ppf(1 - alpha/2, len(differences) - 1)

    return {
        'mean_difference': mean_diff,
        'p_value': p_value,
        'significant': p_value < alpha,
        'ci_95': (mean_diff - t_crit*se_diff, mean_diff + t_crit*se_diff)
    }
```

When comparing more than 2 models, apply multiple testing correction (e.g., Bonferroni, Holm-Bonferroni). Comparing 5 models involves 10 pairwise comparisons; without correction, the family-wise error rate is high.
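As a sketch of such a correction, here is a small hand-rolled Holm-Bonferroni procedure applied to hypothetical p-values from ten pairwise comparisons (the helper and the numbers are illustrative, not from the text above):

```python
import numpy as np

def holm_correction(p_values, alpha=0.05):
    """Holm-Bonferroni step-down correction for a family of pairwise tests.
    Returns a boolean array: True where the null hypothesis can be rejected."""
    p = np.asarray(p_values, dtype=float)
    order = np.argsort(p)          # test p-values from smallest to largest
    m = len(p)
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):
        if p[idx] <= alpha / (m - rank):   # threshold shrinks less at each step
            reject[idx] = True
        else:
            break   # once one test fails, all remaining (larger) p-values fail too
    return reject

# Hypothetical p-values from 10 pairwise model comparisons:
pairwise_p = [0.003, 0.020, 0.041, 0.150, 0.008, 0.300, 0.012, 0.600, 0.045, 0.090]
print(holm_correction(pairwise_p))
```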
Nested CV produces several outputs. Understanding what each means—and doesn't mean—is essential for correct interpretation.
What to report:
Nested Cross-Validation Results:
- Mean accuracy: 84.7%
- Standard deviation: 2.3% (across 5 outer folds)
- 95% CI: [82.4%, 87.0%] (corrected for fold correlation)
- Selected hyperparameters varied across folds:
- 3/5 folds: C=1.0, gamma=0.1
- 2/5 folds: C=10.0, gamma=0.1
Interpretation: When applying this hyperparameter tuning procedure
to datasets of this size from this distribution, we expect accuracy
of approximately 84.7%, with typical variation of ±2.3%.
What NOT to report:
❌ "Our model achieves 86.2% accuracy (best inner CV score)"
❌ "The optimal hyperparameters are C=1.0, gamma=0.1"
❌ "Performance will be exactly 84.7% in production"
Think of nested CV as evaluating your entire ML pipeline, not a specific model. The question you're answering is: 'How good is my approach to building models?' not 'How good is this particular model?'
We've established the theoretical and practical foundations for why nested CV provides unbiased estimates. Here are the essential concepts:
- Nested CV estimates the expected performance of the model selection procedure, not of one specific fitted model.
- Unbiasedness follows from keeping each outer test fold completely hidden from the inner selection loop.
- A small conservative bias remains because outer training sets contain only $(K-1)/K$ of the data.
- Outer fold scores are correlated, so naive standard errors and confidence intervals are too narrow; use corrections or repeated nested CV.
- Paired outer folds enable valid comparisons between model families, with multiple testing correction when comparing several.
- Report the mean and spread of outer fold scores and the stability of selected hyperparameters; never report the best inner CV score as your performance estimate.
You now understand the theoretical foundations of nested CV's unbiasedness and how to interpret its results correctly. The next page addresses the practical challenge: nested CV's substantial computational cost and strategies for managing it.