Reporting a model's accuracy as "87.3%" without uncertainty quantification is incomplete and potentially misleading. Does this mean we're confident the true performance is between 87.2% and 87.4%? Or could it reasonably be anywhere from 82% to 92%?
Confidence intervals provide the answer. They transform point estimates into ranges that capture our uncertainty about the true value. For cross-validation estimates, constructing proper confidence intervals is surprisingly subtle—the standard formulas taught in statistics courses don't directly apply because CV fold estimates are correlated.
This page develops rigorous methods for uncertainty quantification in CV, ensuring your performance claims are both precise and honest.
By the end of this page, you will understand: (1) Why standard CI formulas underestimate CV uncertainty, (2) Corrected variance estimators that account for fold correlation, (3) Bootstrap methods for CV confidence intervals, (4) Proper interpretation of CV confidence intervals, and (5) Best practices for reporting uncertainty.
The naive approach computes a confidence interval from k fold scores as if they were independent observations:
$$\text{CI}_{\text{naive}} = \bar{E} \pm t_{\alpha/2,\, k-1} \cdot \frac{s}{\sqrt{k}}$$
where $\bar{E}$ is the mean of the k fold scores, $s$ is their sample standard deviation, and $t_{\alpha/2,\, k-1}$ is the critical value of the t-distribution with $k-1$ degrees of freedom.
The Problem: Fold Estimates Are Not Independent
In k-fold CV, each training set shares (k-2)/(k-1) of its samples with every other training set. For 10-fold CV, any two training sets share 8/9 ≈ 89% of their samples, so the resulting fold estimates are strongly positively correlated.
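To see this concretely, here is a quick sketch (not part of the original analysis) that measures the pairwise training-set overlap produced by scikit-learn's `KFold`; the sample size of 500 is an arbitrary choice for illustration:

```python
import numpy as np
from sklearn.model_selection import KFold

n, k = 500, 10
kfold = KFold(n_splits=k, shuffle=True, random_state=0)

# Collect the training indices of each fold (the content of X is irrelevant here)
train_sets = [set(train_idx) for train_idx, _ in kfold.split(np.zeros((n, 1)))]

# Pairwise overlap between training sets; theory says (k-2)/(k-1) ≈ 0.889 for k=10
overlaps = [len(train_sets[i] & train_sets[j]) / len(train_sets[i])
            for i in range(k) for j in range(i + 1, k)]

print(f"Mean pairwise training-set overlap: {np.mean(overlaps):.3f}")
print(f"Theoretical (k-2)/(k-1):            {(k - 2) / (k - 1):.3f}")
```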
When observations are positively correlated, the sample variance underestimates the true variance of the mean.
```python
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from scipy import stats


def demonstrate_naive_ci_failure(X, y, model_factory, n_simulations=200, k=10):
    """
    Show that naive CIs have incorrect coverage.

    If CIs are correct, ~95% should contain the 'true' value.
    We estimate 'true' value as the mean across many CV runs.
    """
    # Run many CV iterations to estimate the 'true' expected CV value
    all_cv_means = []
    for seed in range(n_simulations):
        kfold = KFold(n_splits=k, shuffle=True, random_state=seed)
        scores = cross_val_score(model_factory(), X, y, cv=kfold)
        all_cv_means.append(scores.mean())

    # Use the grand mean as our 'true value' estimate
    true_value = np.mean(all_cv_means)
    true_variance_of_mean = np.var(all_cv_means)

    print(f"Estimated 'true' CV expected value: {true_value:.4f}")
    print(f"Observed variance of CV means: {true_variance_of_mean:.6f}")
    print()

    # Now check coverage: for each CV run, does its naive CI contain true_value?
    naive_coverage = 0
    naive_ci_widths = []

    for seed in range(n_simulations):
        kfold = KFold(n_splits=k, shuffle=True, random_state=seed)
        scores = cross_val_score(model_factory(), X, y, cv=kfold)

        # Naive CI
        mean = scores.mean()
        se_naive = scores.std() / np.sqrt(k)
        t_crit = stats.t.ppf(0.975, k - 1)
        ci_low = mean - t_crit * se_naive
        ci_high = mean + t_crit * se_naive

        naive_ci_widths.append(ci_high - ci_low)

        if ci_low <= true_value <= ci_high:
            naive_coverage += 1

    naive_coverage_rate = naive_coverage / n_simulations
    avg_naive_ci_width = np.mean(naive_ci_widths)

    print("Naive CI Analysis:")
    print("  Expected coverage: 95.0%")
    print(f"  Actual coverage:   {naive_coverage_rate:.1%}")
    print(f"  Average CI width:  {avg_naive_ci_width:.4f}")

    if naive_coverage_rate < 0.90:
        print("  ⚠️ Coverage is too low! CIs are too narrow (overconfident)")
    elif naive_coverage_rate > 0.99:
        print("  ⚠️ Coverage is too high! CIs are too wide (overly conservative)")

    return {
        'true_value': true_value,
        'true_variance': true_variance_of_mean,
        'naive_coverage': naive_coverage_rate,
        'all_cv_means': all_cv_means
    }


# Generate data
np.random.seed(42)
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           random_state=42)

print("=" * 60)
print("Demonstrating Naive CI Failure")
print("=" * 60)

result = demonstrate_naive_ci_failure(
    X, y, lambda: RandomForestClassifier(n_estimators=100, random_state=42)
)
```

Naive CIs for CV typically have 70-85% actual coverage instead of the claimed 95%. This means published claims like "accuracy 87% ± 2%" often understate uncertainty by 2-3×; the true range should be considerably wider. Overconfident CIs lead to false claims of significant differences between models.
The Correlation Effect Mathematically:
For k correlated observations with variance $\sigma^2$ and pairwise correlation $\rho$:
$$\text{Var}[\bar{X}] = \frac{\sigma^2}{k}(1 + (k-1)\rho)$$
The naive formula assumes $\rho = 0$, giving $\text{Var}[\bar{X}] = \sigma^2/k$.
If $\rho = 0.5$ and $k = 10$, the true variance of the mean is $\frac{\sigma^2}{10}\left(1 + 9 \times 0.5\right) = 5.5 \cdot \frac{\sigma^2}{10}$, while the naive formula reports only $\sigma^2/10$.
The naive estimate is too small by a factor of 5.5, so the resulting CI is $\sqrt{5.5} \approx 2.3\times$ too narrow.
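A small simulation sketch can confirm this arithmetic: draw k = 10 equicorrelated observations with ρ = 0.5 many times and compare the empirical variance of their mean with both formulas (all values here are illustrative):

```python
import numpy as np

k, rho, sigma2 = 10, 0.5, 1.0
n_sim = 100_000

# Compound-symmetric covariance: sigma^2 on the diagonal, rho*sigma^2 off-diagonal
cov = sigma2 * (rho * np.ones((k, k)) + (1 - rho) * np.eye(k))
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(k), cov, size=n_sim)

empirical_var_of_mean = samples.mean(axis=1).var()
theoretical = sigma2 / k * (1 + (k - 1) * rho)   # 0.55
naive = sigma2 / k                               # 0.10

print(f"Empirical Var[mean]:     {empirical_var_of_mean:.4f}")
print(f"Theory σ²/k(1+(k-1)ρ):   {theoretical:.4f}")
print(f"Naive σ²/k:              {naive:.4f}")
```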
Several methods have been proposed to correct for the correlation between fold estimates. We'll examine the most important ones.
1. The Nadeau-Bengio Corrected Variance (2003):
For a k-fold CV estimate, add a correction factor:
$$\widehat{\text{Var}}_{\text{corrected}} = \left(\frac{1}{k} + \frac{n_{\text{test}}}{n_{\text{train}}}\right) \cdot \hat{\sigma}^2$$
where $n_{\text{test}}$ is the number of samples in each test fold, $n_{\text{train}}$ is the number of training samples per fold, and $\hat{\sigma}^2$ is the sample variance of the k fold scores.
This adds the ratio $n_{\text{test}}/n_{\text{train}}$ to account for training set overlap.
```python
import numpy as np
from scipy import stats
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification


def nadeau_bengio_corrected_ci(scores, n_samples, n_folds, confidence=0.95):
    """
    Compute corrected CI using Nadeau-Bengio method.

    Parameters
    ----------
    scores : array of k fold scores
    n_samples : total number of samples
    n_folds : k
    confidence : confidence level (default 0.95)

    Returns
    -------
    dict with mean, corrected SE, CI
    """
    k = n_folds
    n = n_samples
    n_test = n / k
    n_train = n - n_test

    # Sample variance of fold scores
    sample_var = np.var(scores, ddof=1)  # Use ddof=1 for unbiased estimate

    # Nadeau-Bengio correction
    correction_factor = (1 / k) + (n_test / n_train)
    corrected_var = correction_factor * sample_var
    corrected_se = np.sqrt(corrected_var)

    # Naive (uncorrected) for comparison
    naive_se = np.std(scores, ddof=1) / np.sqrt(k)

    # t-critical value
    # Degrees of freedom is debatable; k-1 is common
    t_crit = stats.t.ppf((1 + confidence) / 2, k - 1)

    mean = np.mean(scores)

    # CIs
    ci_corrected = (mean - t_crit * corrected_se, mean + t_crit * corrected_se)
    ci_naive = (mean - t_crit * naive_se, mean + t_crit * naive_se)

    return {
        'mean': mean,
        'naive_se': naive_se,
        'corrected_se': corrected_se,
        'ci_naive': ci_naive,
        'ci_corrected': ci_corrected,
        'correction_factor': correction_factor,
        'inflation_ratio': corrected_se / naive_se
    }


# Demo
np.random.seed(42)
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           random_state=42)
n = len(X)
k = 10

kfold = KFold(n_splits=k, shuffle=True, random_state=42)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X, y, cv=kfold
)

result = nadeau_bengio_corrected_ci(scores, n, k)

print("Nadeau-Bengio Corrected Confidence Interval")
print("=" * 60)
print(f"Mean accuracy: {result['mean']:.4f}")
print()
print("Standard Errors:")
print(f"  Naive SE:     {result['naive_se']:.4f}")
print(f"  Corrected SE: {result['corrected_se']:.4f}")
print(f"  Inflation:    {result['inflation_ratio']:.2f}x")
print()
print("95% Confidence Intervals:")
print(f"  Naive:     [{result['ci_naive'][0]:.4f}, {result['ci_naive'][1]:.4f}]")
print(f"  Corrected: [{result['ci_corrected'][0]:.4f}, {result['ci_corrected'][1]:.4f}]")
print()
print(f"CI width ratio: "
      f"{(result['ci_corrected'][1] - result['ci_corrected'][0]) / (result['ci_naive'][1] - result['ci_naive'][0]):.2f}x")

# Show correction factor for different k
print("\n" + "=" * 60)
print("Correction Factors for Different k (n=500 samples)")
print("=" * 60)

for k in [2, 3, 5, 10, 20, 50]:
    n_test = n / k
    n_train = n - n_test
    correction = (1 / k) + (n_test / n_train)
    naive = 1 / k
    inflation = np.sqrt(correction / naive)
    print(f"k={k:2d}: correction={correction:.4f}, "
          f"naive=1/{k}={naive:.4f}, "
          f"SE inflation={inflation:.2f}x")
```

2. The Bates-Granger Adjusted Variance:
An alternative formulation adjusts for the expected correlation:
$$\widehat{\text{Var}}_{\text{BG}} = \frac{\hat{\sigma}^2}{k}\left(1 + (k-1)\hat{\rho}\right)$$
where $\hat{\rho}$ is an estimated correlation between fold scores.
Estimating $\rho$ directly is difficult with only k observations. The Nadeau-Bengio approach avoids this by using the known training set overlap structure.
3. The Conservative Approach:
When in doubt, simply don't divide by $\sqrt{k}$:
$$\text{SE}_{\text{conservative}} = s$$ (the sample standard deviation itself)
This assumes perfect correlation ($\rho = 1$) and gives very wide CIs—but they're guaranteed to have at least nominal coverage.
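To make the three options concrete, the sketch below computes the naive, Nadeau-Bengio, and conservative standard errors for the same set of fold scores; the fold scores and the assumed n = 500 are made up for illustration:

```python
import numpy as np
from scipy import stats

# Illustrative fold scores from a hypothetical 10-fold CV run on n=500 samples
fold_scores = np.array([0.86, 0.84, 0.90, 0.88, 0.85, 0.87, 0.89, 0.83, 0.88, 0.86])
k, n = len(fold_scores), 500
n_test, n_train = n / k, n - n / k

s = fold_scores.std(ddof=1)
se_naive = s / np.sqrt(k)                              # assumes rho = 0
se_nb = np.sqrt((1 / k + n_test / n_train) * s**2)     # Nadeau-Bengio correction
se_conservative = s                                    # assumes rho = 1

t_crit = stats.t.ppf(0.975, k - 1)
m = fold_scores.mean()
for name, se in [("naive", se_naive), ("Nadeau-Bengio", se_nb),
                 ("conservative", se_conservative)]:
    print(f"{name:>14}: SE={se:.4f}, "
          f"95% CI=[{m - t_crit * se:.4f}, {m + t_crit * se:.4f}]")
```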
Use the Nadeau-Bengio corrected variance by default. It's principled, easy to compute, and produces reasonable CIs. For critical applications, supplement with bootstrap CIs (covered next) for robustness.
Bootstrap methods provide a flexible, distribution-free approach to constructing confidence intervals. For CV, several bootstrap strategies are applicable.
Strategy 1: Bootstrap-CV (Outer Bootstrap)
Resample the entire dataset with replacement, then run CV on each bootstrap sample:
for b = 1 to B:
Sample D* from D with replacement
Run k-fold CV on D* to get CV*(b)
CI = [percentile(2.5%, CV*), percentile(97.5%, CV*)]
This captures variability from both the data and the CV partition.
```python
import numpy as np
from scipy import stats
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.utils import resample


def bootstrap_cv_ci(X, y, model_factory, n_bootstrap=200, k=10,
                    confidence=0.95, random_state=42):
    """
    Bootstrap confidence interval for CV estimate.

    Strategy: Resample dataset, run CV on resampled data.
    """
    np.random.seed(random_state)
    cv_means = []

    for b in range(n_bootstrap):
        # Bootstrap sample (with replacement)
        X_boot, y_boot = resample(X, y, random_state=random_state + b)

        # Run CV on bootstrap sample
        cv = StratifiedKFold(n_splits=k, shuffle=True,
                             random_state=random_state + b)
        try:
            scores = cross_val_score(model_factory(), X_boot, y_boot, cv=cv)
            cv_means.append(scores.mean())
        except Exception:
            # Skip if CV fails (e.g., class missing in fold)
            continue

    cv_means = np.array(cv_means)

    # Percentile method
    alpha = 1 - confidence
    ci_low = np.percentile(cv_means, 100 * alpha / 2)
    ci_high = np.percentile(cv_means, 100 * (1 - alpha / 2))

    return {
        'mean': np.mean(cv_means),
        'std': np.std(cv_means),
        'ci': (ci_low, ci_high),
        'bootstrap_distribution': cv_means
    }


def nested_bootstrap_cv_ci(X, y, model_factory, n_outer=100, n_inner=3,
                           k=10, confidence=0.95, random_state=42):
    """
    Nested bootstrap: outer bootstrap for data, inner repeats for CV.

    More robust but computationally expensive.
    """
    np.random.seed(random_state)
    all_means = []

    for outer in range(n_outer):
        # Bootstrap sample
        X_boot, y_boot = resample(X, y, random_state=random_state + outer)

        # Multiple CV runs on this bootstrap sample
        run_means = []
        for inner in range(n_inner):
            cv = StratifiedKFold(n_splits=k, shuffle=True,
                                 random_state=random_state + outer * n_inner + inner)
            try:
                scores = cross_val_score(model_factory(), X_boot, y_boot, cv=cv)
                run_means.append(scores.mean())
            except Exception:
                continue

        if run_means:
            all_means.append(np.mean(run_means))

    all_means = np.array(all_means)

    alpha = 1 - confidence
    ci_low = np.percentile(all_means, 100 * alpha / 2)
    ci_high = np.percentile(all_means, 100 * (1 - alpha / 2))

    return {
        'mean': np.mean(all_means),
        'std': np.std(all_means),
        'ci': (ci_low, ci_high),
        'bootstrap_distribution': all_means
    }


# Demo
np.random.seed(42)
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           random_state=42)

print("Bootstrap Methods for CV Confidence Intervals")
print("=" * 60)

# Simple bootstrap-CV
result_simple = bootstrap_cv_ci(
    X, y, lambda: RandomForestClassifier(n_estimators=50, random_state=42),
    n_bootstrap=200, k=10
)

print("Simple Bootstrap-CV (200 resamples):")
print(f"  Mean:   {result_simple['mean']:.4f}")
print(f"  Std:    {result_simple['std']:.4f}")
print(f"  95% CI: [{result_simple['ci'][0]:.4f}, {result_simple['ci'][1]:.4f}]")

# Nested bootstrap (more expensive)
result_nested = nested_bootstrap_cv_ci(
    X, y, lambda: RandomForestClassifier(n_estimators=50, random_state=42),
    n_outer=100, n_inner=3, k=10
)

print("\nNested Bootstrap-CV (100 outer × 3 inner):")
print(f"  Mean:   {result_nested['mean']:.4f}")
print(f"  Std:    {result_nested['std']:.4f}")
print(f"  95% CI: [{result_nested['ci'][0]:.4f}, {result_nested['ci'][1]:.4f}]")

# Compare with Nadeau-Bengio
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=42), X, y, cv=10
)
n_test = len(X) / 10
n_train = len(X) - n_test
correction = (1 / 10) + (n_test / n_train)
corrected_se = np.sqrt(correction * np.var(cv_scores, ddof=1))
t_crit = stats.t.ppf(0.975, 9)

print("\nNadeau-Bengio Corrected:")
print(f"  Mean:   {cv_scores.mean():.4f}")
print(f"  95% CI: [{cv_scores.mean() - t_crit*corrected_se:.4f}, "
      f"{cv_scores.mean() + t_crit*corrected_se:.4f}]")
```

Strategy 2: Bootstrap of Fold Scores
A simpler approach bootstraps the k fold scores directly:
for b = 1 to B:
Resample k scores from the k fold scores (with replacement)
Compute mean of resampled scores
CI = percentiles of bootstrap means
This is fast but doesn't capture data variability—only partition variability given the observed fold scores.
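A minimal sketch of this strategy is shown below; the `fold_scores` array is a placeholder standing in for the scores from a single CV run:

```python
import numpy as np

def bootstrap_fold_scores_ci(fold_scores, n_bootstrap=10_000, confidence=0.95, seed=0):
    """Percentile CI from resampling the k fold scores with replacement.

    Fast, but only reflects partition variability given the observed fold scores.
    """
    rng = np.random.default_rng(seed)
    k = len(fold_scores)
    boot_means = rng.choice(fold_scores, size=(n_bootstrap, k), replace=True).mean(axis=1)
    alpha = 1 - confidence
    return (np.percentile(boot_means, 100 * alpha / 2),
            np.percentile(boot_means, 100 * (1 - alpha / 2)))

# Illustrative fold scores from a hypothetical 10-fold run
fold_scores = np.array([0.86, 0.84, 0.90, 0.88, 0.85, 0.87, 0.89, 0.83, 0.88, 0.86])
print("Bootstrap-of-fold-scores 95% CI:", bootstrap_fold_scores_ci(fold_scores))
```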
Strategy 3: The .632+ Bootstrap
A sophisticated method that combines in-sample error with out-of-bootstrap error, applying a correction for optimism. More complex but can provide better estimates for small samples.
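Below is a condensed, illustrative sketch of the .632+ idea for classification error, which weights the apparent (training) error against the leave-one-out bootstrap error via a relative-overfitting-rate term. The helper name `dot632_plus_error`, the model choice, and all parameters are ours, not from this page; treat it as a starting point rather than a reference implementation.

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification


def dot632_plus_error(model, X, y, n_bootstrap=50, random_state=42):
    """Illustrative .632+ bootstrap estimate of classification error."""
    rng = np.random.RandomState(random_state)
    n = len(X)

    # Apparent (training) error: fit and evaluate on the full dataset
    fitted = clone(model).fit(X, y)
    err_app = np.mean(fitted.predict(X) != y)

    # Leave-one-out bootstrap error: each sample is scored only by models
    # whose bootstrap training sample did not contain it
    per_sample_errors = [[] for _ in range(n)]
    for _ in range(n_bootstrap):
        idx = rng.randint(0, n, n)                 # bootstrap indices
        oob = np.setdiff1d(np.arange(n), idx)      # out-of-bootstrap samples
        if len(oob) == 0:
            continue
        m = clone(model).fit(X[idx], y[idx])
        for i, pred in zip(oob, m.predict(X[oob])):
            per_sample_errors[i].append(pred != y[i])
    err_oob = np.mean([np.mean(e) for e in per_sample_errors if e])

    # No-information error rate (gamma)
    p_class = np.bincount(y) / n
    q_class = np.bincount(fitted.predict(X), minlength=len(p_class)) / n
    gamma = np.sum(p_class * (1 - q_class))

    # Relative overfitting rate and the .632+ weight
    err_oob_prime = min(err_oob, gamma)
    if err_oob_prime > err_app and gamma > err_app:
        R = (err_oob_prime - err_app) / (gamma - err_app)
    else:
        R = 0.0
    w = 0.632 / (1 - 0.368 * R)

    return (1 - w) * err_app + w * err_oob_prime


X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           random_state=42)
err = dot632_plus_error(DecisionTreeClassifier(random_state=0), X, y)
print(f".632+ error estimate: {err:.4f}  (accuracy ≈ {1 - err:.4f})")
```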
Choosing a Bootstrap Strategy:
| Method | Captures Data Variability | Captures CV Variability | Computational Cost |
|---|---|---|---|
| Bootstrap-CV | Yes | Yes | High (B × k fits) |
| Bootstrap fold scores | No | Partially | Low (B resamples) |
| Nested Bootstrap | Yes | Yes | Very High (B × r × k) |
| .632+ Bootstrap | Yes | Partially | Moderate (B fits) |
Bootstrap methods are more computationally expensive but make fewer assumptions. They're especially valuable when the distribution of CV estimates is non-normal (e.g., accuracy near 0 or 1). For routine use, Nadeau-Bengio corrected t-intervals are usually sufficient.
Repeated CV provides a natural basis for confidence intervals—we have r independent mean estimates from different partitions.
The Key Insight:
In repeated CV, the r repetition means are less correlated than the k fold scores within a single run. Each repetition uses a different random partition, providing some independence.
CI from Repetition Means:
Given r repetition means $\bar{E}_1, \bar{E}_2, ..., \bar{E}_r$:
$$\text{CI} = \bar{\bar{E}} \pm t_{\alpha/2, r-1} \cdot \frac{s_r}{\sqrt{r}}$$
where:
```python
import numpy as np
from scipy import stats
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification


def repeated_cv_with_proper_ci(X, y, model_factory, k=10, r=10,
                               confidence=0.95, random_state=42):
    """
    Run repeated k-fold CV and compute proper confidence intervals.

    Uses between-repetition variance, which is more appropriate
    than pooling all r×k scores.
    """
    rep_means = []
    all_scores = []

    for rep in range(r):
        cv = StratifiedKFold(n_splits=k, shuffle=True,
                             random_state=random_state + rep)
        scores = cross_val_score(model_factory(), X, y, cv=cv)
        rep_means.append(scores.mean())
        all_scores.extend(scores)

    rep_means = np.array(rep_means)
    all_scores = np.array(all_scores)

    # Grand mean
    grand_mean = rep_means.mean()

    # Method 1: CI from repetition means (more appropriate)
    se_reps = rep_means.std(ddof=1) / np.sqrt(r)
    t_crit = stats.t.ppf((1 + confidence) / 2, r - 1)
    ci_from_reps = (grand_mean - t_crit * se_reps,
                    grand_mean + t_crit * se_reps)

    # Method 2: Naive CI from all r×k scores (underestimates variance)
    se_naive = all_scores.std(ddof=1) / np.sqrt(r * k)
    t_crit_naive = stats.t.ppf((1 + confidence) / 2, r * k - 1)
    ci_naive = (grand_mean - t_crit_naive * se_naive,
                grand_mean + t_crit_naive * se_naive)

    # Method 3: Nadeau-Bengio corrected on pooled scores
    n = len(X)
    n_test = n / k
    n_train = n - n_test
    correction = (1 / (r * k)) + (n_test / n_train)
    se_nb = np.sqrt(correction * np.var(all_scores, ddof=1))
    ci_nb = (grand_mean - t_crit_naive * se_nb,
             grand_mean + t_crit_naive * se_nb)

    return {
        'grand_mean': grand_mean,
        'rep_means': rep_means,
        'rep_std': rep_means.std(ddof=1),
        'all_scores': all_scores,
        'ci_from_reps': ci_from_reps,
        'ci_naive': ci_naive,
        'ci_nb_corrected': ci_nb,
        'config': {'k': k, 'r': r}
    }


# Demo
np.random.seed(42)
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           random_state=42)

print("Confidence Intervals from Repeated CV")
print("=" * 60)

result = repeated_cv_with_proper_ci(
    X, y, lambda: RandomForestClassifier(n_estimators=50, random_state=42),
    k=10, r=10
)

print(f"Configuration: {result['config']['r']}×{result['config']['k']} CV")
print(f"Grand mean: {result['grand_mean']:.4f}")
print(f"Per-repetition std: {result['rep_std']:.4f}")
print()
print("Per-repetition means:")
for i, m in enumerate(result['rep_means']):
    print(f"  Rep {i+1}: {m:.4f}")
print()
print("95% Confidence Intervals:")
print(f"  From rep means (r={result['config']['r']}): "
      f"[{result['ci_from_reps'][0]:.4f}, {result['ci_from_reps'][1]:.4f}]")
print(f"  Naive (r×k={result['config']['r'] * result['config']['k']}): "
      f"[{result['ci_naive'][0]:.4f}, {result['ci_naive'][1]:.4f}]")
print(f"  NB Corrected: "
      f"[{result['ci_nb_corrected'][0]:.4f}, {result['ci_nb_corrected'][1]:.4f}]")
print()

# Compare CI widths
width_reps = result['ci_from_reps'][1] - result['ci_from_reps'][0]
width_naive = result['ci_naive'][1] - result['ci_naive'][0]
width_nb = result['ci_nb_corrected'][1] - result['ci_nb_corrected'][0]

print("CI Widths:")
print(f"  From rep means: {width_reps:.4f}")
print(f"  Naive:          {width_naive:.4f}")
print(f"  NB Corrected:   {width_nb:.4f}")
print()
print(f"Note: Naive CI is {width_reps/width_naive:.1f}x too narrow!")
```

Use the between-repetition variance (the variance of the r mean scores) to construct CIs. This appropriately captures partition-to-partition variability. With r=10 repetitions, you have 9 degrees of freedom—enough for reasonable t-intervals. This is often more reliable than corrections to pooled r×k scores.
Why Between-Repetition Variance Works:
Independence: Different repetitions use different random partitions—they're genuinely independent measurements.
Captures what matters: The repetition-to-repetition variance is exactly what we want to quantify: how much would our estimate change with a different partition?
No need to estimate ρ: We directly observe the variance we care about, rather than trying to correct for unknown correlations.
Caveat: With only r=5 repetitions, you have just 4 degrees of freedom. The t-distribution with 4 df has wide tails (t₀.₉₇₅,₄ = 2.78 vs t₀.₉₇₅,∞ = 1.96). This appropriately reflects our uncertainty but may give wider CIs than expected.
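A quick check of the critical values (using scipy.stats) shows how fast this penalty shrinks as r grows:

```python
from scipy import stats

# 97.5th percentile of the t-distribution for a few repetition counts
for r in [5, 10, 20, 50]:
    df = r - 1
    print(f"r={r:2d} (df={df:2d}): t_0.975 = {stats.t.ppf(0.975, df):.3f}")
print(f"Normal limit:       z_0.975 = {stats.norm.ppf(0.975):.3f}")
```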
Confidence intervals are frequently misinterpreted. Let's clarify what CV confidence intervals do and don't tell us.
Many practitioners believe: 'If CIs don't overlap, the difference is significant.' This is WRONG. Two 95% CIs can overlap substantially and still be significantly different. Conversely, non-overlapping CIs guarantee significance, but it's a very conservative test. Always use proper statistical tests for comparisons, not visual CI overlap.
What a CV CI Does Capture:
Sampling uncertainty: If we had different data from the same distribution, how different might our estimate be?
Partition uncertainty: If we used a different random partition, how different would the result be?
Finite-sample effects: With limited data, estimates are inherently noisy.
What a CV CI Does NOT Capture:
Bias: If CV systematically over- or under-estimates true performance, the CI doesn't reveal this.
Data quality issues: Mislabeled data, distribution shift, etc.
Model selection bias: If you chose this model after looking at many, the CI doesn't reflect that.
Extrapolation uncertainty: Performance on truly different data.
```python
import numpy as np
from scipy import stats
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.datasets import make_classification


def compare_models_properly(X, y, model_a_factory, model_b_factory,
                            model_a_name, model_b_name,
                            k=10, r=5, alpha=0.05, random_state=42):
    """
    Properly compare two models, avoiding the CI overlap fallacy.
    """
    # Run repeated CV with same splits for both models
    cv_results_a = []
    cv_results_b = []
    paired_diffs = []

    for rep in range(r):
        cv = StratifiedKFold(n_splits=k, shuffle=True,
                             random_state=random_state + rep)

        rep_diffs = []
        for train_idx, val_idx in cv.split(X, y):
            X_train, X_val = X[train_idx], X[val_idx]
            y_train, y_val = y[train_idx], y[val_idx]

            # Same split for both
            model_a = model_a_factory().fit(X_train, y_train)
            model_b = model_b_factory().fit(X_train, y_train)

            score_a = model_a.score(X_val, y_val)
            score_b = model_b.score(X_val, y_val)

            cv_results_a.append(score_a)
            cv_results_b.append(score_b)
            rep_diffs.append(score_a - score_b)

        paired_diffs.extend(rep_diffs)

    cv_results_a = np.array(cv_results_a)
    cv_results_b = np.array(cv_results_b)
    paired_diffs = np.array(paired_diffs)

    # Individual CIs (using between-rep variance)
    rep_means_a = cv_results_a.reshape(r, k).mean(axis=1)
    rep_means_b = cv_results_b.reshape(r, k).mean(axis=1)

    def ci_from_rep_means(rep_means):
        t_crit = stats.t.ppf(1 - alpha / 2, r - 1)
        se = rep_means.std(ddof=1) / np.sqrt(r)
        mean = rep_means.mean()
        return (mean - t_crit * se, mean + t_crit * se)

    ci_a = ci_from_rep_means(rep_means_a)
    ci_b = ci_from_rep_means(rep_means_b)

    # Check overlap
    overlap = not (ci_a[1] < ci_b[0] or ci_b[1] < ci_a[0])

    # Proper paired t-test on differences
    t_stat, p_value = stats.ttest_1samp(paired_diffs, 0)

    # Corrected t-test (Nadeau-Bengio style)
    n = len(X)
    n_test = n / k
    n_train = n - n_test
    var_correction = (1 / (r * k)) + (n_test / n_train)
    corrected_t = paired_diffs.mean() / np.sqrt(var_correction * paired_diffs.var())
    corrected_p = 2 * (1 - stats.t.cdf(abs(corrected_t), r * k - 1))

    print("Model Comparison Results")
    print("=" * 60)
    print(f"\n{model_a_name}:")
    print(f"  Mean:   {rep_means_a.mean():.4f}")
    print(f"  95% CI: [{ci_a[0]:.4f}, {ci_a[1]:.4f}]")
    print(f"\n{model_b_name}:")
    print(f"  Mean:   {rep_means_b.mean():.4f}")
    print(f"  95% CI: [{ci_b[0]:.4f}, {ci_b[1]:.4f}]")
    print(f"\nCIs overlap: {overlap}")
    print()
    print("Proper Statistical Tests:")
    print(f"  Mean difference (A - B):  {paired_diffs.mean():+.4f}")
    print(f"  Naive paired t-test:      p = {p_value:.4f}")
    print(f"  Corrected paired t-test:  p = {corrected_p:.4f}")
    print()

    if overlap and corrected_p < alpha:
        print("⚠️ CIs overlap BUT difference IS significant at α=0.05!")
        print("   This demonstrates the CI overlap fallacy.")
    elif not overlap and corrected_p >= alpha:
        print("⚠️ CIs don't overlap BUT difference is NOT significant!")
        print("   (This is rare but possible.)")
    elif corrected_p < alpha:
        winner = model_a_name if paired_diffs.mean() > 0 else model_b_name
        print(f"✓ {winner} is significantly better (p = {corrected_p:.4f})")
    else:
        print("✓ No significant difference between models")

    return {
        'ci_a': ci_a,
        'ci_b': ci_b,
        'overlap': overlap,
        'p_value': corrected_p,
        'mean_diff': paired_diffs.mean()
    }


# Demo
np.random.seed(42)
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           random_state=42)

result = compare_models_properly(
    X, y,
    lambda: RandomForestClassifier(n_estimators=100, random_state=42),
    lambda: GradientBoostingClassifier(n_estimators=100, random_state=42),
    "Random Forest", "Gradient Boosting"
)
```

Proper reporting of CV results is essential for reproducibility and honest communication of uncertainty. Here's a comprehensive guide.
Examples of Good Reporting:
✓ "Our model achieves 87.3% accuracy (95% CI: [85.1%, 89.5%] using Nadeau-Bengio corrected intervals) based on 10×5 repeated stratified k-fold CV on n=1000 samples (random_state=42)."
✓ "Mean F1 = 0.823 ± 0.018 (std across 10 repetitions), 95% CI = [0.810, 0.837], using 10-fold stratified CV with 10 repetitions."
Examples of Inadequate Reporting:
✗ "Accuracy: 87.3%" (no uncertainty)
✗ "Accuracy: 87.3% ± 2.1%" (±2.1% of what? std? SE? CI? which method?)
✗ "Cross-validated accuracy: 87.3%" (what kind of CV? how many folds?)
```python
import numpy as np
from scipy import stats
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from dataclasses import dataclass


@dataclass
class CVReport:
    """Complete CV reporting structure."""
    metric_name: str
    mean: float
    std: float
    se: float
    ci_low: float
    ci_high: float
    ci_method: str
    n_folds: int
    n_repeats: int
    n_samples: int
    random_state: int
    all_scores: np.ndarray
    rep_means: np.ndarray

    def summary(self) -> str:
        """Short summary for inline reporting."""
        return (f"{self.metric_name} = {self.mean:.3f} "
                f"(95% CI: [{self.ci_low:.3f}, {self.ci_high:.3f}])")

    def full_report(self) -> str:
        """Full publication-ready report."""
        lines = [
            "Cross-Validation Results",
            "=" * 50,
            f"Metric: {self.metric_name}",
            f"Sample size: n = {self.n_samples}",
            f"Configuration: {self.n_repeats}×{self.n_folds} repeated stratified k-fold",
            f"Random state: {self.random_state}",
            "",
            "Performance:",
            f"  Mean:               {self.mean:.4f}",
            f"  Standard deviation: {self.std:.4f}",
            f"  Standard error:     {self.se:.4f}",
            f"  95% CI:             [{self.ci_low:.4f}, {self.ci_high:.4f}]",
            f"  CI method:          {self.ci_method}",
            "",
            "Per-repetition means:",
        ]
        for i, m in enumerate(self.rep_means):
            lines.append(f"  Rep {i+1}: {m:.4f}")
        lines.extend([
            "",
            "For reproducibility, save all fold scores:",
            f"  Shape: ({self.n_repeats}, {self.n_folds})",
            f"  All scores: {self.all_scores.round(4).tolist()}"
        ])
        return "\n".join(lines)

    def latex_table_row(self) -> str:
        """LaTeX table row format."""
        return (f"{self.mean:.3f} & ({self.ci_low:.3f}, {self.ci_high:.3f}) & "
                f"{self.n_repeats}×{self.n_folds} & {self.n_samples}")


def create_cv_report(X, y, model_factory, model_name: str = "Model",
                     metric: str = 'accuracy', n_folds: int = 10,
                     n_repeats: int = 5, random_state: int = 42) -> CVReport:
    """
    Create a complete, publication-ready CV report.
    """
    cv = RepeatedStratifiedKFold(n_splits=n_folds, n_repeats=n_repeats,
                                 random_state=random_state)
    all_scores = cross_val_score(model_factory(), X, y, cv=cv, scoring=metric)

    # Reshape to (n_repeats, n_folds)
    scores_matrix = all_scores.reshape(n_repeats, n_folds)
    rep_means = scores_matrix.mean(axis=1)

    mean = all_scores.mean()
    std = all_scores.std()

    # Use between-repetition variance for CI
    se = rep_means.std(ddof=1) / np.sqrt(n_repeats)
    t_crit = stats.t.ppf(0.975, n_repeats - 1)
    ci_low = mean - t_crit * se
    ci_high = mean + t_crit * se

    return CVReport(
        metric_name=f"{model_name} {metric}",
        mean=mean,
        std=std,
        se=se,
        ci_low=ci_low,
        ci_high=ci_high,
        ci_method="Between-repetition t-interval",
        n_folds=n_folds,
        n_repeats=n_repeats,
        n_samples=len(X),
        random_state=random_state,
        all_scores=all_scores,
        rep_means=rep_means
    )


# Demo
np.random.seed(42)
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           random_state=42)

report = create_cv_report(
    X, y, lambda: RandomForestClassifier(n_estimators=100, random_state=42),
    model_name="Random Forest", metric='accuracy',
    n_folds=10, n_repeats=5, random_state=42
)

print(report.full_report())
print("\n" + "=" * 50)
print("Summary for inline use:")
print(report.summary())
print("\nLaTeX table row:")
print(report.latex_table_row())
```

Many ML venues now require proper uncertainty reporting. Check submission guidelines for specific requirements. Common expectations: (1) at least mean ± std, (2) repeated CV for rigorous venues, (3) statistical tests for model comparisons, (4) random seeds for reproducibility.
Even experienced practitioners make mistakes with CV confidence intervals. Here are the most common pitfalls and how to avoid them.
| Pitfall | Problem | Solution |
|---|---|---|
| Using naive SE = std/√k | Underestimates variance due to fold correlation | Use Nadeau-Bengio correction or between-rep variance |
| Reporting ± std instead of CI | Std is not a confidence interval | Convert to CI: mean ± t_crit × SE |
| Cherry-picking random seeds | Invalid inference | Pre-register seed or use repeated CV |
| Comparing CIs visually | CI overlap ≠ non-significance | Use proper paired tests |
| Ignoring degrees of freedom | Too narrow CIs with few folds | Use t-distribution with k-1 or r-1 df |
| Pooling r×k scores for SE | Ignores between-repetition variance | Use between-rep variance for SE |
| Not reporting method | Results not reproducible | Always state CI method used |
If you compare 10 models and report "best model is significantly better," remember: with 10 comparisons at α=0.05, you expect ~0.5 false positives by chance. Use appropriate corrections (Bonferroni, Holm, etc.) for multiple comparisons, or use methods designed for multiple comparisons (Nemenyi test, critical difference diagrams).
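As one option, a minimal sketch of the Holm step-down correction is shown below; the p-values are placeholders, not results from this page:

```python
import numpy as np

def holm_correction(p_values, alpha=0.05):
    """Holm step-down procedure: returns a boolean 'reject' decision per hypothesis."""
    p = np.asarray(p_values)
    order = np.argsort(p)
    m = len(p)
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):
        # Compare the rank-th smallest p-value to alpha / (m - rank)
        if p[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values also fail
    return reject

# Placeholder p-values from, say, 5 pairwise model comparisons
p_vals = [0.001, 0.020, 0.030, 0.040, 0.300]
print("Reject H0 (Holm, α=0.05):", holm_correction(p_vals))
```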
We've developed a comprehensive understanding of uncertainty quantification for cross-validation estimates. The essential takeaways: (1) fold scores are correlated, so the naive SE = s/√k understates uncertainty; (2) use the Nadeau-Bengio correction or the between-repetition variance from repeated CV for honest intervals; (3) bootstrap methods offer a distribution-free alternative at higher computational cost; (4) always report the CI method, CV configuration, sample size, and random seeds; and (5) compare models with proper paired tests, never by eyeballing CI overlap.
Module Complete:
You've now mastered k-fold cross-validation comprehensively—from the basic procedure through bias-variance tradeoffs, k selection, repeated CV, and rigorous uncertainty quantification. You can now run and interpret k-fold CV correctly, choose k deliberately, stabilize estimates with repeated CV, attach honest confidence intervals to your results, and compare models with appropriate statistical tests.
This foundation prepares you for advanced validation topics like stratified/group CV, time series validation, and nested CV for hyperparameter tuning.
Congratulations! You've completed the K-Fold Cross-Validation module. You now have world-class understanding of model evaluation methodology—knowledge that distinguishes practitioners who merely use ML from those who truly understand it.