Reporting a model's accuracy as "87.3%" without uncertainty quantification is incomplete and potentially misleading. Does this mean we're confident the true performance is between 87.2% and 87.4%? Or could it reasonably be anywhere from 82% to 92%?
Confidence intervals provide the answer. They transform point estimates into ranges that capture our uncertainty about the true value. For cross-validation estimates, constructing proper confidence intervals is surprisingly subtle—the standard formulas taught in statistics courses don't directly apply because CV fold estimates are correlated.
This page develops rigorous methods for uncertainty quantification in CV, ensuring your performance claims are both precise and honest.
By the end of this page, you will understand: (1) Why standard CI formulas underestimate CV uncertainty, (2) Corrected variance estimators that account for fold correlation, (3) Bootstrap methods for CV confidence intervals, (4) Proper interpretation of CV confidence intervals, and (5) Best practices for reporting uncertainty.
The naive approach computes a confidence interval from k fold scores as if they were independent observations:
$$\text{CI}_{\text{naive}} = \bar{E} \pm t_{\alpha/2,\, k-1} \cdot \frac{s}{\sqrt{k}}$$
where $\bar{E}$ is the mean of the k fold scores, $s$ is their sample standard deviation, and $t_{\alpha/2,\, k-1}$ is the critical value of the t-distribution with $k-1$ degrees of freedom.
The Problem: Fold Estimates Are Not Independent
In k-fold CV, each training set shares (k-2)/(k-1) of its samples with every other training set. For 10-fold CV, any two training sets share 8/9 ≈ 89% of their samples, so the resulting fold estimates are strongly positively correlated.
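To see this concretely, here is a quick sketch (not part of the original analysis) that measures the pairwise training-set overlap produced by scikit-learn's `KFold`; the sample size of 500 is an arbitrary choice for illustration:

```python
import numpy as np
from sklearn.model_selection import KFold

n, k = 500, 10
kfold = KFold(n_splits=k, shuffle=True, random_state=0)

# Collect the training indices of each fold (the content of X is irrelevant here)
train_sets = [set(train_idx) for train_idx, _ in kfold.split(np.zeros((n, 1)))]

# Pairwise overlap between training sets; theory says (k-2)/(k-1) ≈ 0.889 for k=10
overlaps = [len(train_sets[i] & train_sets[j]) / len(train_sets[i])
            for i in range(k) for j in range(i + 1, k)]

print(f"Mean pairwise training-set overlap: {np.mean(overlaps):.3f}")
print(f"Theoretical (k-2)/(k-1):            {(k - 2) / (k - 1):.3f}")
```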
When observations are positively correlated, the sample variance underestimates the true variance of the mean.
```python
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from scipy import stats


def demonstrate_naive_ci_failure(X, y, model_factory, n_simulations=200, k=10):
    """
    Show that naive CIs have incorrect coverage.

    If CIs are correct, ~95% should contain the 'true' value.
    We estimate 'true' value as the mean across many CV runs.
    """
    # Run many CV iterations to estimate the 'true' expected CV value
    all_cv_means = []
    for seed in range(n_simulations):
        kfold = KFold(n_splits=k, shuffle=True, random_state=seed)
        scores = cross_val_score(model_factory(), X, y, cv=kfold)
        all_cv_means.append(scores.mean())

    # Use the grand mean as our 'true value' estimate
    true_value = np.mean(all_cv_means)
    true_variance_of_mean = np.var(all_cv_means)

    print(f"Estimated 'true' CV expected value: {true_value:.4f}")
    print(f"Observed variance of CV means: {true_variance_of_mean:.6f}")
    print()

    # Now check coverage: for each CV run, does its naive CI contain true_value?
    naive_coverage = 0
    naive_ci_widths = []

    for seed in range(n_simulations):
        kfold = KFold(n_splits=k, shuffle=True, random_state=seed)
        scores = cross_val_score(model_factory(), X, y, cv=kfold)

        # Naive CI
        mean = scores.mean()
        se_naive = scores.std() / np.sqrt(k)
        t_crit = stats.t.ppf(0.975, k - 1)
        ci_low = mean - t_crit * se_naive
        ci_high = mean + t_crit * se_naive

        naive_ci_widths.append(ci_high - ci_low)

        if ci_low <= true_value <= ci_high:
            naive_coverage += 1

    naive_coverage_rate = naive_coverage / n_simulations
    avg_naive_ci_width = np.mean(naive_ci_widths)

    print("Naive CI Analysis:")
    print("  Expected coverage: 95.0%")
    print(f"  Actual coverage:   {naive_coverage_rate:.1%}")
    print(f"  Average CI width:  {avg_naive_ci_width:.4f}")

    if naive_coverage_rate < 0.90:
        print("  ⚠️ Coverage is too low! CIs are too narrow (overconfident)")
    elif naive_coverage_rate > 0.99:
        print("  ⚠️ Coverage is too high! CIs are too wide (overly conservative)")

    return {
        'true_value': true_value,
        'true_variance': true_variance_of_mean,
        'naive_coverage': naive_coverage_rate,
        'all_cv_means': all_cv_means
    }


# Generate data
np.random.seed(42)
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           random_state=42)

print("=" * 60)
print("Demonstrating Naive CI Failure")
print("=" * 60)

result = demonstrate_naive_ci_failure(
    X, y, lambda: RandomForestClassifier(n_estimators=100, random_state=42)
)
```

Naive CIs for CV typically have 70-85% actual coverage instead of the claimed 95%. This means published claims like "accuracy 87% ± 2%" often understate uncertainty by 2-3×; the true range should be considerably wider. Overconfident CIs lead to false claims of significant differences between models.
The Correlation Effect Mathematically:
For k correlated observations with variance $\sigma^2$ and pairwise correlation $\rho$:
$$\text{Var}[\bar{X}] = \frac{\sigma^2}{k}(1 + (k-1)\rho)$$
The naive formula assumes $\rho = 0$, giving $\text{Var}[\bar{X}] = \sigma^2/k$.
If $\rho = 0.5$ and $k = 10$, the true variance of the mean is $\frac{\sigma^2}{10}\left(1 + 9 \times 0.5\right) = 5.5 \cdot \frac{\sigma^2}{10}$, while the naive formula reports only $\sigma^2/10$.
The naive estimate is too small by a factor of 5.5, so the resulting CI is $\sqrt{5.5} \approx 2.3\times$ too narrow.
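A small simulation sketch can confirm this arithmetic: draw k = 10 equicorrelated observations with ρ = 0.5 many times and compare the empirical variance of their mean with both formulas (all values here are illustrative):

```python
import numpy as np

k, rho, sigma2 = 10, 0.5, 1.0
n_sim = 100_000

# Compound-symmetric covariance: sigma^2 on the diagonal, rho*sigma^2 off-diagonal
cov = sigma2 * (rho * np.ones((k, k)) + (1 - rho) * np.eye(k))
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(k), cov, size=n_sim)

empirical_var_of_mean = samples.mean(axis=1).var()
theoretical = sigma2 / k * (1 + (k - 1) * rho)   # 0.55
naive = sigma2 / k                               # 0.10

print(f"Empirical Var[mean]:     {empirical_var_of_mean:.4f}")
print(f"Theory σ²/k(1+(k-1)ρ):   {theoretical:.4f}")
print(f"Naive σ²/k:              {naive:.4f}")
```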
Several methods have been proposed to correct for the correlation between fold estimates. We'll examine the most important ones.
1. The Nadeau-Bengio Corrected Variance (2003):
For a k-fold CV estimate, add a correction factor:
$$\widehat{\text{Var}}_{\text{corrected}} = \left(\frac{1}{k} + \frac{n_{\text{test}}}{n_{\text{train}}}\right) \cdot \hat{\sigma}^2$$
where $n_{\text{test}}$ is the number of samples in each test fold, $n_{\text{train}}$ is the number of training samples per fold, and $\hat{\sigma}^2$ is the sample variance of the k fold scores.
This adds the ratio $n_{\text{test}}/n_{\text{train}}$ to account for training set overlap.
```python
import numpy as np
from scipy import stats
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification


def nadeau_bengio_corrected_ci(scores, n_samples, n_folds, confidence=0.95):
    """
    Compute corrected CI using Nadeau-Bengio method.

    Parameters
    ----------
    scores : array of k fold scores
    n_samples : total number of samples
    n_folds : k
    confidence : confidence level (default 0.95)

    Returns
    -------
    dict with mean, corrected SE, CI
    """
    k = n_folds
    n = n_samples
    n_test = n / k
    n_train = n - n_test

    # Sample variance of fold scores
    sample_var = np.var(scores, ddof=1)  # Use ddof=1 for unbiased estimate

    # Nadeau-Bengio correction
    correction_factor = (1 / k) + (n_test / n_train)
    corrected_var = correction_factor * sample_var
    corrected_se = np.sqrt(corrected_var)

    # Naive (uncorrected) for comparison
    naive_se = np.std(scores, ddof=1) / np.sqrt(k)

    # t-critical value
    # Degrees of freedom is debatable; k-1 is common
    t_crit = stats.t.ppf((1 + confidence) / 2, k - 1)

    mean = np.mean(scores)

    # CIs
    ci_corrected = (mean - t_crit * corrected_se, mean + t_crit * corrected_se)
    ci_naive = (mean - t_crit * naive_se, mean + t_crit * naive_se)

    return {
        'mean': mean,
        'naive_se': naive_se,
        'corrected_se': corrected_se,
        'ci_naive': ci_naive,
        'ci_corrected': ci_corrected,
        'correction_factor': correction_factor,
        'inflation_ratio': corrected_se / naive_se
    }


# Demo
np.random.seed(42)
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           random_state=42)
n = len(X)
k = 10

kfold = KFold(n_splits=k, shuffle=True, random_state=42)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X, y, cv=kfold
)

result = nadeau_bengio_corrected_ci(scores, n, k)

print("Nadeau-Bengio Corrected Confidence Interval")
print("=" * 60)
print(f"Mean accuracy: {result['mean']:.4f}")
print()
print("Standard Errors:")
print(f"  Naive SE:     {result['naive_se']:.4f}")
print(f"  Corrected SE: {result['corrected_se']:.4f}")
print(f"  Inflation:    {result['inflation_ratio']:.2f}x")
print()
print("95% Confidence Intervals:")
print(f"  Naive:     [{result['ci_naive'][0]:.4f}, {result['ci_naive'][1]:.4f}]")
print(f"  Corrected: [{result['ci_corrected'][0]:.4f}, {result['ci_corrected'][1]:.4f}]")
print()
print(f"CI width ratio: "
      f"{(result['ci_corrected'][1] - result['ci_corrected'][0]) / (result['ci_naive'][1] - result['ci_naive'][0]):.2f}x")

# Show correction factor for different k
print("\n" + "=" * 60)
print("Correction Factors for Different k (n=500 samples)")
print("=" * 60)

for k in [2, 3, 5, 10, 20, 50]:
    n_test = n / k
    n_train = n - n_test
    correction = (1 / k) + (n_test / n_train)
    naive = 1 / k
    inflation = np.sqrt(correction / naive)
    print(f"k={k:2d}: correction={correction:.4f}, "
          f"naive=1/{k}={naive:.4f}, "
          f"SE inflation={inflation:.2f}x")
```

2. The Bates-Granger Adjusted Variance:
An alternative formulation adjusts for the expected correlation:
$$\widehat{\text{Var}}_{\text{BG}} = \frac{\hat{\sigma}^2}{k}\left(1 + (k-1)\hat{\rho}\right)$$
where $\hat{\rho}$ is an estimated correlation between fold scores.
Estimating $\rho$ directly is difficult with only k observations. The Nadeau-Bengio approach avoids this by using the known training set overlap structure.
3. The Conservative Approach:
When in doubt, simply don't divide by $\sqrt{k}$:
$$\text{SE}_{\text{conservative}} = s$$ (the sample standard deviation itself)
This assumes perfect correlation ($\rho = 1$) and gives very wide CIs—but they're guaranteed to have at least nominal coverage.
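To make the three options concrete, the sketch below computes the naive, Nadeau-Bengio, and conservative standard errors for the same set of fold scores; the fold scores and the assumed n = 500 are made up for illustration:

```python
import numpy as np
from scipy import stats

# Illustrative fold scores from a hypothetical 10-fold CV run on n=500 samples
fold_scores = np.array([0.86, 0.84, 0.90, 0.88, 0.85, 0.87, 0.89, 0.83, 0.88, 0.86])
k, n = len(fold_scores), 500
n_test, n_train = n / k, n - n / k

s = fold_scores.std(ddof=1)
se_naive = s / np.sqrt(k)                              # assumes rho = 0
se_nb = np.sqrt((1 / k + n_test / n_train) * s**2)     # Nadeau-Bengio correction
se_conservative = s                                    # assumes rho = 1

t_crit = stats.t.ppf(0.975, k - 1)
m = fold_scores.mean()
for name, se in [("naive", se_naive), ("Nadeau-Bengio", se_nb),
                 ("conservative", se_conservative)]:
    print(f"{name:>14}: SE={se:.4f}, "
          f"95% CI=[{m - t_crit * se:.4f}, {m + t_crit * se:.4f}]")
```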
Use the Nadeau-Bengio corrected variance by default. It's principled, easy to compute, and produces reasonable CIs. For critical applications, supplement with bootstrap CIs (covered next) for robustness.
Bootstrap methods provide a flexible, distribution-free approach to constructing confidence intervals. For CV, several bootstrap strategies are applicable.
Strategy 1: Bootstrap-CV (Outer Bootstrap)
Resample the entire dataset with replacement, then run CV on each bootstrap sample:
for b = 1 to B:
Sample D* from D with replacement
Run k-fold CV on D* to get CV*(b)
CI = [percentile(2.5%, CV*), percentile(97.5%, CV*)]
This captures variability from both the data and the CV partition.
```python
import numpy as np
from scipy import stats
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.utils import resample


def bootstrap_cv_ci(X, y, model_factory, n_bootstrap=200, k=10,
                    confidence=0.95, random_state=42):
    """
    Bootstrap confidence interval for CV estimate.

    Strategy: Resample dataset, run CV on resampled data.
    """
    np.random.seed(random_state)
    cv_means = []

    for b in range(n_bootstrap):
        # Bootstrap sample (with replacement)
        X_boot, y_boot = resample(X, y, random_state=random_state + b)

        # Run CV on bootstrap sample
        cv = StratifiedKFold(n_splits=k, shuffle=True,
                             random_state=random_state + b)
        try:
            scores = cross_val_score(model_factory(), X_boot, y_boot, cv=cv)
            cv_means.append(scores.mean())
        except Exception:
            # Skip if CV fails (e.g., class missing in fold)
            continue

    cv_means = np.array(cv_means)

    # Percentile method
    alpha = 1 - confidence
    ci_low = np.percentile(cv_means, 100 * alpha / 2)
    ci_high = np.percentile(cv_means, 100 * (1 - alpha / 2))

    return {
        'mean': np.mean(cv_means),
        'std': np.std(cv_means),
        'ci': (ci_low, ci_high),
        'bootstrap_distribution': cv_means
    }


def nested_bootstrap_cv_ci(X, y, model_factory, n_outer=100, n_inner=3,
                           k=10, confidence=0.95, random_state=42):
    """
    Nested bootstrap: outer bootstrap for data, inner repeats for CV.

    More robust but computationally expensive.
    """
    np.random.seed(random_state)
    all_means = []

    for outer in range(n_outer):
        # Bootstrap sample
        X_boot, y_boot = resample(X, y, random_state=random_state + outer)

        # Multiple CV runs on this bootstrap sample
        run_means = []
        for inner in range(n_inner):
            cv = StratifiedKFold(n_splits=k, shuffle=True,
                                 random_state=random_state + outer * n_inner + inner)
            try:
                scores = cross_val_score(model_factory(), X_boot, y_boot, cv=cv)
                run_means.append(scores.mean())
            except Exception:
                continue

        if run_means:
            all_means.append(np.mean(run_means))

    all_means = np.array(all_means)

    alpha = 1 - confidence
    ci_low = np.percentile(all_means, 100 * alpha / 2)
    ci_high = np.percentile(all_means, 100 * (1 - alpha / 2))

    return {
        'mean': np.mean(all_means),
        'std': np.std(all_means),
        'ci': (ci_low, ci_high),
        'bootstrap_distribution': all_means
    }


# Demo
np.random.seed(42)
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           random_state=42)

print("Bootstrap Methods for CV Confidence Intervals")
print("=" * 60)

# Simple bootstrap-CV
result_simple = bootstrap_cv_ci(
    X, y, lambda: RandomForestClassifier(n_estimators=50, random_state=42),
    n_bootstrap=200, k=10
)

print("Simple Bootstrap-CV (200 resamples):")
print(f"  Mean:   {result_simple['mean']:.4f}")
print(f"  Std:    {result_simple['std']:.4f}")
print(f"  95% CI: [{result_simple['ci'][0]:.4f}, {result_simple['ci'][1]:.4f}]")

# Nested bootstrap (more expensive)
result_nested = nested_bootstrap_cv_ci(
    X, y, lambda: RandomForestClassifier(n_estimators=50, random_state=42),
    n_outer=100, n_inner=3, k=10
)

print("\nNested Bootstrap-CV (100 outer × 3 inner):")
print(f"  Mean:   {result_nested['mean']:.4f}")
print(f"  Std:    {result_nested['std']:.4f}")
print(f"  95% CI: [{result_nested['ci'][0]:.4f}, {result_nested['ci'][1]:.4f}]")

# Compare with Nadeau-Bengio
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=42), X, y, cv=10
)
n_test = len(X) / 10
n_train = len(X) - n_test
correction = (1 / 10) + (n_test / n_train)
corrected_se = np.sqrt(correction * np.var(cv_scores, ddof=1))
t_crit = stats.t.ppf(0.975, 9)

print("\nNadeau-Bengio Corrected:")
print(f"  Mean:   {cv_scores.mean():.4f}")
print(f"  95% CI: [{cv_scores.mean() - t_crit*corrected_se:.4f}, "
      f"{cv_scores.mean() + t_crit*corrected_se:.4f}]")
```

Strategy 2: Bootstrap of Fold Scores
A simpler approach bootstraps the k fold scores directly:
for b = 1 to B:
Resample k scores from the k fold scores (with replacement)
Compute mean of resampled scores
CI = percentiles of bootstrap means
This is fast but doesn't capture data variability—only partition variability given the observed fold scores.
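A minimal sketch of this strategy is shown below; the `fold_scores` array is a placeholder standing in for the scores from a single CV run:

```python
import numpy as np

def bootstrap_fold_scores_ci(fold_scores, n_bootstrap=10_000, confidence=0.95, seed=0):
    """Percentile CI from resampling the k fold scores with replacement.

    Fast, but only reflects partition variability given the observed fold scores.
    """
    rng = np.random.default_rng(seed)
    k = len(fold_scores)
    boot_means = rng.choice(fold_scores, size=(n_bootstrap, k), replace=True).mean(axis=1)
    alpha = 1 - confidence
    return (np.percentile(boot_means, 100 * alpha / 2),
            np.percentile(boot_means, 100 * (1 - alpha / 2)))

# Illustrative fold scores from a hypothetical 10-fold run
fold_scores = np.array([0.86, 0.84, 0.90, 0.88, 0.85, 0.87, 0.89, 0.83, 0.88, 0.86])
print("Bootstrap-of-fold-scores 95% CI:", bootstrap_fold_scores_ci(fold_scores))
```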
Strategy 3: The .632+ Bootstrap
A sophisticated method that combines in-sample error with out-of-bootstrap error, applying a correction for optimism. More complex but can provide better estimates for small samples.
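Below is a condensed, illustrative sketch of the .632+ idea for classification error, which weights the apparent (training) error against the leave-one-out bootstrap error via a relative-overfitting-rate term. The helper name `dot632_plus_error`, the model choice, and all parameters are ours, not from this page; treat it as a starting point rather than a reference implementation.

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification


def dot632_plus_error(model, X, y, n_bootstrap=50, random_state=42):
    """Illustrative .632+ bootstrap estimate of classification error."""
    rng = np.random.RandomState(random_state)
    n = len(X)

    # Apparent (training) error: fit and evaluate on the full dataset
    fitted = clone(model).fit(X, y)
    err_app = np.mean(fitted.predict(X) != y)

    # Leave-one-out bootstrap error: each sample is scored only by models
    # whose bootstrap training sample did not contain it
    per_sample_errors = [[] for _ in range(n)]
    for _ in range(n_bootstrap):
        idx = rng.randint(0, n, n)                 # bootstrap indices
        oob = np.setdiff1d(np.arange(n), idx)      # out-of-bootstrap samples
        if len(oob) == 0:
            continue
        m = clone(model).fit(X[idx], y[idx])
        for i, pred in zip(oob, m.predict(X[oob])):
            per_sample_errors[i].append(pred != y[i])
    err_oob = np.mean([np.mean(e) for e in per_sample_errors if e])

    # No-information error rate (gamma)
    p_class = np.bincount(y) / n
    q_class = np.bincount(fitted.predict(X), minlength=len(p_class)) / n
    gamma = np.sum(p_class * (1 - q_class))

    # Relative overfitting rate and the .632+ weight
    err_oob_prime = min(err_oob, gamma)
    if err_oob_prime > err_app and gamma > err_app:
        R = (err_oob_prime - err_app) / (gamma - err_app)
    else:
        R = 0.0
    w = 0.632 / (1 - 0.368 * R)

    return (1 - w) * err_app + w * err_oob_prime


X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           random_state=42)
err = dot632_plus_error(DecisionTreeClassifier(random_state=0), X, y)
print(f".632+ error estimate: {err:.4f}  (accuracy ≈ {1 - err:.4f})")
```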
Choosing a Bootstrap Strategy:
| Method | Captures Data Variability | Captures CV Variability | Computational Cost |
|---|---|---|---|
| Bootstrap-CV | Yes | Yes | High (B × k fits) |
| Bootstrap fold scores | No | Partially | Low (B resamples) |
| Nested Bootstrap | Yes | Yes | Very High (B × r × k) |
| .632+ Bootstrap | Yes | Partially | Moderate (B fits) |
Bootstrap methods are more computationally expensive but make fewer assumptions. They're especially valuable when the distribution of CV estimates is non-normal (e.g., accuracy near 0 or 1). For routine use, Nadeau-Bengio corrected t-intervals are usually sufficient.
Repeated CV provides a natural basis for confidence intervals—we have r independent mean estimates from different partitions.
The Key Insight:
In repeated CV, the r repetition means are less correlated than the k fold scores within a single run. Each repetition uses a different random partition, providing some independence.
CI from Repetition Means:
Given r repetition means $\bar{E}_1, \bar{E}_2, ..., \bar{E}_r$:
$$\text{CI} = \bar{\bar{E}} \pm t_{\alpha/2, r-1} \cdot \frac{s_r}{\sqrt{r}}$$
where:
```python
import numpy as np
from scipy import stats
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification


def repeated_cv_with_proper_ci(X, y, model_factory, k=10, r=10,
                               confidence=0.95, random_state=42):
    """
    Run repeated k-fold CV and compute proper confidence intervals.

    Uses between-repetition variance, which is more appropriate
    than pooling all r×k scores.
    """
    rep_means = []
    all_scores = []

    for rep in range(r):
        cv = StratifiedKFold(n_splits=k, shuffle=True,
                             random_state=random_state + rep)
        scores = cross_val_score(model_factory(), X, y, cv=cv)
        rep_means.append(scores.mean())
        all_scores.extend(scores)

    rep_means = np.array(rep_means)
    all_scores = np.array(all_scores)

    # Grand mean
    grand_mean = rep_means.mean()

    # Method 1: CI from repetition means (more appropriate)
    se_reps = rep_means.std(ddof=1) / np.sqrt(r)
    t_crit = stats.t.ppf((1 + confidence) / 2, r - 1)
    ci_from_reps = (grand_mean - t_crit * se_reps,
                    grand_mean + t_crit * se_reps)

    # Method 2: Naive CI from all r×k scores (underestimates variance)
    se_naive = all_scores.std(ddof=1) / np.sqrt(r * k)
    t_crit_naive = stats.t.ppf((1 + confidence) / 2, r * k - 1)
    ci_naive = (grand_mean - t_crit_naive * se_naive,
                grand_mean + t_crit_naive * se_naive)

    # Method 3: Nadeau-Bengio corrected on pooled scores
    n = len(X)
    n_test = n / k
    n_train = n - n_test
    correction = (1 / (r * k)) + (n_test / n_train)
    se_nb = np.sqrt(correction * np.var(all_scores, ddof=1))
    ci_nb = (grand_mean - t_crit_naive * se_nb,
             grand_mean + t_crit_naive * se_nb)

    return {
        'grand_mean': grand_mean,
        'rep_means': rep_means,
        'rep_std': rep_means.std(ddof=1),
        'all_scores': all_scores,
        'ci_from_reps': ci_from_reps,
        'ci_naive': ci_naive,
        'ci_nb_corrected': ci_nb,
        'config': {'k': k, 'r': r}
    }


# Demo
np.random.seed(42)
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           random_state=42)

print("Confidence Intervals from Repeated CV")
print("=" * 60)

result = repeated_cv_with_proper_ci(
    X, y, lambda: RandomForestClassifier(n_estimators=50, random_state=42),
    k=10, r=10
)

print(f"Configuration: {result['config']['r']}×{result['config']['k']} CV")
print(f"Grand mean: {result['grand_mean']:.4f}")
print(f"Per-repetition std: {result['rep_std']:.4f}")
print()
print("Per-repetition means:")
for i, m in enumerate(result['rep_means']):
    print(f"  Rep {i+1}: {m:.4f}")
print()
print("95% Confidence Intervals:")
print(f"  From rep means (r={result['config']['r']}): "
      f"[{result['ci_from_reps'][0]:.4f}, {result['ci_from_reps'][1]:.4f}]")
print(f"  Naive (r×k={result['config']['r'] * result['config']['k']}): "
      f"[{result['ci_naive'][0]:.4f}, {result['ci_naive'][1]:.4f}]")
print(f"  NB Corrected: "
      f"[{result['ci_nb_corrected'][0]:.4f}, {result['ci_nb_corrected'][1]:.4f}]")
print()

# Compare CI widths
width_reps = result['ci_from_reps'][1] - result['ci_from_reps'][0]
width_naive = result['ci_naive'][1] - result['ci_naive'][0]
width_nb = result['ci_nb_corrected'][1] - result['ci_nb_corrected'][0]

print("CI Widths:")
print(f"  From rep means: {width_reps:.4f}")
print(f"  Naive:          {width_naive:.4f}")
print(f"  NB Corrected:   {width_nb:.4f}")
print()
print(f"Note: Naive CI is {width_reps/width_naive:.1f}x too narrow!")
```

Use the between-repetition variance (the variance of the r mean scores) to construct CIs. This appropriately captures partition-to-partition variability. With r=10 repetitions, you have 9 degrees of freedom—enough for reasonable t-intervals. This is often more reliable than corrections to pooled r×k scores.
Why Between-Repetition Variance Works:
Independence: Different repetitions use different random partitions—they're genuinely independent measurements.
Captures what matters: The repetition-to-repetition variance is exactly what we want to quantify: how much would our estimate change with a different partition?
No need to estimate ρ: We directly observe the variance we care about, rather than trying to correct for unknown correlations.
Caveat: With only r=5 repetitions, you have just 4 degrees of freedom. The t-distribution with 4 df has wide tails (t₀.₉₇₅,₄ = 2.78 vs t₀.₉₇₅,∞ = 1.96). This appropriately reflects our uncertainty but may give wider CIs than expected.
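A quick check of the critical values (using scipy.stats) shows how fast this penalty shrinks as r grows:

```python
from scipy import stats

# 97.5th percentile of the t-distribution for a few repetition counts
for r in [5, 10, 20, 50]:
    df = r - 1
    print(f"r={r:2d} (df={df:2d}): t_0.975 = {stats.t.ppf(0.975, df):.3f}")
print(f"Normal limit:       z_0.975 = {stats.norm.ppf(0.975):.3f}")
```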
Confidence intervals are frequently misinterpreted. Let's clarify what CV confidence intervals do and don't tell us.
Many practitioners believe: 'If CIs don't overlap, the difference is significant.' This is WRONG. Two 95% CIs can overlap substantially and still be significantly different. Conversely, non-overlapping CIs guarantee significance, but it's a very conservative test. Always use proper statistical tests for comparisons, not visual CI overlap.
What a CV CI Does Capture:
Sampling uncertainty: If we had different data from the same distribution, how different might our estimate be?
Partition uncertainty: If we used a different random partition, how different would the result be?
Finite-sample effects: With limited data, estimates are inherently noisy.
What a CV CI Does NOT Capture:
Bias: If CV systematically over- or under-estimates true performance, the CI doesn't reveal this.
Data quality issues: Mislabeled data, distribution shift, etc.
Model selection bias: If you chose this model after looking at many, the CI doesn't reflect that.
Extrapolation uncertainty: Performance on truly different data.
```python
import numpy as np
from scipy import stats
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.datasets import make_classification


def compare_models_properly(X, y, model_a_factory, model_b_factory,
                            model_a_name, model_b_name,
                            k=10, r=5, alpha=0.05, random_state=42):
    """
    Properly compare two models, avoiding the CI overlap fallacy.
    """
    # Run repeated CV with same splits for both models
    cv_results_a = []
    cv_results_b = []
    paired_diffs = []

    for rep in range(r):
        cv = StratifiedKFold(n_splits=k, shuffle=True,
                             random_state=random_state + rep)

        rep_diffs = []
        for train_idx, val_idx in cv.split(X, y):
            X_train, X_val = X[train_idx], X[val_idx]
            y_train, y_val = y[train_idx], y[val_idx]

            # Same split for both
            model_a = model_a_factory().fit(X_train, y_train)
            model_b = model_b_factory().fit(X_train, y_train)

            score_a = model_a.score(X_val, y_val)
            score_b = model_b.score(X_val, y_val)

            cv_results_a.append(score_a)
            cv_results_b.append(score_b)
            rep_diffs.append(score_a - score_b)

        paired_diffs.extend(rep_diffs)

    cv_results_a = np.array(cv_results_a)
    cv_results_b = np.array(cv_results_b)
    paired_diffs = np.array(paired_diffs)

    # Individual CIs (using between-rep variance)
    rep_means_a = cv_results_a.reshape(r, k).mean(axis=1)
    rep_means_b = cv_results_b.reshape(r, k).mean(axis=1)

    def ci_from_rep_means(rep_means):
        t_crit = stats.t.ppf(1 - alpha / 2, r - 1)
        se = rep_means.std(ddof=1) / np.sqrt(r)
        mean = rep_means.mean()
        return (mean - t_crit * se, mean + t_crit * se)

    ci_a = ci_from_rep_means(rep_means_a)
    ci_b = ci_from_rep_means(rep_means_b)

    # Check overlap
    overlap = not (ci_a[1] < ci_b[0] or ci_b[1] < ci_a[0])

    # Proper paired t-test on differences
    t_stat, p_value = stats.ttest_1samp(paired_diffs, 0)

    # Corrected t-test (Nadeau-Bengio style)
    n = len(X)
    n_test = n / k
    n_train = n - n_test
    var_correction = (1 / (r * k)) + (n_test / n_train)
    corrected_t = paired_diffs.mean() / np.sqrt(var_correction * paired_diffs.var())
    corrected_p = 2 * (1 - stats.t.cdf(abs(corrected_t), r * k - 1))

    print("Model Comparison Results")
    print("=" * 60)
    print(f"\n{model_a_name}:")
    print(f"  Mean:   {rep_means_a.mean():.4f}")
    print(f"  95% CI: [{ci_a[0]:.4f}, {ci_a[1]:.4f}]")
    print(f"\n{model_b_name}:")
    print(f"  Mean:   {rep_means_b.mean():.4f}")
    print(f"  95% CI: [{ci_b[0]:.4f}, {ci_b[1]:.4f}]")
    print(f"\nCIs overlap: {overlap}")
    print()
    print("Proper Statistical Tests:")
    print(f"  Mean difference (A - B):  {paired_diffs.mean():+.4f}")
    print(f"  Naive paired t-test:      p = {p_value:.4f}")
    print(f"  Corrected paired t-test:  p = {corrected_p:.4f}")
    print()

    if overlap and corrected_p < alpha:
        print("⚠️ CIs overlap BUT difference IS significant at α=0.05!")
        print("   This demonstrates the CI overlap fallacy.")
    elif not overlap and corrected_p >= alpha:
        print("⚠️ CIs don't overlap BUT difference is NOT significant!")
        print("   (This is rare but possible.)")
    elif corrected_p < alpha:
        winner = model_a_name if paired_diffs.mean() > 0 else model_b_name
        print(f"✓ {winner} is significantly better (p = {corrected_p:.4f})")
    else:
        print("✓ No significant difference between models")

    return {
        'ci_a': ci_a,
        'ci_b': ci_b,
        'overlap': overlap,
        'p_value': corrected_p,
        'mean_diff': paired_diffs.mean()
    }


# Demo
np.random.seed(42)
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           random_state=42)

result = compare_models_properly(
    X, y,
    lambda: RandomForestClassifier(n_estimators=100, random_state=42),
    lambda: GradientBoostingClassifier(n_estimators=100, random_state=42),
    "Random Forest", "Gradient Boosting"
)
```

Proper reporting of CV results is essential for reproducibility and honest communication of uncertainty. Here's a comprehensive guide.
Examples of Good Reporting:
✓ "Our model achieves 87.3% accuracy (95% CI: [85.1%, 89.5%] using Nadeau-Bengio corrected intervals) based on 10×5 repeated stratified k-fold CV on n=1000 samples (random_state=42)."
✓ "Mean F1 = 0.823 ± 0.018 (std across 10 repetitions), 95% CI = [0.810, 0.837], using 10-fold stratified CV with 10 repetitions."
Examples of Inadequate Reporting:
✗ "Accuracy: 87.3%" (no uncertainty)
✗ "Accuracy: 87.3% ± 2.1%" (±2.1% of what? std? SE? CI? which method?)
✗ "Cross-validated accuracy: 87.3%" (what kind of CV? how many folds?)
```python
import numpy as np
from scipy import stats
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from dataclasses import dataclass


@dataclass
class CVReport:
    """Complete CV reporting structure."""
    metric_name: str
    mean: float
    std: float
    se: float
    ci_low: float
    ci_high: float
    ci_method: str
    n_folds: int
    n_repeats: int
    n_samples: int
    random_state: int
    all_scores: np.ndarray
    rep_means: np.ndarray

    def summary(self) -> str:
        """Short summary for inline reporting."""
        return (f"{self.metric_name} = {self.mean:.3f} "
                f"(95% CI: [{self.ci_low:.3f}, {self.ci_high:.3f}])")

    def full_report(self) -> str:
        """Full publication-ready report."""
        lines = [
            "Cross-Validation Results",
            "=" * 50,
            f"Metric: {self.metric_name}",
            f"Sample size: n = {self.n_samples}",
            f"Configuration: {self.n_repeats}×{self.n_folds} repeated stratified k-fold",
            f"Random state: {self.random_state}",
            "",
            "Performance:",
            f"  Mean:               {self.mean:.4f}",
            f"  Standard deviation: {self.std:.4f}",
            f"  Standard error:     {self.se:.4f}",
            f"  95% CI:             [{self.ci_low:.4f}, {self.ci_high:.4f}]",
            f"  CI method:          {self.ci_method}",
            "",
            "Per-repetition means:",
        ]
        for i, m in enumerate(self.rep_means):
            lines.append(f"  Rep {i+1}: {m:.4f}")
        lines.extend([
            "",
            "For reproducibility, save all fold scores:",
            f"  Shape: ({self.n_repeats}, {self.n_folds})",
            f"  All scores: {self.all_scores.round(4).tolist()}"
        ])
        return "\n".join(lines)

    def latex_table_row(self) -> str:
        """LaTeX table row format."""
        return (f"{self.mean:.3f} & ({self.ci_low:.3f}, {self.ci_high:.3f}) & "
                f"{self.n_repeats}×{self.n_folds} & {self.n_samples}")


def create_cv_report(X, y, model_factory, model_name: str = "Model",
                     metric: str = 'accuracy', n_folds: int = 10,
                     n_repeats: int = 5, random_state: int = 42) -> CVReport:
    """
    Create a complete, publication-ready CV report.
    """
    cv = RepeatedStratifiedKFold(n_splits=n_folds, n_repeats=n_repeats,
                                 random_state=random_state)
    all_scores = cross_val_score(model_factory(), X, y, cv=cv, scoring=metric)

    # Reshape to (n_repeats, n_folds)
    scores_matrix = all_scores.reshape(n_repeats, n_folds)
    rep_means = scores_matrix.mean(axis=1)

    mean = all_scores.mean()
    std = all_scores.std()

    # Use between-repetition variance for CI
    se = rep_means.std(ddof=1) / np.sqrt(n_repeats)
    t_crit = stats.t.ppf(0.975, n_repeats - 1)
    ci_low = mean - t_crit * se
    ci_high = mean + t_crit * se

    return CVReport(
        metric_name=f"{model_name} {metric}",
        mean=mean,
        std=std,
        se=se,
        ci_low=ci_low,
        ci_high=ci_high,
        ci_method="Between-repetition t-interval",
        n_folds=n_folds,
        n_repeats=n_repeats,
        n_samples=len(X),
        random_state=random_state,
        all_scores=all_scores,
        rep_means=rep_means
    )


# Demo
np.random.seed(42)
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           random_state=42)

report = create_cv_report(
    X, y, lambda: RandomForestClassifier(n_estimators=100, random_state=42),
    model_name="Random Forest", metric='accuracy',
    n_folds=10, n_repeats=5, random_state=42
)

print(report.full_report())
print("\n" + "=" * 50)
print("Summary for inline use:")
print(report.summary())
print("\nLaTeX table row:")
print(report.latex_table_row())
```

Many ML venues now require proper uncertainty reporting. Check submission guidelines for specific requirements. Common expectations: (1) at least mean ± std, (2) repeated CV for rigorous venues, (3) statistical tests for model comparisons, (4) random seeds for reproducibility.
Even experienced practitioners make mistakes with CV confidence intervals. Here are the most common pitfalls and how to avoid them.
| Pitfall | Problem | Solution |
|---|---|---|
| Using naive SE = std/√k | Underestimates variance due to fold correlation | Use Nadeau-Bengio correction or between-rep variance |
| Reporting ± std instead of CI | Std is not a confidence interval | Convert to CI: mean ± t_crit × SE |
| Cherry-picking random seeds | Invalid inference | Pre-register seed or use repeated CV |
| Comparing CIs visually | CI overlap ≠ non-significance | Use proper paired tests |
| Ignoring degrees of freedom | Too narrow CIs with few folds | Use t-distribution with k-1 or r-1 df |
| Pooling r×k scores for SE | Ignores between-repetition variance | Use between-rep variance for SE |
| Not reporting method | Results not reproducible | Always state CI method used |
If you compare 10 models and report "best model is significantly better," remember: with 10 comparisons at α=0.05, you expect ~0.5 false positives by chance. Use appropriate corrections (Bonferroni, Holm, etc.) for multiple comparisons, or use methods designed for multiple comparisons (Nemenyi test, critical difference diagrams).
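As one option, a minimal sketch of the Holm step-down correction is shown below; the p-values are placeholders, not results from this page:

```python
import numpy as np

def holm_correction(p_values, alpha=0.05):
    """Holm step-down procedure: returns a boolean 'reject' decision per hypothesis."""
    p = np.asarray(p_values)
    order = np.argsort(p)
    m = len(p)
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):
        # Compare the rank-th smallest p-value to alpha / (m - rank)
        if p[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values also fail
    return reject

# Placeholder p-values from, say, 5 pairwise model comparisons
p_vals = [0.001, 0.020, 0.030, 0.040, 0.300]
print("Reject H0 (Holm, α=0.05):", holm_correction(p_vals))
```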
We've developed a comprehensive understanding of uncertainty quantification for cross-validation estimates. The essential takeaways: (1) fold scores are correlated, so the naive SE = s/√k understates uncertainty; (2) use the Nadeau-Bengio correction or the between-repetition variance from repeated CV for honest intervals; (3) bootstrap methods offer a distribution-free alternative at higher computational cost; (4) always report the CI method, CV configuration, sample size, and random seeds; and (5) compare models with proper paired tests, never by eyeballing CI overlap.
Module Complete:
You've now mastered k-fold cross-validation comprehensively—from the basic procedure through bias-variance tradeoffs, k selection, repeated CV, and rigorous uncertainty quantification. You can now run and interpret k-fold CV correctly, choose k deliberately, stabilize estimates with repeated CV, attach honest confidence intervals to your results, and compare models with appropriate statistical tests.
This foundation prepares you for advanced validation topics like stratified/group CV, time series validation, and nested CV for hyperparameter tuning.
Congratulations! You've completed the K-Fold Cross-Validation module. You now have world-class understanding of model evaluation methodology—knowledge that distinguishes practitioners who merely use ML from those who truly understand it.