A subtle but important limitation of standard k-fold cross-validation is partition dependence: the CV estimate changes depending on how the data is randomly split into folds. Run 10-fold CV twice with different random seeds, and you'll get different results. Which result should you trust?
Repeated cross-validation addresses this by running the entire k-fold procedure multiple times, each with a different random partition, then aggregating results across all runs. This simple extension provides more stable estimates with proper variance quantification—making it the preferred choice for rigorous evaluation.
By the end of this page, you will understand: (1) Why single-run CV is inherently variable, (2) How repeated CV reduces variance through independent repetitions, (3) The variance decomposition into within-CV and between-CV components, (4) Optimal configurations for different scenarios, and (5) Best practices for implementation and reporting.
Let's first understand the problem that repeated CV solves.
Demonstrating Partition Dependence:
When you run k-fold CV with different random seeds, you get different results. This variation is not due to random model training—even deterministic models show this behavior. It's the partition itself that matters.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Generate data
np.random.seed(42)
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           random_state=42)

# Run 10-fold CV with many different random partitions
n_runs = 100
cv_means = []

for seed in range(n_runs):
    kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
    scores = cross_val_score(
        LogisticRegression(max_iter=1000, random_state=42),  # Fixed model seed
        X, y, cv=kfold, scoring='accuracy'
    )
    cv_means.append(scores.mean())

cv_means = np.array(cv_means)

# Analyze the distribution
print("Partition Dependence Analysis (10-fold CV, 100 runs)")
print("=" * 60)
print(f"Mean of CV estimates: {cv_means.mean():.4f}")
print(f"Std of CV estimates:  {cv_means.std():.4f}")
print(f"Range: [{cv_means.min():.4f}, {cv_means.max():.4f}]")
print(f"Span: {(cv_means.max() - cv_means.min()):.4f}")
print()
print("If we reported just one run, we might get anything from")
print(f"{cv_means.min():.4f} to {cv_means.max():.4f} - a "
      f"{(cv_means.max() - cv_means.min()) * 100:.1f}% point range!")

# Visualize
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.hist(cv_means, bins=20, edgecolor='black', alpha=0.7)
plt.axvline(cv_means.mean(), color='red', linestyle='--',
            label=f'Mean = {cv_means.mean():.4f}')
plt.xlabel('CV Accuracy')
plt.ylabel('Frequency')
plt.title('Distribution of 10-Fold CV Estimates\n(100 different partitions)')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(range(n_runs), cv_means, 'o-', markersize=3, alpha=0.7)
plt.axhline(cv_means.mean(), color='red', linestyle='--')
plt.fill_between(range(n_runs),
                 cv_means.mean() - cv_means.std(),
                 cv_means.mean() + cv_means.std(),
                 alpha=0.2, color='red')
plt.xlabel('Run Number (different random seed)')
plt.ylabel('CV Accuracy')
plt.title('CV Estimate Varies with Partition')
plt.tight_layout()

# Show that model randomness is not the issue
print("\n" + "=" * 60)
print("Confirming: Model randomness is NOT the source of variance")
print("=" * 60)

# Same partition, different model seeds
fixed_kfold = KFold(n_splits=10, shuffle=True, random_state=42)
model_var_scores = []
for model_seed in range(50):
    scores = cross_val_score(
        LogisticRegression(max_iter=1000, random_state=model_seed),
        X, y, cv=fixed_kfold
    )
    model_var_scores.append(scores.mean())

model_seed_std = np.std(model_var_scores)
print("Fixed partition, varying model seed:")
print(f"  Std of CV estimates: {model_seed_std:.6f}")
print("\nVarying partition, fixed model seed:")
print(f"  Std of CV estimates: {cv_means.std():.6f}")

# LogisticRegression is deterministic here, so the model-seed std can be
# exactly zero; guard the ratio to avoid dividing by zero.
if model_seed_std > 0:
    print(f"\nPartition variance is {cv_means.std() / model_seed_std:.1f}x larger!")
else:
    print("\nModel-seed variance is zero: all variation comes from the partition.")
```

If your 10-fold CV gives 85% accuracy with one random seed, someone else might get 83% or 87% with a different seed on the same data. Single-run CV results are not fully reproducible—the random partition is part of the 'experiment'. This is why serious evaluation requires repeated CV.
Why Partition Matters So Much:
The specific samples that end up in each fold affect the result in several ways:
Difficult samples: Some samples are inherently harder to classify. If they cluster in one fold's validation set, that fold's score drops.
Similar samples: Samples that are similar to each other should ideally be spread across folds. If they cluster, validation becomes easier (seeing similar training examples) or harder (not seeing them).
Class distribution: Even with stratification, small imbalances between folds introduce noise.
Outliers: Where outliers land affects the model and evaluation.
The solution: average over multiple partitions.
Repeated cross-validation (also called n×k CV or replicated CV) is straightforward: run k-fold CV multiple times, each with a different random partition, and aggregate results.
Formal Definition:
Let r be the number of repetitions and k be the number of folds. Repeated CV produces r × k individual fold scores.
Algorithm:
for i = 1 to r: # r repetitions
Create random k-fold partition_i
for j = 1 to k: # k folds per repetition
Train on all folds except j
Evaluate on fold j
Store score[i,j]
Final estimate = mean(all r × k scores)
Standard error = std(all r × k scores) / sqrt(r × k)
Common notation: '5×10 CV' means 5 repetitions of 10-fold CV, producing 50 individual fold evaluations. '10×5 CV' means 10 repetitions of 5-fold CV, also 50 evaluations. These are different! 5×10 trains each time on 90% of data; 10×5 trains on 80%.
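To make the distinction concrete, here is a minimal sketch using scikit-learn's `RepeatedStratifiedKFold` on a synthetic stand-in dataset: both configurations cost the same 50 model fits, but each fit sees a different fraction of the data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold

X, y = make_classification(n_samples=300, random_state=0)

configs = {
    "5×10 CV (5 repeats of 10-fold)": RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0),
    "10×5 CV (10 repeats of 5-fold)": RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0),
}

for name, cv in configs.items():
    # Both configurations run 50 fits; the training fraction per fit differs
    train_idx, _ = next(cv.split(X, y))
    print(f"{name}: {cv.get_n_splits(X, y)} fits, "
          f"~{len(train_idx) / len(X):.0%} of data per training set")
```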
```python
import numpy as np
from sklearn.model_selection import (
    cross_val_score, RepeatedKFold, RepeatedStratifiedKFold
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from typing import List, Tuple
import time


class RepeatedCVAnalysis:
    """
    Comprehensive repeated cross-validation analysis.
    """

    def __init__(self, n_splits: int = 10, n_repeats: int = 5,
                 random_state: int = 42):
        self.n_splits = n_splits
        self.n_repeats = n_repeats
        self.random_state = random_state

    def run_repeated_cv(self, X, y, model_factory, scoring='accuracy',
                        stratified=True, verbose=True):
        """
        Run repeated cross-validation with detailed analysis.
        """
        # Choose CV strategy
        if stratified:
            cv = RepeatedStratifiedKFold(
                n_splits=self.n_splits,
                n_repeats=self.n_repeats,
                random_state=self.random_state
            )
            cv_name = f"Repeated Stratified {self.n_splits}-Fold"
        else:
            cv = RepeatedKFold(
                n_splits=self.n_splits,
                n_repeats=self.n_repeats,
                random_state=self.random_state
            )
            cv_name = f"Repeated {self.n_splits}-Fold"

        # Run CV
        start = time.time()
        all_scores = cross_val_score(
            model_factory(), X, y, cv=cv, scoring=scoring
        )
        elapsed = time.time() - start

        # Reshape to (n_repeats, n_splits)
        scores_matrix = all_scores.reshape(self.n_repeats, self.n_splits)

        # Compute statistics
        results = self._compute_statistics(scores_matrix, all_scores)
        results['cv_name'] = cv_name
        results['time'] = elapsed
        results['total_folds'] = len(all_scores)

        if verbose:
            self._print_report(results)

        return results

    def _compute_statistics(self, scores_matrix, all_scores):
        """Compute comprehensive statistics."""
        # Overall statistics
        mean_score = all_scores.mean()
        std_score = all_scores.std()
        se_score = std_score / np.sqrt(len(all_scores))

        # Per-repetition means
        rep_means = scores_matrix.mean(axis=1)

        # Variance decomposition
        within_rep_var = scores_matrix.var(axis=1).mean()  # Avg variance within reps
        between_rep_var = rep_means.var()                  # Variance between rep means

        # Confidence intervals
        ci_95 = (mean_score - 1.96 * se_score, mean_score + 1.96 * se_score)

        return {
            'mean': mean_score,
            'std': std_score,
            'se': se_score,
            'ci_95': ci_95,
            'rep_means': rep_means,
            'within_rep_var': within_rep_var,
            'between_rep_var': between_rep_var,
            'scores_matrix': scores_matrix,
            'all_scores': all_scores
        }

    def _print_report(self, results):
        """Print detailed report."""
        print(f"\n{'='*60}")
        print(f"Repeated CV Results: {results['cv_name']}")
        print(f"{'='*60}")
        print(f"Configuration: {self.n_repeats} repeats × {self.n_splits} folds "
              f"= {results['total_folds']} evaluations")
        print(f"Time: {results['time']:.2f}s")
        print()
        print("Overall Statistics:")
        print(f"  Mean: {results['mean']:.4f}")
        print(f"  Std:  {results['std']:.4f}")
        print(f"  SE:   {results['se']:.4f}")
        print(f"  95% CI: [{results['ci_95'][0]:.4f}, {results['ci_95'][1]:.4f}]")
        print()
        print("Variance Decomposition:")
        print(f"  Within-repetition variance:  {results['within_rep_var']:.6f}")
        print(f"  Between-repetition variance: {results['between_rep_var']:.6f}")
        print()
        print("Per-Repetition Means:")
        for i, mean in enumerate(results['rep_means']):
            print(f"  Rep {i+1}: {mean:.4f}")


# Demo
np.random.seed(42)
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           random_state=42)

analyzer = RepeatedCVAnalysis(n_splits=10, n_repeats=5, random_state=42)
results = analyzer.run_repeated_cv(
    X, y,
    lambda: RandomForestClassifier(n_estimators=100, random_state=42)
)
```

Key Properties of Repeated CV:
More stable estimates: Averaging over r partitions reduces partition-dependent variance
Better variance estimation: With r×k scores, we can properly estimate variance
Same bias as single-run: Each repetition trains on (k-1)/k of data; averaging doesn't change this
r× computation cost: But trivially parallelizable—each repetition is independent
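Because each repetition is fully independent, the repetitions themselves can be farmed out to separate workers. Here is a minimal sketch using joblib on a synthetic dataset (the model, seeds, and worker setup are illustrative; passing `n_jobs` to scikit-learn's CV utilities achieves the same effect).

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

def one_repetition(seed):
    # Each repetition uses its own random partition and shares nothing with the others
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    return cross_val_score(model, X, y, cv=cv).mean()

# Run 5 repetitions on separate workers; results match a serial loop exactly
rep_means = Parallel(n_jobs=-1)(delayed(one_repetition)(seed) for seed in range(5))
print(f"Repeated CV estimate: {np.mean(rep_means):.4f} ± {np.std(rep_means):.4f}")
```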
Understanding the variance structure in repeated CV helps us interpret results and choose configurations.
Two Sources of Variance:
Within-repetition variance (σ²_within): The variance of fold scores within a single CV run. Reflects validation set sampling variability.
Between-repetition variance (σ²_between): The variance of mean scores across repetitions. Reflects partition-dependence.
Total Variance Decomposition:
$$\text{Var}[\text{score}] = \sigma^2_{\text{within}} + \sigma^2_{\text{between}}$$
For the mean across all r×k scores:
$$\text{Var}[\bar{\text{score}}] \approx \frac{\sigma^2_{\text{within}}}{r \cdot k} + \frac{\sigma^2_{\text{between}}}{r}$$
The between-repetition term (σ²_between/r) often dominates, and it shrinks only with more repetitions r, not with more folds k. This is why adding repetitions is so effective.
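Plugging hypothetical numbers into the formula above makes this concrete. The variance components below are made up for illustration, not measured from real data.

```python
import numpy as np

# Hypothetical variance components (illustrative values only)
var_within, var_between, k = 0.002, 0.0005, 10

for r in [1, 5, 10]:
    # Var[mean] ≈ within/(r*k) + between/r
    var_mean = var_within / (r * k) + var_between / r
    print(f"r={r:2d}: Var[mean] = {var_mean:.6f}, SE = {np.sqrt(var_mean):.4f}")
```

With r = 1 the between-partition term contributes most of the variance; increasing r shrinks both terms at once, whereas increasing k alone would only shrink the first.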
```python
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification


def analyze_variance_components(X, y, model_factory, k=10,
                                n_repetitions=30, random_state=42):
    """
    Decompose variance into within-fold and between-repetition components.
    """
    np.random.seed(random_state)

    all_fold_scores = []
    rep_means = []

    for rep in range(n_repetitions):
        # Different partition each repetition
        cv = StratifiedKFold(n_splits=k, shuffle=True,
                             random_state=random_state + rep)
        fold_scores = cross_val_score(
            model_factory(), X, y, cv=cv, scoring='accuracy'
        )
        all_fold_scores.append(fold_scores)
        rep_means.append(fold_scores.mean())

    all_fold_scores = np.array(all_fold_scores)  # Shape: (n_reps, k)
    rep_means = np.array(rep_means)

    # Variance decomposition
    # Within-repetition: average variance across folds within each rep
    within_variances = all_fold_scores.var(axis=1)
    avg_within_var = within_variances.mean()

    # Between-repetition: variance of repetition means
    between_var = rep_means.var()

    # Total variance of individual fold scores
    total_var = all_fold_scores.var()

    print("Variance Decomposition Analysis")
    print("=" * 60)
    print(f"Configuration: {n_repetitions} repetitions × {k} folds")
    print(f"Total fold evaluations: {n_repetitions * k}")
    print()
    print(f"Within-repetition variance (avg): {avg_within_var:.6f}")
    print(f"Between-repetition variance:      {between_var:.6f}")
    print(f"Total variance of fold scores:    {total_var:.6f}")
    print()
    print(f"Between/Total ratio: {between_var / total_var:.2%}")
    print(f"Within/Total ratio:  {avg_within_var / total_var:.2%}")
    print()

    # Effective variance reduction from averaging
    # Variance of mean over r repetitions
    var_of_mean_1_rep = all_fold_scores[0].var() / k
    var_of_mean_r_reps = total_var / (n_repetitions * k)

    print("Variance of the mean estimate:")
    print(f"  Single repetition (k={k}): {var_of_mean_1_rep:.8f}")
    print(f"  {n_repetitions} repetitions: {var_of_mean_r_reps:.8f}")
    print(f"  Variance reduction: {var_of_mean_1_rep / var_of_mean_r_reps:.1f}x")

    return {
        'within_var': avg_within_var,
        'between_var': between_var,
        'total_var': total_var,
        'all_scores': all_fold_scores,
        'rep_means': rep_means
    }


# Generate data
np.random.seed(42)
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           random_state=42)

# Analyze variance for different models
print("\n" + "=" * 70)
print("Random Forest (moderate stability)")
print("=" * 70)
rf_results = analyze_variance_components(
    X, y,
    lambda: RandomForestClassifier(n_estimators=100, random_state=42)
)

print("\n" + "=" * 70)
print("How variance decreases with more repetitions")
print("=" * 70)

for r in [1, 2, 5, 10, 20]:
    # Var[mean] ≈ between_var / r + within_var / (r * k), with k = 10 here
    var_of_mean = rf_results['between_var'] / r + rf_results['within_var'] / (r * 10)
    se = np.sqrt(var_of_mean)
    print(f"r={r:2d}: SE ≈ {se:.4f}, 95% CI width ≈ {3.92 * se:.4f}")
```

The between-repetition variance captures the 'partition luck' effect. If this is large, single-run CV is unreliable—your reported result is highly dependent on random seed choice. Repeated CV reveals this variability and averages it out.
Optimal Allocation: k vs r?
Given a fixed computational budget (total number of model fits = r × k), should we prefer more folds or more repetitions?
| Strategy | Model Fits | Bias | Variance Reduction |
|---|---|---|---|
| 10×5 CV | 50 | Moderate (80% train) | Addresses partition variance |
| 5×10 CV | 50 | Low (90% train) | Addresses partition variance |
| 1×50 CV | 50 | Very low (98% train) | Only within-partition variance |
Key insight: Increasing r (repetitions) reduces between-repetition variance, which is often the dominant component. Increasing k reduces bias but may increase correlated variance. Generally, 5×10 or 10×10 is preferred over 1×50.
Choosing the number of repetitions involves balancing precision against computational cost. Let's develop practical guidelines.
| Repetitions (r) | SE Reduction | 95% CI Width | Best For | Compute Cost |
|---|---|---|---|---|
| 1 | 1× | Full width | Quick experiments | k × training |
| 3 | ~1.7× | ~58% of single | Development iteration | 3k × training |
| 5 | ~2.2× | ~45% of single | Solid publication | 5k × training |
| 10 | ~3.2× | ~32% of single | Rigorous evaluation | 10k × training |
| 20 | ~4.5× | ~22% of single | Critical decisions | 20k × training |
Diminishing Returns:
The standard error decreases as 1/√r. This means halving the SE requires quadrupling the number of repetitions: going from r = 1 to r = 4 halves it, and you would need r = 16 to halve it again.
The cost, however, is linear. At some point, additional repetitions provide minimal SE reduction for substantial compute cost.
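A quick calculation makes the trade-off concrete: the SE shrinks as 1/√r while the compute cost grows linearly in r.

```python
import numpy as np

for r in [1, 3, 5, 10, 20, 50]:
    se_fraction = 1 / np.sqrt(r)  # SE relative to a single repetition
    print(f"r={r:2d}: SE = {se_fraction:.2f}× single-run, cost = {r}× single-run")
```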
Practical Guidelines:
r = 5: Good default for most purposes. Reduces SE by ~55%. Standard for publication.
r = 10: Use for competition submissions, important model comparisons, or when results are close.
r = 3: Minimum for meaningful variance estimation. Use when compute is constrained.
r = 1: Only for quick exploration. Never report single-run CV as final results.
Dietterich (1998) proposed 5×2 CV specifically for comparing two algorithms. Five repetitions of 2-fold CV provide 10 paired differences with specific statistical properties. The resulting t-test has an approximately correct Type I error rate, unlike naive tests on k-fold results.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification


def analyze_repetition_effect(X, y, model_factory, k=10, max_reps=30,
                              random_state=42):
    """
    Analyze how estimation quality improves with more repetitions.
    """
    # Run many repetitions
    all_scores = []
    for rep in range(max_reps):
        cv = RepeatedStratifiedKFold(
            n_splits=k, n_repeats=1, random_state=random_state + rep
        )
        scores = cross_val_score(model_factory(), X, y, cv=cv)
        all_scores.extend(scores)

    all_scores = np.array(all_scores).reshape(max_reps, k)

    # Analyze cumulative effect of adding repetitions
    results = []
    for r in range(1, max_reps + 1):
        scores_so_far = all_scores[:r].flatten()
        mean = scores_so_far.mean()
        se = scores_so_far.std() / np.sqrt(len(scores_so_far))
        ci_width = 3.92 * se  # 95% CI

        results.append({
            'r': r,
            'mean': mean,
            'se': se,
            'ci_width': ci_width
        })

    # Print key points
    print("Effect of Repetitions on Estimate Quality")
    print("=" * 60)
    print(f"{'Reps':>5} {'Mean':>8} {'SE':>8} {'CI Width':>10} {'vs r=1':>10}")
    print("-" * 60)

    baseline_ci = results[0]['ci_width']
    for r in [1, 2, 3, 5, 10, 15, 20, 30]:
        if r <= max_reps:
            res = results[r - 1]
            reduction = res['ci_width'] / baseline_ci
            print(f"{r:>5} {res['mean']:>8.4f} {res['se']:>8.4f} "
                  f"{res['ci_width']:>10.4f} {reduction:>10.1%}")

    return results


# Generate data
np.random.seed(42)
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           random_state=42)

print()
results = analyze_repetition_effect(
    X, y,
    lambda: RandomForestClassifier(n_estimators=100, random_state=42)
)

# Visualize convergence
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
reps = [r['r'] for r in results]
means = [r['mean'] for r in results]
ses = [r['se'] for r in results]

plt.plot(reps, means, 'b-', linewidth=2)
plt.fill_between(reps,
                 np.array(means) - 1.96 * np.array(ses),
                 np.array(means) + 1.96 * np.array(ses),
                 alpha=0.3)
plt.xlabel('Number of Repetitions')
plt.ylabel('Mean Accuracy')
plt.title('CV Estimate Stabilizes with More Repetitions')
plt.axhline(means[-1], color='red', linestyle='--', alpha=0.5)

plt.subplot(1, 2, 2)
plt.plot(reps, [r['ci_width'] for r in results], 'g-', linewidth=2)
plt.xlabel('Number of Repetitions')
plt.ylabel('95% CI Width')
plt.title('Confidence Interval Shrinks with Repetitions')

plt.tight_layout()


# Recommendation function
def recommend_repetitions(compute_budget, importance, model_stability='medium'):
    """
    Recommend number of repetitions based on constraints.
    """
    if importance == 'critical':
        base_r = 10
    elif importance == 'publication':
        base_r = 5
    elif importance == 'development':
        base_r = 3
    else:  # exploration
        base_r = 1

    # Adjust for model stability
    if model_stability == 'low':
        # Unstable models need more
        base_r = int(base_r * 1.5)
    elif model_stability == 'high':
        # Stable models need fewer
        base_r = max(1, int(base_r * 0.7))

    # Adjust for compute budget
    if compute_budget == 'low':
        base_r = min(base_r, 3)
    elif compute_budget == 'very_low':
        base_r = 1

    return base_r


print("\nRecommendation Examples:")
print("-" * 40)
for imp in ['exploration', 'development', 'publication', 'critical']:
    r = recommend_repetitions('medium', imp)
    print(f"Importance='{imp}': r={r}")
```

Implementing repeated CV correctly requires attention to several details. Let's cover best practices.
```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
from sklearn.base import clone
from dataclasses import dataclass
from typing import Dict, List, Any, Callable, Optional
import time
from joblib import Parallel, delayed


@dataclass
class RepeatedCVResult:
    """Container for repeated CV results."""
    mean: float
    std: float
    se: float
    ci_95: tuple
    all_scores: np.ndarray
    per_rep_means: np.ndarray
    config: Dict[str, Any]
    time_seconds: float

    def summary(self) -> str:
        return (f"{self.mean:.4f} ± {self.std:.4f} "
                f"(95% CI: [{self.ci_95[0]:.4f}, {self.ci_95[1]:.4f}])")

    def detailed_report(self) -> str:
        lines = [
            "=" * 60,
            "Repeated Cross-Validation Report",
            "=" * 60,
            f"Configuration: {self.config['n_repeats']}×{self.config['n_splits']} CV",
            f"Total evaluations: {len(self.all_scores)}",
            f"Time: {self.time_seconds:.2f}s",
            "",
            "Results:",
            f"  Mean: {self.mean:.4f}",
            f"  Std:  {self.std:.4f}",
            f"  SE:   {self.se:.4f}",
            f"  95% CI: [{self.ci_95[0]:.4f}, {self.ci_95[1]:.4f}]",
            "",
            "Per-Repetition Means:",
        ]
        for i, m in enumerate(self.per_rep_means):
            lines.append(f"  Rep {i+1}: {m:.4f}")
        return "\n".join(lines)


def run_repeated_cv(
    X: np.ndarray,
    y: np.ndarray,
    model_factory: Callable,
    n_splits: int = 10,
    n_repeats: int = 5,
    scoring: str = 'accuracy',
    random_state: int = 42,
    n_jobs: int = -1,
    return_estimators: bool = False
) -> RepeatedCVResult:
    """
    Production-quality repeated cross-validation.

    Parameters:
    -----------
    X : array-like of shape (n_samples, n_features)
    y : array-like of shape (n_samples,)
    model_factory : callable returning unfitted estimator
    n_splits : number of folds
    n_repeats : number of repetitions
    scoring : scoring metric
    random_state : random seed for reproducibility
    n_jobs : parallel jobs (-1 = all cores)
    return_estimators : whether to keep fitted models

    Returns:
    --------
    RepeatedCVResult with comprehensive results
    """
    cv = RepeatedStratifiedKFold(
        n_splits=n_splits,
        n_repeats=n_repeats,
        random_state=random_state
    )

    start_time = time.time()

    cv_results = cross_validate(
        model_factory(), X, y,
        cv=cv,
        scoring=scoring,
        return_train_score=True,
        return_estimator=return_estimators,
        n_jobs=n_jobs
    )

    elapsed = time.time() - start_time

    # Extract test scores
    all_scores = cv_results['test_score']

    # Reshape to (n_repeats, n_splits)
    scores_matrix = all_scores.reshape(n_repeats, n_splits)
    per_rep_means = scores_matrix.mean(axis=1)

    # Statistics
    mean = all_scores.mean()
    std = all_scores.std()
    se = std / np.sqrt(len(all_scores))
    ci_95 = (mean - 1.96 * se, mean + 1.96 * se)

    return RepeatedCVResult(
        mean=mean,
        std=std,
        se=se,
        ci_95=ci_95,
        all_scores=all_scores,
        per_rep_means=per_rep_means,
        config={
            'n_splits': n_splits,
            'n_repeats': n_repeats,
            'scoring': scoring,
            'random_state': random_state
        },
        time_seconds=elapsed
    )


# Example usage
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Generate data
np.random.seed(42)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           random_state=42)

# Run repeated CV
result = run_repeated_cv(
    X, y,
    lambda: RandomForestClassifier(n_estimators=100, random_state=42),
    n_splits=10, n_repeats=5, random_state=42
)

print(result.detailed_report())

# Compare models properly
print("\n" + "=" * 60)
print("Comparing Models with Repeated CV")
print("=" * 60)

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

models = [
    (lambda: LogisticRegression(max_iter=1000), "Logistic Regression"),
    (lambda: RandomForestClassifier(n_estimators=100), "Random Forest"),
    (lambda: GradientBoostingClassifier(n_estimators=100), "Gradient Boosting")
]

for model_factory, name in models:
    result = run_repeated_cv(X, y, model_factory, n_splits=10, n_repeats=5)
    print(f"{name:25s}: {result.summary()}")
```

Properly interpreting repeated CV results requires understanding what the statistics represent and their limitations.
What the Mean Represents:
The mean of r×k scores estimates the expected performance of a model trained on (k-1)/k of the data. This is still not quite the same as full-data performance due to the pessimistic bias we discussed. However, with k=10, we're estimating 90%-data performance, which is very close.
What the Standard Deviation Represents:
The standard deviation across r×k scores reflects total variability from validation-set sampling (which samples land in each fold), the choice of random partition, and any randomness in model training.
This is NOT the same as the expected variability you'd see with new test data. It's variability in the CV procedure itself.
What the Confidence Interval Means:
The 95% CI says: if we repeated this entire r×k CV procedure many times, 95% of the resulting mean estimates would fall within this interval. It's a confidence interval for the CV estimate of performance, which itself is an estimate of true performance.
The CI does NOT mean: '95% of new predictions will have accuracy in this range.' It means: 'We are 95% confident that the true expected CV estimate (and approximately true performance) lies in this range.' These are different statistical statements.
Comparing Models:
When comparing models A and B:
✓ Valid approach: Use the same CV splits for both models and compare paired differences.
✗ Invalid approach: Run independent CV for each and compare means.
Why pairing matters:
Some folds are inherently easier or harder. Both models benefit/suffer from the same folds. Paired comparison cancels this fold-specific noise.
| Fold | Model A | Model B | A - B |
|---|---|---|---|
| 1 | 0.85 | 0.83 | +0.02 |
| 2 | 0.92 | 0.89 | +0.03 |
| 3 | 0.78 | 0.76 | +0.02 |
The difference A-B is remarkably stable, even though individual scores vary widely. Paired tests leverage this stability.
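Using the three fold scores from the table above, a short calculation shows the effect: the individual scores spread over several points, while the paired differences barely move.

```python
import numpy as np

# Fold scores from the table above
scores_a = np.array([0.85, 0.92, 0.78])
scores_b = np.array([0.83, 0.89, 0.76])
diffs = scores_a - scores_b  # [0.02, 0.03, 0.02]

print(f"Std of Model A scores:     {scores_a.std():.4f}")  # ~0.057
print(f"Std of Model B scores:     {scores_b.std():.4f}")  # ~0.053
print(f"Std of paired differences: {diffs.std():.4f}")     # ~0.005
```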
```python
import numpy as np
from scipy import stats
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.datasets import make_classification


def paired_cv_comparison(X, y, model_a_factory, model_b_factory,
                         n_splits=10, n_repeats=5, random_state=42):
    """
    Properly compare two models using paired repeated CV.
    """
    np.random.seed(random_state)

    all_diffs = []

    for rep in range(n_repeats):
        cv = StratifiedKFold(n_splits=n_splits, shuffle=True,
                             random_state=random_state + rep)

        for train_idx, val_idx in cv.split(X, y):
            X_train, X_val = X[train_idx], X[val_idx]
            y_train, y_val = y[train_idx], y[val_idx]

            # Same train/val split for both models
            model_a = model_a_factory()
            model_b = model_b_factory()

            model_a.fit(X_train, y_train)
            model_b.fit(X_train, y_train)

            score_a = model_a.score(X_val, y_val)
            score_b = model_b.score(X_val, y_val)

            all_diffs.append(score_a - score_b)

    all_diffs = np.array(all_diffs)

    # Statistics
    mean_diff = all_diffs.mean()
    std_diff = all_diffs.std()
    se_diff = std_diff / np.sqrt(len(all_diffs))

    # One-sample t-test: is mean_diff significantly different from 0?
    t_stat, p_value = stats.ttest_1samp(all_diffs, 0)

    # Corrected t-test (Nadeau & Bengio, 2003)
    # Accounts for CV fold correlation
    n = len(X)
    n_test = n // n_splits
    n_train = n - n_test
    corrected_variance = (
        all_diffs.var() * (1 / len(all_diffs) + n_test / n_train)
    )
    corrected_t = mean_diff / np.sqrt(corrected_variance)
    df = len(all_diffs) - 1
    corrected_p = 2 * (1 - stats.t.cdf(abs(corrected_t), df))

    print("Paired CV Comparison Results")
    print("=" * 60)
    print(f"Configuration: {n_repeats}×{n_splits} CV = {len(all_diffs)} pairs")
    print()
    print(f"Mean difference (A - B): {mean_diff:+.4f}")
    print(f"Std of differences:      {std_diff:.4f}")
    print(f"95% CI for difference:   "
          f"[{mean_diff - 1.96 * se_diff:+.4f}, {mean_diff + 1.96 * se_diff:+.4f}]")
    print()
    print(f"Naive t-test:     t = {t_stat:.3f}, p = {p_value:.4f}")
    print(f"Corrected t-test: t = {corrected_t:.3f}, p = {corrected_p:.4f}")
    print()

    if corrected_p < 0.05:
        winner = "Model A" if mean_diff > 0 else "Model B"
        print(f"Conclusion: {winner} is significantly better (α=0.05)")
    else:
        print("Conclusion: No significant difference between models")

    return {
        'mean_diff': mean_diff,
        'std_diff': std_diff,
        'all_diffs': all_diffs,
        'naive_p': p_value,
        'corrected_p': corrected_p
    }


# Demo
np.random.seed(42)
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           random_state=42)

result = paired_cv_comparison(
    X, y,
    lambda: RandomForestClassifier(n_estimators=100, random_state=42),
    lambda: GradientBoostingClassifier(n_estimators=100, random_state=42),
    n_splits=10, n_repeats=5
)
```

Beyond standard repeated k-fold, several specialized configurations serve specific purposes.
```python
import numpy as np
from sklearn.model_selection import (
    RepeatedStratifiedKFold, ShuffleSplit, cross_val_score
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification


def compare_cv_strategies(X, y, model_factory, n_iterations=50):
    """
    Compare different repeated CV strategies.
    """
    strategies = [
        # Standard repeated 10-fold (5 repeats of 10-fold = 5×10 CV)
        ("5×10 Repeated KFold",
         RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=42)),

        # 5×2 CV for statistical testing
        ("5×2 CV (for t-test)",
         RepeatedStratifiedKFold(n_splits=2, n_repeats=5, random_state=42)),

        # Monte Carlo (Shuffle-Split)
        ("Monte Carlo (50 × 20%)",
         ShuffleSplit(n_splits=50, test_size=0.2, random_state=42)),

        # More folds, fewer repeats (2 repeats of 20-fold = 2×20 CV)
        ("2×20 Repeated KFold",
         RepeatedStratifiedKFold(n_splits=20, n_repeats=2, random_state=42)),
    ]

    print("Comparison of CV Strategies")
    print("=" * 70)
    print(f"{'Strategy':<25} {'Mean':>8} {'Std':>8} {'SE':>8} {'Evals':>8}")
    print("-" * 70)

    for name, cv in strategies:
        scores = cross_val_score(model_factory(), X, y, cv=cv)
        print(f"{name:<25} {scores.mean():>8.4f} {scores.std():>8.4f} "
              f"{scores.std() / np.sqrt(len(scores)):>8.4f} {len(scores):>8}")

    return strategies


# Generate data
np.random.seed(42)
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           random_state=42)

strategies = compare_cv_strategies(
    X, y,
    lambda: RandomForestClassifier(n_estimators=100, random_state=42)
)


# 5×2 CV t-test implementation
from scipy import stats


def five_times_two_cv_test(X, y, model_a_factory, model_b_factory,
                           random_state=42):
    """
    Dietterich's 5×2 CV paired t-test.

    Designed to have approximately correct Type I error rate.
    """
    first_fold_diffs = []   # Difference on the first fold of each repetition
    differences_sq_sum = 0

    for rep in range(5):
        # Create 2-fold split
        np.random.seed(random_state + rep)
        indices = np.random.permutation(len(X))
        mid = len(X) // 2
        idx1, idx2 = indices[:mid], indices[mid:]

        X1, y1 = X[idx1], y[idx1]
        X2, y2 = X[idx2], y[idx2]

        # Fold 1: train on 1, test on 2
        model_a = model_a_factory().fit(X1, y1)
        model_b = model_b_factory().fit(X1, y1)
        p1_a = model_a.score(X2, y2)
        p1_b = model_b.score(X2, y2)

        # Fold 2: train on 2, test on 1
        model_a = model_a_factory().fit(X2, y2)
        model_b = model_b_factory().fit(X2, y2)
        p2_a = model_a.score(X1, y1)
        p2_b = model_b.score(X1, y1)

        d1 = p1_a - p1_b
        d2 = p2_a - p2_b
        first_fold_diffs.append(d1)

        # Variance estimate for this repetition
        d_mean = (d1 + d2) / 2
        s_sq = (d1 - d_mean)**2 + (d2 - d_mean)**2
        differences_sq_sum += s_sq

    # 5×2 CV t-statistic: first-repetition difference over pooled variance
    p1 = first_fold_diffs[0]
    t_stat = p1 / np.sqrt(differences_sq_sum / 5)

    # Approximation: t-distribution with 5 degrees of freedom
    p_value = 2 * (1 - stats.t.cdf(abs(t_stat), 5))

    print("5×2 CV Paired t-test Results")
    print("=" * 50)
    print(f"t-statistic: {t_stat:.4f}")
    print(f"p-value: {p_value:.4f}")
    print(f"Significant at α=0.05: {'Yes' if p_value < 0.05 else 'No'}")

    return {'t_stat': t_stat, 'p_value': p_value}


print("\n" + "=" * 70)
print("5×2 CV Paired t-test (Dietterich)")
print("=" * 70)
from sklearn.ensemble import GradientBoostingClassifier

result = five_times_two_cv_test(
    X, y,
    lambda: RandomForestClassifier(n_estimators=100, random_state=None),
    lambda: GradientBoostingClassifier(n_estimators=100, random_state=None)
)
```

Use 5×2 CV for formal statistical comparison of two algorithms. Use 5×10 repeated k-fold for general evaluation and reporting. Use Monte Carlo (Shuffle-Split) when you need flexible train/test ratios or very many iterations for stability analysis.
We've thoroughly explored repeated cross-validation as a method for obtaining more reliable performance estimates. The key insights: single-run CV is partition-dependent, so its result depends on the random seed; repeating the procedure over r independent partitions averages out this 'partition luck'; total variance splits into within-repetition and between-repetition components, and the between term shrinks only with more repetitions; 5×10 CV is a sensible default, with more repetitions for critical comparisons; and model comparisons should use paired differences on shared splits with a corrected test.
What's Next:
With solid understanding of repeated CV, we now turn to confidence intervals for CV estimates. While we've computed approximate CIs using the standard formula, proper confidence intervals for CV require careful treatment of the correlation structure between fold estimates. The next page develops rigorous methods for quantifying uncertainty in CV-based performance claims.
You now have a comprehensive understanding of repeated cross-validation—why it's necessary, how it works, what configurations to use, and how to implement it properly. Your model evaluations will be more reliable and your reported results more defensible.