A subtle but important limitation of standard k-fold cross-validation is partition dependence: the CV estimate changes depending on how the data is randomly split into folds. Run 10-fold CV twice with different random seeds, and you'll get different results. Which result should you trust?
Repeated cross-validation addresses this by running the entire k-fold procedure multiple times, each with a different random partition, then aggregating results across all runs. This simple extension provides more stable estimates with proper variance quantification—making it the preferred choice for rigorous evaluation.
By the end of this page, you will understand: (1) Why single-run CV is inherently variable, (2) How repeated CV reduces variance through independent repetitions, (3) The variance decomposition into within-CV and between-CV components, (4) Optimal configurations for different scenarios, and (5) Best practices for implementation and reporting.
Let's first understand the problem that repeated CV solves.
Demonstrating Partition Dependence:
When you run k-fold CV with different random seeds, you get different results. This variation is not due to random model training—even deterministic models show this behavior. It's the partition itself that matters.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Generate data
np.random.seed(42)
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           random_state=42)

# Run 10-fold CV with many different random partitions
n_runs = 100
cv_means = []

for seed in range(n_runs):
    kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
    scores = cross_val_score(
        LogisticRegression(max_iter=1000, random_state=42),  # Fixed model seed
        X, y, cv=kfold, scoring='accuracy'
    )
    cv_means.append(scores.mean())

cv_means = np.array(cv_means)

# Analyze the distribution
print("Partition Dependence Analysis (10-fold CV, 100 runs)")
print("=" * 60)
print(f"Mean of CV estimates: {cv_means.mean():.4f}")
print(f"Std of CV estimates:  {cv_means.std():.4f}")
print(f"Range: [{cv_means.min():.4f}, {cv_means.max():.4f}]")
print(f"Span: {(cv_means.max() - cv_means.min()):.4f}")
print()
print("If we reported just one run, we might get anything from")
print(f"{cv_means.min():.4f} to {cv_means.max():.4f} - a "
      f"{(cv_means.max() - cv_means.min()) * 100:.1f}% point range!")

# Visualize
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.hist(cv_means, bins=20, edgecolor='black', alpha=0.7)
plt.axvline(cv_means.mean(), color='red', linestyle='--',
            label=f'Mean = {cv_means.mean():.4f}')
plt.xlabel('CV Accuracy')
plt.ylabel('Frequency')
plt.title('Distribution of 10-Fold CV Estimates\n(100 different partitions)')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(range(n_runs), cv_means, 'o-', markersize=3, alpha=0.7)
plt.axhline(cv_means.mean(), color='red', linestyle='--')
plt.fill_between(range(n_runs),
                 cv_means.mean() - cv_means.std(),
                 cv_means.mean() + cv_means.std(),
                 alpha=0.2, color='red')
plt.xlabel('Run Number (different random seed)')
plt.ylabel('CV Accuracy')
plt.title('CV Estimate Varies with Partition')
plt.tight_layout()

# Show that model randomness is not the issue
print("\n" + "=" * 60)
print("Confirming: Model randomness is NOT the source of variance")
print("=" * 60)

# Same partition, different model seeds
fixed_kfold = KFold(n_splits=10, shuffle=True, random_state=42)
model_var_scores = []
for model_seed in range(50):
    scores = cross_val_score(
        LogisticRegression(max_iter=1000, random_state=model_seed),
        X, y, cv=fixed_kfold
    )
    model_var_scores.append(scores.mean())

model_seed_std = np.std(model_var_scores)
print("Fixed partition, varying model seed:")
print(f"  Std of CV estimates: {model_seed_std:.6f}")
print("\nVarying partition, fixed model seed:")
print(f"  Std of CV estimates: {cv_means.std():.6f}")

# LogisticRegression is deterministic here, so the model-seed std can be
# exactly zero; guard the ratio to avoid dividing by zero.
if model_seed_std > 0:
    print(f"\nPartition variance is {cv_means.std() / model_seed_std:.1f}x larger!")
else:
    print("\nModel-seed variance is zero: all variation comes from the partition.")
```

If your 10-fold CV gives 85% accuracy with one random seed, someone else might get 83% or 87% with a different seed on the same data. Single-run CV results are not fully reproducible—the random partition is part of the 'experiment'. This is why serious evaluation requires repeated CV.
Why Partition Matters So Much:
The specific samples that end up in each fold affect the result in several ways:
Difficult samples: Some samples are inherently harder to classify. If they cluster in one fold's validation set, that fold's score drops.
Similar samples: Samples that are similar to each other should ideally be spread across folds. If they cluster, validation becomes easier (seeing similar training examples) or harder (not seeing them).
Class distribution: Even with stratification, small imbalances between folds introduce noise.
Outliers: Where outliers land affects the model and evaluation.
The solution: average over multiple partitions.
Repeated cross-validation (also called n×k CV or replicated CV) is straightforward: run k-fold CV multiple times, each with a different random partition, and aggregate results.
Formal Definition:
Let r be the number of repetitions and k be the number of folds. Repeated CV produces r × k individual fold scores.
Algorithm:
for i = 1 to r: # r repetitions
Create random k-fold partition_i
for j = 1 to k: # k folds per repetition
Train on all folds except j
Evaluate on fold j
Store score[i,j]
Final estimate = mean(all r × k scores)
Standard error = std(all r × k scores) / sqrt(r × k)
Common notation: '5×10 CV' means 5 repetitions of 10-fold CV, producing 50 individual fold evaluations. '10×5 CV' means 10 repetitions of 5-fold CV, also 50 evaluations. These are different! 5×10 trains each time on 90% of data; 10×5 trains on 80%.
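To make the distinction concrete, here is a minimal sketch using scikit-learn's `RepeatedStratifiedKFold` on a synthetic stand-in dataset: both configurations cost the same 50 model fits, but each fit sees a different fraction of the data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold

X, y = make_classification(n_samples=300, random_state=0)

configs = {
    "5×10 CV (5 repeats of 10-fold)": RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0),
    "10×5 CV (10 repeats of 5-fold)": RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0),
}

for name, cv in configs.items():
    # Both configurations run 50 fits; the training fraction per fit differs
    train_idx, _ = next(cv.split(X, y))
    print(f"{name}: {cv.get_n_splits(X, y)} fits, "
          f"~{len(train_idx) / len(X):.0%} of data per training set")
```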
```python
import numpy as np
from sklearn.model_selection import (
    cross_val_score, RepeatedKFold, RepeatedStratifiedKFold
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from typing import List, Tuple
import time


class RepeatedCVAnalysis:
    """
    Comprehensive repeated cross-validation analysis.
    """

    def __init__(self, n_splits: int = 10, n_repeats: int = 5,
                 random_state: int = 42):
        self.n_splits = n_splits
        self.n_repeats = n_repeats
        self.random_state = random_state

    def run_repeated_cv(self, X, y, model_factory, scoring='accuracy',
                        stratified=True, verbose=True):
        """
        Run repeated cross-validation with detailed analysis.
        """
        # Choose CV strategy
        if stratified:
            cv = RepeatedStratifiedKFold(
                n_splits=self.n_splits,
                n_repeats=self.n_repeats,
                random_state=self.random_state
            )
            cv_name = f"Repeated Stratified {self.n_splits}-Fold"
        else:
            cv = RepeatedKFold(
                n_splits=self.n_splits,
                n_repeats=self.n_repeats,
                random_state=self.random_state
            )
            cv_name = f"Repeated {self.n_splits}-Fold"

        # Run CV
        start = time.time()
        all_scores = cross_val_score(
            model_factory(), X, y, cv=cv, scoring=scoring
        )
        elapsed = time.time() - start

        # Reshape to (n_repeats, n_splits)
        scores_matrix = all_scores.reshape(self.n_repeats, self.n_splits)

        # Compute statistics
        results = self._compute_statistics(scores_matrix, all_scores)
        results['cv_name'] = cv_name
        results['time'] = elapsed
        results['total_folds'] = len(all_scores)

        if verbose:
            self._print_report(results)

        return results

    def _compute_statistics(self, scores_matrix, all_scores):
        """Compute comprehensive statistics."""
        # Overall statistics
        mean_score = all_scores.mean()
        std_score = all_scores.std()
        se_score = std_score / np.sqrt(len(all_scores))

        # Per-repetition means
        rep_means = scores_matrix.mean(axis=1)

        # Variance decomposition
        within_rep_var = scores_matrix.var(axis=1).mean()  # Avg variance within reps
        between_rep_var = rep_means.var()                  # Variance between rep means

        # Confidence intervals
        ci_95 = (mean_score - 1.96 * se_score, mean_score + 1.96 * se_score)

        return {
            'mean': mean_score,
            'std': std_score,
            'se': se_score,
            'ci_95': ci_95,
            'rep_means': rep_means,
            'within_rep_var': within_rep_var,
            'between_rep_var': between_rep_var,
            'scores_matrix': scores_matrix,
            'all_scores': all_scores
        }

    def _print_report(self, results):
        """Print detailed report."""
        print(f"\n{'='*60}")
        print(f"Repeated CV Results: {results['cv_name']}")
        print(f"{'='*60}")
        print(f"Configuration: {self.n_repeats} repeats × {self.n_splits} folds "
              f"= {results['total_folds']} evaluations")
        print(f"Time: {results['time']:.2f}s")
        print()
        print("Overall Statistics:")
        print(f"  Mean: {results['mean']:.4f}")
        print(f"  Std:  {results['std']:.4f}")
        print(f"  SE:   {results['se']:.4f}")
        print(f"  95% CI: [{results['ci_95'][0]:.4f}, {results['ci_95'][1]:.4f}]")
        print()
        print("Variance Decomposition:")
        print(f"  Within-repetition variance:  {results['within_rep_var']:.6f}")
        print(f"  Between-repetition variance: {results['between_rep_var']:.6f}")
        print()
        print("Per-Repetition Means:")
        for i, mean in enumerate(results['rep_means']):
            print(f"  Rep {i+1}: {mean:.4f}")


# Demo
np.random.seed(42)
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           random_state=42)

analyzer = RepeatedCVAnalysis(n_splits=10, n_repeats=5, random_state=42)
results = analyzer.run_repeated_cv(
    X, y,
    lambda: RandomForestClassifier(n_estimators=100, random_state=42)
)
```

Key Properties of Repeated CV:
More stable estimates: Averaging over r partitions reduces partition-dependent variance
Better variance estimation: With r×k scores, we can properly estimate variance
Same bias as single-run: Each repetition trains on (k-1)/k of data; averaging doesn't change this
r× computation cost: But trivially parallelizable—each repetition is independent
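Because each repetition is fully independent, the repetitions themselves can be farmed out to separate workers. Here is a minimal sketch using joblib on a synthetic dataset (the model, seeds, and worker setup are illustrative; passing `n_jobs` to scikit-learn's CV utilities achieves the same effect).

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

def one_repetition(seed):
    # Each repetition uses its own random partition and shares nothing with the others
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    return cross_val_score(model, X, y, cv=cv).mean()

# Run 5 repetitions on separate workers; results match a serial loop exactly
rep_means = Parallel(n_jobs=-1)(delayed(one_repetition)(seed) for seed in range(5))
print(f"Repeated CV estimate: {np.mean(rep_means):.4f} ± {np.std(rep_means):.4f}")
```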
Understanding the variance structure in repeated CV helps us interpret results and choose configurations.
Two Sources of Variance:
Within-repetition variance (σ²_within): The variance of fold scores within a single CV run. Reflects validation set sampling variability.
Between-repetition variance (σ²_between): The variance of mean scores across repetitions. Reflects partition-dependence.
Total Variance Decomposition:
$$\text{Var}[\text{score}] = \sigma^2_{\text{within}} + \sigma^2_{\text{between}}$$
For the mean across all r×k scores:
$$\text{Var}[\bar{\text{score}}] \approx \frac{\sigma^2_{\text{within}}}{r \cdot k} + \frac{\sigma^2_{\text{between}}}{r}$$
The between-repetition term (σ²_between/r) often dominates, and it shrinks only with more repetitions r, not with more folds k. This is why adding repetitions is so effective.
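Plugging hypothetical numbers into the formula above makes this concrete. The variance components below are made up for illustration, not measured from real data.

```python
import numpy as np

# Hypothetical variance components (illustrative values only)
var_within, var_between, k = 0.002, 0.0005, 10

for r in [1, 5, 10]:
    # Var[mean] ≈ within/(r*k) + between/r
    var_mean = var_within / (r * k) + var_between / r
    print(f"r={r:2d}: Var[mean] = {var_mean:.6f}, SE = {np.sqrt(var_mean):.4f}")
```

With r = 1 the between-partition term contributes most of the variance; increasing r shrinks both terms at once, whereas increasing k alone would only shrink the first.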
```python
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification


def analyze_variance_components(X, y, model_factory, k=10,
                                n_repetitions=30, random_state=42):
    """
    Decompose variance into within-fold and between-repetition components.
    """
    np.random.seed(random_state)

    all_fold_scores = []
    rep_means = []

    for rep in range(n_repetitions):
        # Different partition each repetition
        cv = StratifiedKFold(n_splits=k, shuffle=True,
                             random_state=random_state + rep)
        fold_scores = cross_val_score(
            model_factory(), X, y, cv=cv, scoring='accuracy'
        )
        all_fold_scores.append(fold_scores)
        rep_means.append(fold_scores.mean())

    all_fold_scores = np.array(all_fold_scores)  # Shape: (n_reps, k)
    rep_means = np.array(rep_means)

    # Variance decomposition
    # Within-repetition: average variance across folds within each rep
    within_variances = all_fold_scores.var(axis=1)
    avg_within_var = within_variances.mean()

    # Between-repetition: variance of repetition means
    between_var = rep_means.var()

    # Total variance of individual fold scores
    total_var = all_fold_scores.var()

    print("Variance Decomposition Analysis")
    print("=" * 60)
    print(f"Configuration: {n_repetitions} repetitions × {k} folds")
    print(f"Total fold evaluations: {n_repetitions * k}")
    print()
    print(f"Within-repetition variance (avg): {avg_within_var:.6f}")
    print(f"Between-repetition variance:      {between_var:.6f}")
    print(f"Total variance of fold scores:    {total_var:.6f}")
    print()
    print(f"Between/Total ratio: {between_var / total_var:.2%}")
    print(f"Within/Total ratio:  {avg_within_var / total_var:.2%}")
    print()

    # Effective variance reduction from averaging
    # Variance of mean over r repetitions
    var_of_mean_1_rep = all_fold_scores[0].var() / k
    var_of_mean_r_reps = total_var / (n_repetitions * k)

    print("Variance of the mean estimate:")
    print(f"  Single repetition (k={k}): {var_of_mean_1_rep:.8f}")
    print(f"  {n_repetitions} repetitions: {var_of_mean_r_reps:.8f}")
    print(f"  Variance reduction: {var_of_mean_1_rep / var_of_mean_r_reps:.1f}x")

    return {
        'within_var': avg_within_var,
        'between_var': between_var,
        'total_var': total_var,
        'all_scores': all_fold_scores,
        'rep_means': rep_means
    }


# Generate data
np.random.seed(42)
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           random_state=42)

# Analyze variance for different models
print("\n" + "=" * 70)
print("Random Forest (moderate stability)")
print("=" * 70)
rf_results = analyze_variance_components(
    X, y,
    lambda: RandomForestClassifier(n_estimators=100, random_state=42)
)

print("\n" + "=" * 70)
print("How variance decreases with more repetitions")
print("=" * 70)

for r in [1, 2, 5, 10, 20]:
    # Var[mean] ≈ between_var / r + within_var / (r * k), with k = 10 here
    var_of_mean = rf_results['between_var'] / r + rf_results['within_var'] / (r * 10)
    se = np.sqrt(var_of_mean)
    print(f"r={r:2d}: SE ≈ {se:.4f}, 95% CI width ≈ {3.92 * se:.4f}")
```

The between-repetition variance captures the 'partition luck' effect. If this is large, single-run CV is unreliable—your reported result is highly dependent on random seed choice. Repeated CV reveals this variability and averages it out.
Optimal Allocation: k vs r?
Given a fixed computational budget (total number of model fits = r × k), should we prefer more folds or more repetitions?
| Strategy | Model Fits | Bias | Variance Reduction |
|---|---|---|---|
| 10×5 CV | 50 | Moderate (80% train) | Addresses partition variance |
| 5×10 CV | 50 | Low (90% train) | Addresses partition variance |
| 1×50 CV | 50 | Very low (98% train) | Only within-partition variance |
Key insight: Increasing r (repetitions) reduces between-repetition variance, which is often the dominant component. Increasing k reduces bias but may increase correlated variance. Generally, 5×10 or 10×10 is preferred over 1×50.
Choosing the number of repetitions involves balancing precision against computational cost. Let's develop practical guidelines.
| Repetitions (r) | SE Reduction | 95% CI Width | Best For | Compute Cost |
|---|---|---|---|---|
| 1 | 1× | Full width | Quick experiments | k × training |
| 3 | ~1.7× | ~58% of single | Development iteration | 3k × training |
| 5 | ~2.2× | ~45% of single | Solid publication | 5k × training |
| 10 | ~3.2× | ~32% of single | Rigorous evaluation | 10k × training |
| 20 | ~4.5× | ~22% of single | Critical decisions | 20k × training |
Diminishing Returns:
The standard error decreases as 1/√r. This means halving the SE requires quadrupling the number of repetitions: going from r = 1 to r = 4 halves it, and you would need r = 16 to halve it again.
The cost, however, is linear. At some point, additional repetitions provide minimal SE reduction for substantial compute cost.
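A quick calculation makes the trade-off concrete: the SE shrinks as 1/√r while the compute cost grows linearly in r.

```python
import numpy as np

for r in [1, 3, 5, 10, 20, 50]:
    se_fraction = 1 / np.sqrt(r)  # SE relative to a single repetition
    print(f"r={r:2d}: SE = {se_fraction:.2f}× single-run, cost = {r}× single-run")
```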
Practical Guidelines:
r = 5: Good default for most purposes. Reduces SE by ~55%. Standard for publication.
r = 10: Use for competition submissions, important model comparisons, or when results are close.
r = 3: Minimum for meaningful variance estimation. Use when compute is constrained.
r = 1: Only for quick exploration. Never report single-run CV as final results.
Dietterich (1998) proposed 5×2 CV specifically for comparing two algorithms. Five repetitions of 2-fold CV provide 10 paired differences with specific statistical properties. The resulting t-test has an approximately correct Type I error rate, unlike naive tests on k-fold results.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification


def analyze_repetition_effect(X, y, model_factory, k=10, max_reps=30,
                              random_state=42):
    """
    Analyze how estimation quality improves with more repetitions.
    """
    # Run many repetitions
    all_scores = []
    for rep in range(max_reps):
        cv = RepeatedStratifiedKFold(
            n_splits=k, n_repeats=1, random_state=random_state + rep
        )
        scores = cross_val_score(model_factory(), X, y, cv=cv)
        all_scores.extend(scores)

    all_scores = np.array(all_scores).reshape(max_reps, k)

    # Analyze cumulative effect of adding repetitions
    results = []
    for r in range(1, max_reps + 1):
        scores_so_far = all_scores[:r].flatten()
        mean = scores_so_far.mean()
        se = scores_so_far.std() / np.sqrt(len(scores_so_far))
        ci_width = 3.92 * se  # 95% CI

        results.append({
            'r': r,
            'mean': mean,
            'se': se,
            'ci_width': ci_width
        })

    # Print key points
    print("Effect of Repetitions on Estimate Quality")
    print("=" * 60)
    print(f"{'Reps':>5} {'Mean':>8} {'SE':>8} {'CI Width':>10} {'vs r=1':>10}")
    print("-" * 60)

    baseline_ci = results[0]['ci_width']
    for r in [1, 2, 3, 5, 10, 15, 20, 30]:
        if r <= max_reps:
            res = results[r - 1]
            reduction = res['ci_width'] / baseline_ci
            print(f"{r:>5} {res['mean']:>8.4f} {res['se']:>8.4f} "
                  f"{res['ci_width']:>10.4f} {reduction:>10.1%}")

    return results


# Generate data
np.random.seed(42)
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           random_state=42)

print()
results = analyze_repetition_effect(
    X, y,
    lambda: RandomForestClassifier(n_estimators=100, random_state=42)
)

# Visualize convergence
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
reps = [r['r'] for r in results]
means = [r['mean'] for r in results]
ses = [r['se'] for r in results]

plt.plot(reps, means, 'b-', linewidth=2)
plt.fill_between(reps,
                 np.array(means) - 1.96 * np.array(ses),
                 np.array(means) + 1.96 * np.array(ses),
                 alpha=0.3)
plt.xlabel('Number of Repetitions')
plt.ylabel('Mean Accuracy')
plt.title('CV Estimate Stabilizes with More Repetitions')
plt.axhline(means[-1], color='red', linestyle='--', alpha=0.5)

plt.subplot(1, 2, 2)
plt.plot(reps, [r['ci_width'] for r in results], 'g-', linewidth=2)
plt.xlabel('Number of Repetitions')
plt.ylabel('95% CI Width')
plt.title('Confidence Interval Shrinks with Repetitions')

plt.tight_layout()


# Recommendation function
def recommend_repetitions(compute_budget, importance, model_stability='medium'):
    """
    Recommend number of repetitions based on constraints.
    """
    if importance == 'critical':
        base_r = 10
    elif importance == 'publication':
        base_r = 5
    elif importance == 'development':
        base_r = 3
    else:  # exploration
        base_r = 1

    # Adjust for model stability
    if model_stability == 'low':
        # Unstable models need more
        base_r = int(base_r * 1.5)
    elif model_stability == 'high':
        # Stable models need fewer
        base_r = max(1, int(base_r * 0.7))

    # Adjust for compute budget
    if compute_budget == 'low':
        base_r = min(base_r, 3)
    elif compute_budget == 'very_low':
        base_r = 1

    return base_r


print("\nRecommendation Examples:")
print("-" * 40)
for imp in ['exploration', 'development', 'publication', 'critical']:
    r = recommend_repetitions('medium', imp)
    print(f"Importance='{imp}': r={r}")
```

Implementing repeated CV correctly requires attention to several details. Let's cover best practices.
```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
from sklearn.base import clone
from dataclasses import dataclass
from typing import Dict, List, Any, Callable, Optional
import time
from joblib import Parallel, delayed


@dataclass
class RepeatedCVResult:
    """Container for repeated CV results."""
    mean: float
    std: float
    se: float
    ci_95: tuple
    all_scores: np.ndarray
    per_rep_means: np.ndarray
    config: Dict[str, Any]
    time_seconds: float

    def summary(self) -> str:
        return (f"{self.mean:.4f} ± {self.std:.4f} "
                f"(95% CI: [{self.ci_95[0]:.4f}, {self.ci_95[1]:.4f}])")

    def detailed_report(self) -> str:
        lines = [
            "=" * 60,
            "Repeated Cross-Validation Report",
            "=" * 60,
            f"Configuration: {self.config['n_repeats']}×{self.config['n_splits']} CV",
            f"Total evaluations: {len(self.all_scores)}",
            f"Time: {self.time_seconds:.2f}s",
            "",
            "Results:",
            f"  Mean: {self.mean:.4f}",
            f"  Std:  {self.std:.4f}",
            f"  SE:   {self.se:.4f}",
            f"  95% CI: [{self.ci_95[0]:.4f}, {self.ci_95[1]:.4f}]",
            "",
            "Per-Repetition Means:",
        ]
        for i, m in enumerate(self.per_rep_means):
            lines.append(f"  Rep {i+1}: {m:.4f}")
        return "\n".join(lines)


def run_repeated_cv(
    X: np.ndarray,
    y: np.ndarray,
    model_factory: Callable,
    n_splits: int = 10,
    n_repeats: int = 5,
    scoring: str = 'accuracy',
    random_state: int = 42,
    n_jobs: int = -1,
    return_estimators: bool = False
) -> RepeatedCVResult:
    """
    Production-quality repeated cross-validation.

    Parameters:
    -----------
    X : array-like of shape (n_samples, n_features)
    y : array-like of shape (n_samples,)
    model_factory : callable returning unfitted estimator
    n_splits : number of folds
    n_repeats : number of repetitions
    scoring : scoring metric
    random_state : random seed for reproducibility
    n_jobs : parallel jobs (-1 = all cores)
    return_estimators : whether to keep fitted models

    Returns:
    --------
    RepeatedCVResult with comprehensive results
    """
    cv = RepeatedStratifiedKFold(
        n_splits=n_splits,
        n_repeats=n_repeats,
        random_state=random_state
    )

    start_time = time.time()

    cv_results = cross_validate(
        model_factory(), X, y,
        cv=cv,
        scoring=scoring,
        return_train_score=True,
        return_estimator=return_estimators,
        n_jobs=n_jobs
    )

    elapsed = time.time() - start_time

    # Extract test scores
    all_scores = cv_results['test_score']

    # Reshape to (n_repeats, n_splits)
    scores_matrix = all_scores.reshape(n_repeats, n_splits)
    per_rep_means = scores_matrix.mean(axis=1)

    # Statistics
    mean = all_scores.mean()
    std = all_scores.std()
    se = std / np.sqrt(len(all_scores))
    ci_95 = (mean - 1.96 * se, mean + 1.96 * se)

    return RepeatedCVResult(
        mean=mean,
        std=std,
        se=se,
        ci_95=ci_95,
        all_scores=all_scores,
        per_rep_means=per_rep_means,
        config={
            'n_splits': n_splits,
            'n_repeats': n_repeats,
            'scoring': scoring,
            'random_state': random_state
        },
        time_seconds=elapsed
    )


# Example usage
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Generate data
np.random.seed(42)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           random_state=42)

# Run repeated CV
result = run_repeated_cv(
    X, y,
    lambda: RandomForestClassifier(n_estimators=100, random_state=42),
    n_splits=10, n_repeats=5, random_state=42
)

print(result.detailed_report())

# Compare models properly
print("\n" + "=" * 60)
print("Comparing Models with Repeated CV")
print("=" * 60)

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

models = [
    (lambda: LogisticRegression(max_iter=1000), "Logistic Regression"),
    (lambda: RandomForestClassifier(n_estimators=100), "Random Forest"),
    (lambda: GradientBoostingClassifier(n_estimators=100), "Gradient Boosting")
]

for model_factory, name in models:
    result = run_repeated_cv(X, y, model_factory, n_splits=10, n_repeats=5)
    print(f"{name:25s}: {result.summary()}")
```

Properly interpreting repeated CV results requires understanding what the statistics represent and their limitations.
What the Mean Represents:
The mean of r×k scores estimates the expected performance of a model trained on (k-1)/k of the data. This is still not quite the same as full-data performance due to the pessimistic bias we discussed. However, with k=10, we're estimating 90%-data performance, which is very close.
What the Standard Deviation Represents:
The standard deviation across r×k scores reflects total variability from validation-set sampling (which samples land in each fold), the choice of random partition, and any randomness in model training.
This is NOT the same as the expected variability you'd see with new test data. It's variability in the CV procedure itself.
What the Confidence Interval Means:
The 95% CI says: if we repeated this entire r×k CV procedure many times, 95% of the resulting mean estimates would fall within this interval. It's a confidence interval for the CV estimate of performance, which itself is an estimate of true performance.
The CI does NOT mean: '95% of new predictions will have accuracy in this range.' It means: 'We are 95% confident that the true expected CV estimate (and approximately true performance) lies in this range.' These are different statistical statements.
Comparing Models:
When comparing models A and B:
✓ Valid approach: Use the same CV splits for both models and compare paired differences.
✗ Invalid approach: Run independent CV for each and compare means.
Why pairing matters:
Some folds are inherently easier or harder. Both models benefit/suffer from the same folds. Paired comparison cancels this fold-specific noise.
| Fold | Model A | Model B | A - B |
|---|---|---|---|
| 1 | 0.85 | 0.83 | +0.02 |
| 2 | 0.92 | 0.89 | +0.03 |
| 3 | 0.78 | 0.76 | +0.02 |
The difference A-B is remarkably stable, even though individual scores vary widely. Paired tests leverage this stability.
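Using the three fold scores from the table above, a short calculation shows the effect: the individual scores spread over several points, while the paired differences barely move.

```python
import numpy as np

# Fold scores from the table above
scores_a = np.array([0.85, 0.92, 0.78])
scores_b = np.array([0.83, 0.89, 0.76])
diffs = scores_a - scores_b  # [0.02, 0.03, 0.02]

print(f"Std of Model A scores:     {scores_a.std():.4f}")  # ~0.057
print(f"Std of Model B scores:     {scores_b.std():.4f}")  # ~0.053
print(f"Std of paired differences: {diffs.std():.4f}")     # ~0.005
```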
```python
import numpy as np
from scipy import stats
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.datasets import make_classification


def paired_cv_comparison(X, y, model_a_factory, model_b_factory,
                         n_splits=10, n_repeats=5, random_state=42):
    """
    Properly compare two models using paired repeated CV.
    """
    np.random.seed(random_state)

    all_diffs = []

    for rep in range(n_repeats):
        cv = StratifiedKFold(n_splits=n_splits, shuffle=True,
                             random_state=random_state + rep)

        for train_idx, val_idx in cv.split(X, y):
            X_train, X_val = X[train_idx], X[val_idx]
            y_train, y_val = y[train_idx], y[val_idx]

            # Same train/val split for both models
            model_a = model_a_factory()
            model_b = model_b_factory()

            model_a.fit(X_train, y_train)
            model_b.fit(X_train, y_train)

            score_a = model_a.score(X_val, y_val)
            score_b = model_b.score(X_val, y_val)

            all_diffs.append(score_a - score_b)

    all_diffs = np.array(all_diffs)

    # Statistics
    mean_diff = all_diffs.mean()
    std_diff = all_diffs.std()
    se_diff = std_diff / np.sqrt(len(all_diffs))

    # One-sample t-test: is mean_diff significantly different from 0?
    t_stat, p_value = stats.ttest_1samp(all_diffs, 0)

    # Corrected t-test (Nadeau & Bengio, 2003)
    # Accounts for CV fold correlation
    n = len(X)
    n_test = n // n_splits
    n_train = n - n_test
    corrected_variance = (
        all_diffs.var() * (1 / len(all_diffs) + n_test / n_train)
    )
    corrected_t = mean_diff / np.sqrt(corrected_variance)
    df = len(all_diffs) - 1
    corrected_p = 2 * (1 - stats.t.cdf(abs(corrected_t), df))

    print("Paired CV Comparison Results")
    print("=" * 60)
    print(f"Configuration: {n_repeats}×{n_splits} CV = {len(all_diffs)} pairs")
    print()
    print(f"Mean difference (A - B): {mean_diff:+.4f}")
    print(f"Std of differences:      {std_diff:.4f}")
    print(f"95% CI for difference:   "
          f"[{mean_diff - 1.96 * se_diff:+.4f}, {mean_diff + 1.96 * se_diff:+.4f}]")
    print()
    print(f"Naive t-test:     t = {t_stat:.3f}, p = {p_value:.4f}")
    print(f"Corrected t-test: t = {corrected_t:.3f}, p = {corrected_p:.4f}")
    print()

    if corrected_p < 0.05:
        winner = "Model A" if mean_diff > 0 else "Model B"
        print(f"Conclusion: {winner} is significantly better (α=0.05)")
    else:
        print("Conclusion: No significant difference between models")

    return {
        'mean_diff': mean_diff,
        'std_diff': std_diff,
        'all_diffs': all_diffs,
        'naive_p': p_value,
        'corrected_p': corrected_p
    }


# Demo
np.random.seed(42)
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           random_state=42)

result = paired_cv_comparison(
    X, y,
    lambda: RandomForestClassifier(n_estimators=100, random_state=42),
    lambda: GradientBoostingClassifier(n_estimators=100, random_state=42),
    n_splits=10, n_repeats=5
)
```

Beyond standard repeated k-fold, several specialized configurations serve specific purposes.
```python
import numpy as np
from sklearn.model_selection import (
    RepeatedStratifiedKFold, ShuffleSplit, cross_val_score
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification


def compare_cv_strategies(X, y, model_factory, n_iterations=50):
    """
    Compare different repeated CV strategies.
    """
    strategies = [
        # Standard repeated 10-fold (5 repeats of 10-fold = 5×10 CV)
        ("5×10 Repeated KFold",
         RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=42)),

        # 5×2 CV for statistical testing
        ("5×2 CV (for t-test)",
         RepeatedStratifiedKFold(n_splits=2, n_repeats=5, random_state=42)),

        # Monte Carlo (Shuffle-Split)
        ("Monte Carlo (50 × 20%)",
         ShuffleSplit(n_splits=50, test_size=0.2, random_state=42)),

        # More folds, fewer repeats (2 repeats of 20-fold = 2×20 CV)
        ("2×20 Repeated KFold",
         RepeatedStratifiedKFold(n_splits=20, n_repeats=2, random_state=42)),
    ]

    print("Comparison of CV Strategies")
    print("=" * 70)
    print(f"{'Strategy':<25} {'Mean':>8} {'Std':>8} {'SE':>8} {'Evals':>8}")
    print("-" * 70)

    for name, cv in strategies:
        scores = cross_val_score(model_factory(), X, y, cv=cv)
        print(f"{name:<25} {scores.mean():>8.4f} {scores.std():>8.4f} "
              f"{scores.std() / np.sqrt(len(scores)):>8.4f} {len(scores):>8}")

    return strategies


# Generate data
np.random.seed(42)
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           random_state=42)

strategies = compare_cv_strategies(
    X, y,
    lambda: RandomForestClassifier(n_estimators=100, random_state=42)
)


# 5×2 CV t-test implementation
from scipy import stats


def five_times_two_cv_test(X, y, model_a_factory, model_b_factory,
                           random_state=42):
    """
    Dietterich's 5×2 CV paired t-test.

    Designed to have approximately correct Type I error rate.
    """
    first_fold_diffs = []   # Difference on the first fold of each repetition
    differences_sq_sum = 0

    for rep in range(5):
        # Create 2-fold split
        np.random.seed(random_state + rep)
        indices = np.random.permutation(len(X))
        mid = len(X) // 2
        idx1, idx2 = indices[:mid], indices[mid:]

        X1, y1 = X[idx1], y[idx1]
        X2, y2 = X[idx2], y[idx2]

        # Fold 1: train on 1, test on 2
        model_a = model_a_factory().fit(X1, y1)
        model_b = model_b_factory().fit(X1, y1)
        p1_a = model_a.score(X2, y2)
        p1_b = model_b.score(X2, y2)

        # Fold 2: train on 2, test on 1
        model_a = model_a_factory().fit(X2, y2)
        model_b = model_b_factory().fit(X2, y2)
        p2_a = model_a.score(X1, y1)
        p2_b = model_b.score(X1, y1)

        d1 = p1_a - p1_b
        d2 = p2_a - p2_b
        first_fold_diffs.append(d1)

        # Variance estimate for this repetition
        d_mean = (d1 + d2) / 2
        s_sq = (d1 - d_mean)**2 + (d2 - d_mean)**2
        differences_sq_sum += s_sq

    # 5×2 CV t-statistic: first-repetition difference over pooled variance
    p1 = first_fold_diffs[0]
    t_stat = p1 / np.sqrt(differences_sq_sum / 5)

    # Approximation: t-distribution with 5 degrees of freedom
    p_value = 2 * (1 - stats.t.cdf(abs(t_stat), 5))

    print("5×2 CV Paired t-test Results")
    print("=" * 50)
    print(f"t-statistic: {t_stat:.4f}")
    print(f"p-value: {p_value:.4f}")
    print(f"Significant at α=0.05: {'Yes' if p_value < 0.05 else 'No'}")

    return {'t_stat': t_stat, 'p_value': p_value}


print("\n" + "=" * 70)
print("5×2 CV Paired t-test (Dietterich)")
print("=" * 70)
from sklearn.ensemble import GradientBoostingClassifier

result = five_times_two_cv_test(
    X, y,
    lambda: RandomForestClassifier(n_estimators=100, random_state=None),
    lambda: GradientBoostingClassifier(n_estimators=100, random_state=None)
)
```

Use 5×2 CV for formal statistical comparison of two algorithms. Use 5×10 repeated k-fold for general evaluation and reporting. Use Monte Carlo (Shuffle-Split) when you need flexible train/test ratios or very many iterations for stability analysis.
We've thoroughly explored repeated cross-validation as a method for obtaining more reliable performance estimates. The key insights: single-run CV is partition-dependent, so its result depends on the random seed; repeating the procedure over r independent partitions averages out this 'partition luck'; total variance splits into within-repetition and between-repetition components, and the between term shrinks only with more repetitions; 5×10 CV is a sensible default, with more repetitions for critical comparisons; and model comparisons should use paired differences on shared splits with a corrected test.
What's Next:
With solid understanding of repeated CV, we now turn to confidence intervals for CV estimates. While we've computed approximate CIs using the standard formula, proper confidence intervals for CV require careful treatment of the correlation structure between fold estimates. The next page develops rigorous methods for quantifying uncertainty in CV-based performance claims.
You now have a comprehensive understanding of repeated cross-validation—why it's necessary, how it works, what configurations to use, and how to implement it properly. Your model evaluations will be more reliable and your reported results more defensible.