Armed with a theoretical understanding of the bias-variance tradeoff, we now face the practical question: what value of k should I use for my specific problem?
This isn't a question with a single universal answer. The optimal k depends on your dataset size, computational budget, model stability, and ultimate goal. This page synthesizes theory, empirical research, and practical experience into actionable guidance.
By the end, you'll have a decision framework that accounts for all relevant factors and confidence in defending your k selection.
This page covers: (1) Standard k values and their tradeoffs, (2) Sample size considerations, (3) Computational budgeting, (4) Model-specific recommendations, (5) Goal-dependent selection, and (6) A practical decision tree for any situation.
While k can theoretically be any integer from 2 to n, certain values have become standard due to their favorable properties. Let's examine each in detail.
| K | Train % | Bias (relative) | Variance | Iterations | Primary Use Case |
|---|---|---|---|---|---|
| 2 | 50% | Very High | Low (independent) | 2 | Statistical tests, theoretical analysis |
| 3 | 67% | High | Low-Moderate | 3 | Quick prototyping, expensive models |
| 5 | 80% | Moderate | Moderate | 5 | General purpose, most common default |
| 10 | 90% | Low | Moderate-High | 10 | Publication quality, comprehensive eval |
| 20 | 95% | Very Low | High (correlation) | 20 | Small datasets, low-bias requirement |
| n (LOOCV) | ≈100% | Minimal | Variable (can be high) | n | Very small datasets, analytical shortcuts |
K = 2 (2-Fold Cross-Validation):
Training on only 50% of data creates substantial bias. However, 2-fold has a unique property: the two training sets are completely disjoint. This independence is valuable for statistical tests like the 5×2 CV paired t-test (introduced by Dietterich, 1998).
Advantages:
- Completely independent fold estimates
- Valid for paired statistical tests
- Only 2× training cost
Disadvantages:
- High pessimistic bias
- Only 2 estimates → high variance
- Poor for absolute performance estimation
K = 5 (5-Fold Cross-Validation):
Widely considered the best default choice. Trains on 80% of data (acceptable bias) with 5 iterations (manageable computation). Research by Breiman and Friedman found 5-fold performs as well as 10-fold for model selection in most cases.
5-fold strikes an excellent balance: 80% training data keeps bias reasonable, 5 iterations provide sufficient variance reduction, and computational cost is manageable. When in doubt about k, 5-fold is rarely a bad choice.
K = 10 (10-Fold Cross-Validation):
The gold standard for thorough evaluation. Training on 90% of data minimizes bias, and scikit-learn, R's caret, and most ML frameworks use k=10 as default. Ron Kohavi's seminal 1995 study found 10-fold optimal across many scenarios.
Advantages:
- Low pessimistic bias (90% training data)
- 10 estimates → good variance reduction
- Industry and academic standard
- Sufficient for publication-quality results
Disadvantages:
- 10× training cost (can be expensive)
- High correlation between folds
- May be overkill for quick iterations
K = n (Leave-One-Out Cross-Validation):
The extreme case, covered in detail previously. Use when n is small and computational shortcuts exist.
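To make the "analytical shortcuts" concrete, here is a minimal sketch of the classic closed-form LOOCV result for ordinary least squares: the leave-one-out residual for sample i equals the ordinary residual divided by (1 − h_ii), where h_ii is the i-th diagonal of the hat matrix. The helper name below is illustrative, not a library function, and the explicit LOOCV run is included only as a sanity check.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score


def loocv_mse_linear(X, y):
    """Exact LOOCV mean squared error for OLS via the hat-matrix shortcut."""
    X1 = np.column_stack([np.ones(len(X)), X])        # add intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)     # single OLS fit on all data
    residuals = y - X1 @ beta
    H = X1 @ np.linalg.solve(X1.T @ X1, X1.T)         # hat matrix
    leverages = np.diag(H)
    loo_residuals = residuals / (1 - leverages)       # exact leave-one-out residuals
    return np.mean(loo_residuals ** 2)


X, y = make_regression(n_samples=50, n_features=5, noise=10.0, random_state=0)
print(f"LOOCV MSE (closed form, 1 fit):  {loocv_mse_linear(X, y):.3f}")

# Explicit LOOCV (n fits) should agree with the closed-form value
explicit = -cross_val_score(LinearRegression(), X, y,
                            cv=LeaveOneOut(),
                            scoring='neg_mean_squared_error').mean()
print(f"LOOCV MSE (explicit, n fits):    {explicit:.3f}")
```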
```python
import numpy as np
import time
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification


def comprehensive_k_comparison(X, y, model_factory, k_values=[2, 3, 5, 10, 20]):
    """
    Comprehensively compare different k values.
    Measures: score, variance, time, and estimate stability.
    """
    n = len(X)
    results = []

    print(f"{'K':>4} {'Train%':>7} {'Mean':>8} {'Std':>8} {'Time':>8} {'Stable?':>8}")
    print("-" * 60)

    for k in k_values:
        if k > n:
            continue

        train_frac = (k - 1) / k * 100

        # Time the CV process
        start = time.time()
        scores = cross_val_score(
            model_factory(), X, y,
            cv=KFold(n_splits=k, shuffle=True, random_state=42),
            scoring='accuracy'
        )
        elapsed = time.time() - start

        # Stability: run with different random seeds
        cv_means = []
        for seed in range(10):
            s = cross_val_score(
                model_factory(), X, y,
                cv=KFold(n_splits=k, shuffle=True, random_state=seed),
                scoring='accuracy'
            ).mean()
            cv_means.append(s)

        stability = np.std(cv_means)  # Lower = more stable across partitions
        is_stable = "Yes" if stability < 0.01 else "No"

        print(f"{k:>4} {train_frac:>6.0f}% {scores.mean():>8.4f} "
              f"{scores.std():>8.4f} {elapsed:>7.2f}s {is_stable:>8}")

        results.append({
            'k': k,
            'train_fraction': train_frac,
            'mean_score': scores.mean(),
            'std_score': scores.std(),
            'time': elapsed,
            'stability': stability
        })

    return results


# Generate test data
np.random.seed(42)
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=42)

print("=" * 60)
print("K-Value Comparison: Random Forest on 1000 samples")
print("=" * 60)

results = comprehensive_k_comparison(
    X, y,
    lambda: RandomForestClassifier(n_estimators=100, random_state=42)
)

# Also test with smaller dataset to see bias effect
print("\n" + "=" * 60)
print("K-Value Comparison: Random Forest on 100 samples")
print("=" * 60)

X_small, y_small = X[:100], y[:100]
results_small = comprehensive_k_comparison(
    X_small, y_small,
    lambda: RandomForestClassifier(n_estimators=100, random_state=42),
    k_values=[2, 3, 5, 10, 20, 50]
)
```

Sample size profoundly affects the optimal choice of k. The key insight: the smaller your dataset, the more each sample matters—both for training quality and validation reliability.
The Fundamental Tradeoffs by Sample Size: with a small dataset, every withheld sample weakens training noticeably (more pessimistic bias) and validation folds become tiny (noisier estimates), which pushes you toward larger k or LOOCV. With a large dataset, even 80% of the data is plenty for training and every fold is large, so smaller k (typically 5) is sufficient.
The Validation Set Size Perspective:
Another way to think about k: each validation fold should be large enough to give reliable performance estimates. With k-fold, each validation set has n/k samples.
| n | k=5 val size | k=10 val size | k=20 val size |
|---|---|---|---|
| 50 | 10 | 5 | 2-3 |
| 100 | 20 | 10 | 5 |
| 500 | 100 | 50 | 25 |
| 1000 | 200 | 100 | 50 |
| 5000 | 1000 | 500 | 250 |
For classification with p classes, you generally want at least 10-20 samples per class per validation set for reliable estimates. If n/k/p < 10, consider smaller k or stratification.
With 5 samples in a validation fold, each fold's classification accuracy can only take values in multiples of 20% (0%, 20%, 40%, ..., 100%). This discretization adds noise to per-fold estimates. LOOCV takes this to the extreme (each fold is a single right/wrong prediction), but the aggregate estimate, averaged over all n folds, remains fine-grained.
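A quick sketch of both rules of thumb, the per-class count and the score granularity; the helper below is illustrative (not a library function) and assumes roughly balanced, stratified folds.

```python
def validation_fold_diagnostics(n, k, n_classes=2):
    """Report validation-fold size, per-class count, and accuracy granularity."""
    val_size = n // k                  # samples per validation fold
    per_class = val_size // n_classes  # rough per-class count (assumes balance + stratification)
    granularity = 1.0 / val_size       # smallest possible change in a fold's accuracy
    verdict = "OK" if per_class >= 10 else "consider smaller k"
    print(f"n={n:5d}, k={k:3d}: fold size={val_size:4d}, "
          f"~{per_class:3d}/class, accuracy steps of {granularity:.1%} -> {verdict}")


for n, k in [(50, 10), (100, 5), (100, 20), (500, 10), (1000, 10)]:
    validation_fold_diagnostics(n, k)
```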
```python
import numpy as np
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification


def analyze_k_by_sample_size(sample_sizes=[30, 50, 100, 200, 500, 1000],
                             k_values=[3, 5, 10, 20],
                             n_repeats=50):
    """
    Analyze how optimal k varies with sample size.
    """
    print(f"{'n':>6} | ", end="")
    for k in k_values:
        print(f"k={k:2d} (mean±std) |", end=" ")
    print()
    print("-" * (6 + len(k_values) * 22))

    all_results = {}

    for n in sample_sizes:
        line = f"{n:>6} | "
        all_results[n] = {}

        for k in k_values:
            if k > n:
                line += f"{'N/A':^18} |"
                continue

            # Run CV multiple times with different data samples
            cv_means = []
            for rep in range(n_repeats):
                X, y = make_classification(
                    n_samples=n, n_features=10, n_informative=5,
                    random_state=rep
                )
                scores = cross_val_score(
                    LogisticRegression(max_iter=1000), X, y,
                    cv=KFold(n_splits=k, shuffle=True, random_state=rep)
                )
                cv_means.append(scores.mean())

            mean_cv = np.mean(cv_means)
            std_cv = np.std(cv_means)
            all_results[n][k] = {'mean': mean_cv, 'std': std_cv}
            line += f"{mean_cv:.3f}±{std_cv:.3f} |"

        print(line)

    # Analysis
    print("\nObservations:")
    print("-" * 60)

    # Compare variance across k for each n
    for n in sample_sizes:
        if n >= 50:
            stds = [(k, all_results[n][k]['std'])
                    for k in k_values if k in all_results[n]]
            min_std_k = min(stds, key=lambda x: x[1])
            print(f"n={n:4d}: Lowest variance at k={min_std_k[0]} "
                  f"(std={min_std_k[1]:.4f})")

    return all_results


# Run analysis
print("=" * 80)
print("How Sample Size Affects Optimal K (Logistic Regression)")
print("=" * 80)
print()

results = analyze_k_by_sample_size()


# Recommendation function
def recommend_k_for_sample_size(n, n_classes=2, compute_budget='medium'):
    """
    Recommend k based on sample size.
    """
    min_val_per_class = 10  # Minimum samples per class in validation

    # Calculate the maximum k that still gives a sufficient validation size
    max_k_by_val = n // (min_val_per_class * n_classes)

    # Base recommendation
    if n < 50:
        rec = n  # LOOCV
        reason = "Very small sample: LOOCV recommended"
    elif n < 100:
        rec = min(n // 10, 10)  # At least 10 per fold
        reason = f"Small sample: k={rec} ensures {n//rec} per fold"
    elif n < 500:
        rec = 10 if compute_budget != 'low' else 5
        reason = f"Medium sample: k={rec} balances bias/variance"
    else:
        rec = 5 if compute_budget == 'low' else 10
        reason = f"Large sample: k={rec} is sufficient"

    # Ensure valid k
    rec = max(2, min(rec, max_k_by_val, n))

    return rec, reason


# Examples
print("\n" + "=" * 80)
print("Recommended K by Sample Size")
print("=" * 80)
for n in [30, 75, 150, 500, 2000, 10000]:
    k, reason = recommend_k_for_sample_size(n)
    print(f"n={n:5d}: recommended k={k:3d} - {reason}")
```

The choice of k directly impacts computational cost: k-fold CV requires training k models. For expensive algorithms, this can be prohibitive.
Time Complexity Analysis:
Let T(n) be the training time for n samples. K-fold CV takes approximately:
$$\text{CV Time} \approx k \cdot T\left(n \cdot \frac{k-1}{k}\right)$$
For many algorithms, T(n) is superlinear:
| Algorithm | T(n) | k=10 vs k=5 total time (approx.) |
|---|---|---|
| Linear Regression | O(n·p²) | ≈2.3× |
| SVM (RBF kernel) | O(n²) - O(n³) | ≈2.5× - 2.9× |
| Random Forest | O(trees·n·log(n)·p) | ≈2.3× |
| Gradient Boosting | O(trees·n·log(n)·p) | ≈2.3× |
| Neural Network | O(epochs·n·layers) | ≈2.3× |
| k-NN (prediction only) | O(n) per query | ≈1.1× |
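The ratios in the table follow directly from the formula above. A small sketch, assuming training time scales as n^a for a few representative exponents:

```python
def cv_time_ratio(a, k_small=5, k_large=10):
    """Ratio of total CV time for k_large vs k_small when T(n) is proportional to n**a."""
    t = lambda frac: frac ** a  # relative training time on a fraction of the data
    return (k_large * t((k_large - 1) / k_large)) / (k_small * t((k_small - 1) / k_small))


for a, label in [(1.0, "linear, O(n)"), (2.0, "quadratic, O(n^2)"), (3.0, "cubic, O(n^3)")]:
    print(f"T(n) ~ {label:18s}: 10-fold / 5-fold time ≈ {cv_time_ratio(a):.2f}x")
# Prints roughly 2.25x, 2.53x, 2.85x: superlinear algorithms widen the gap beyond 2x,
# because 10-fold also trains on larger subsets (90% vs 80%).
```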
10-fold typically costs a bit more than twice as much as 5-fold (10 fits on larger training sets versus 5 fits on smaller ones). For large models or datasets, this difference matters. Use 5-fold for development and iteration, 10-fold for final evaluation.
Strategies for Expensive Models:
Staged evaluation: Use 3-fold for initial experiments, 5-fold for promising models, 10-fold for final reports.
Parallelization: Each fold is independent—train them in parallel on multiple cores/machines.
Subsampling: For very large datasets, subsample before CV (carefully! handle stratification).
Warm starts: Some algorithms support warm starting from a previous solution, reducing per-fold training time.
Early stopping: Use validation performance to stop training early, especially for iterative algorithms.
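As an example of the early-stopping strategy, scikit-learn's gradient boosting can stop adding trees once an internal validation split stops improving, which shortens each of the k fits. A minimal sketch (the specific tree counts and patience value are illustrative):

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

# Fixed number of trees vs. early stopping on a 10% internal validation split
full = GradientBoostingClassifier(n_estimators=500, random_state=42)
early = GradientBoostingClassifier(n_estimators=500, validation_fraction=0.1,
                                   n_iter_no_change=10, random_state=42)

for name, model in [("500 trees, no early stopping", full),
                    ("early stopping (patience=10)", early)]:
    start = time.time()
    scores = cross_val_score(model, X, y, cv=5, n_jobs=-1)
    print(f"{name:30s}: {scores.mean():.4f} ± {scores.std():.4f} "
          f"({time.time() - start:.1f}s)")
```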
```python
import numpy as np
import time
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from joblib import Parallel, delayed


def time_cv_strategies(X, y, model_factory, k_values=[3, 5, 10]):
    """
    Compare timing of different k values and parallelization.
    """
    results = []

    for k in k_values:
        # Sequential
        start = time.time()
        scores = cross_val_score(
            model_factory(), X, y,
            cv=KFold(n_splits=k, shuffle=True, random_state=42),
            n_jobs=1
        )
        seq_time = time.time() - start

        # Parallel (all cores)
        start = time.time()
        scores_par = cross_val_score(
            model_factory(), X, y,
            cv=KFold(n_splits=k, shuffle=True, random_state=42),
            n_jobs=-1  # Use all cores
        )
        par_time = time.time() - start

        speedup = seq_time / par_time if par_time > 0 else 0

        print(f"k={k:2d}: Sequential={seq_time:6.2f}s, "
              f"Parallel={par_time:6.2f}s, "
              f"Speedup={speedup:.1f}x, "
              f"Score={scores.mean():.4f}")

        results.append({
            'k': k,
            'sequential_time': seq_time,
            'parallel_time': par_time,
            'speedup': speedup,
            'score': scores.mean()
        })

    return results


# Generate test data
np.random.seed(42)
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

print("=" * 70)
print("CV Timing: Gradient Boosting (100 trees)")
print("=" * 70)

results = time_cv_strategies(
    X, y,
    lambda: GradientBoostingClassifier(n_estimators=100, random_state=42)
)

# Staged evaluation strategy
print("\n" + "=" * 70)
print("Staged Evaluation Strategy Demo")
print("=" * 70)


def staged_cv_evaluation(X, y, model_factory, stage_ks=[3, 5, 10], n_candidates=5):
    """
    Demonstrate staged CV: quick evaluation of many models,
    thorough evaluation of promising ones.
    """
    print(f"\nStage 1: Quick 3-fold CV on {n_candidates} candidate models")
    print("-" * 50)

    # Simulate multiple candidate models
    candidates = []
    for i in range(n_candidates):
        start = time.time()
        scores = cross_val_score(model_factory(), X, y, cv=3, n_jobs=-1)
        elapsed = time.time() - start
        candidates.append({'id': i, 'score': scores.mean(), 'time': elapsed})
        print(f"  Model {i}: CV3 score = {scores.mean():.4f} ({elapsed:.2f}s)")

    total_stage1 = sum(c['time'] for c in candidates)
    print(f"  Stage 1 total time: {total_stage1:.2f}s")

    # Select top 2
    top_candidates = sorted(candidates, key=lambda x: x['score'], reverse=True)[:2]

    print(f"\nStage 2: Thorough 10-fold CV on top 2 models")
    print("-" * 50)

    for cand in top_candidates:
        start = time.time()
        scores = cross_val_score(model_factory(), X, y, cv=10, n_jobs=-1)
        elapsed = time.time() - start
        print(f"  Model {cand['id']}: CV10 score = {scores.mean():.4f} "
              f"± {scores.std():.4f} ({elapsed:.2f}s)")

    print(f"\nTotal time with staged approach: ~{total_stage1 + elapsed*2:.2f}s")
    print(f"Time if we did 10-fold on all {n_candidates} models: "
          f"~{elapsed * n_candidates:.2f}s")
    print(f"Savings: {(1 - (total_stage1 + elapsed*2)/(elapsed*n_candidates))*100:.0f}%")


staged_cv_evaluation(X, y,
                     lambda: GradientBoostingClassifier(n_estimators=50, random_state=42))
```

| Budget | Iterations (k) | Best For | Tradeoff |
|---|---|---|---|
| Very Low | 3 | Quick experiments, expensive models | Higher bias, faster iteration |
| Low | 5 | Development, model selection | Good balance, half the time of k=10 |
| Medium | 10 | Standard evaluation, publication | Industry standard, thorough |
| High | 10 + 5 repeats | Rigorous evaluation, competitions | Low variance estimates |
| Very High | 10 + 10 repeats or LOOCV | Formal research, small samples | Maximum reliability |
Different model types have different characteristics that affect optimal k selection. Stability, training speed, and learning curve shape all matter.
| Model Type | Recommended K | Rationale | Special Considerations |
|---|---|---|---|
| Linear Models | 5 or 10 | Stable, fast training, minor variance | Can use LOOCV with GCV shortcut |
| Decision Trees | 10 + repeats | Unstable, high variance between partitions | Multiple repeats essential |
| Random Forests | 5 or 10 | Ensemble averaging adds stability | Consider OOB error as alternative |
| Gradient Boosting | 5 | Moderate stability, expensive training | Use early stopping to reduce time |
| Neural Networks | 3-5 | Very expensive, good internal validation | Use validation set for early stopping |
| SVM | 5 or 10 | Moderate stability, can be expensive | RBF kernel more stable than poly |
| k-NN | 10 or LOOCV | No training step; only prediction cost | LOOCV adds little extra cost since there is no model to refit |
| Naive Bayes | 5 or 10 | Very stable and fast | Almost any k works well |
Random Forests provide out-of-bag (OOB) error estimates 'for free' as a byproduct of bootstrap sampling. OOB error approximates LOOCV without additional computation. Consider using OOB error instead of CV for RF models, especially for computational savings.
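A short sketch comparing the OOB estimate with a 5-fold CV estimate on the same data; the two numbers typically land close to each other, and the OOB figure comes from a single fit.

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=42)

# OOB estimate: one fit; each tree is scored on the samples it never saw
start = time.time()
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X, y)
print(f"OOB accuracy:    {rf.oob_score_:.4f} ({time.time() - start:.1f}s, single fit)")

# 5-fold CV estimate: five fits
start = time.time()
scores = cross_val_score(RandomForestClassifier(n_estimators=200, random_state=42),
                         X, y, cv=5)
print(f"5-fold accuracy: {scores.mean():.4f} ± {scores.std():.4f} "
      f"({time.time() - start:.1f}s, five fits)")
```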
Model Stability and K:
Recall that model stability affects the correlation between fold estimates. Unstable models (high-variance learners such as unpruned decision trees) benefit from larger k and, above all, from repeated CV: averaging over several random partitions dampens the partition-to-partition variance that instability creates.
Neural Network Considerations:
Deep learning models are expensive to retrain k times, and they typically carry their own internal validation split for early stopping and learning-rate scheduling. Common practice for neural networks is therefore 3-5 fold CV, or even a single held-out validation set when each training run is long.
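A sketch of that "internal validation" pattern using scikit-learn's MLPClassifier as a stand-in for a larger network: a small k for the outer CV, with early stopping on a held-out fraction inside each fit. The architecture and patience values here are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=10, random_state=42)

# Each of the 3 outer folds trains a network that monitors its own 10%
# validation split and stops once that score stops improving.
net = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64, 32), early_stopping=True,
                  validation_fraction=0.1, n_iter_no_change=10,
                  max_iter=500, random_state=42)
)
scores = cross_val_score(net, X, y, cv=3, n_jobs=-1)
print(f"3-fold CV with internal early stopping: {scores.mean():.4f} ± {scores.std():.4f}")
```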
```python
import numpy as np
from sklearn.model_selection import cross_val_score, KFold, RepeatedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification


def measure_cv_stability(X, y, model_factory, model_name, k=5, n_repeats=20):
    """
    Measure stability of CV estimates across different random partitions.
    """
    cv_means = []
    for seed in range(n_repeats):
        kfold = KFold(n_splits=k, shuffle=True, random_state=seed)
        scores = cross_val_score(model_factory(), X, y, cv=kfold)
        cv_means.append(scores.mean())

    cv_means = np.array(cv_means)
    return {
        'model': model_name,
        'mean': cv_means.mean(),
        'std': cv_means.std(),
        'range': cv_means.max() - cv_means.min(),
        'coefficient_of_variation': cv_means.std() / cv_means.mean()
    }


# Generate data
np.random.seed(42)
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=10, random_state=42)

models = [
    (lambda: LogisticRegression(max_iter=1000), "Logistic Regression"),
    (lambda: DecisionTreeClassifier(max_depth=None), "Decision Tree (unpruned)"),
    (lambda: DecisionTreeClassifier(max_depth=5), "Decision Tree (depth=5)"),
    (lambda: RandomForestClassifier(n_estimators=100), "Random Forest (100 trees)"),
    (lambda: KNeighborsClassifier(n_neighbors=5), "k-NN (k=5)"),
]

print("=" * 70)
print("CV Stability Comparison Across Models (5-fold, 20 repeats)")
print("=" * 70)
print(f"{'Model':<30} {'Mean':>8} {'Std':>8} {'Range':>8} {'CoV':>8}")
print("-" * 70)

stability_results = []
for model_factory, name in models:
    result = measure_cv_stability(X, y, model_factory, name)
    stability_results.append(result)
    print(f"{name:<30} {result['mean']:>8.4f} {result['std']:>8.4f} "
          f"{result['range']:>8.4f} {result['coefficient_of_variation']:>8.4f}")

# Recommend based on stability
print("\n" + "=" * 70)
print("Recommendations Based on Stability")
print("=" * 70)

for result in stability_results:
    cov = result['coefficient_of_variation']
    if cov < 0.01:
        rec = "5-fold sufficient; very stable"
    elif cov < 0.03:
        rec = "5-10 fold recommended; moderately stable"
    else:
        rec = "10-fold with repeats needed; unstable"
    print(f"{result['model']:<30}: {rec}")
```

Different use cases for cross-validation have different requirements. Your goal should influence k selection.
When comparing models A and B, both experience similar pessimistic bias from training on (k-1)/k of data. The bias largely cancels in the comparison: if A beats B with 5-fold, A will very likely beat B with 10-fold too. This is why smaller k is often acceptable for model selection.
Statistical Testing Considerations:
When you need statistical tests to determine if one model is significantly better than another, k selection interacts with the test validity.
The 5×2 CV paired t-test (Dietterich, 1998):
For comparing two algorithms, run 2-fold CV five times, each time with a different random split. In each replication, record the difference between the two algorithms' scores on each of the two folds; the test statistic divides one of these differences by the square root of the averaged per-replication variances and is referred to a t distribution with 5 degrees of freedom.
This test is designed to have correct Type I error rates (not too many false positives), which standard CV-based t-tests can fail to achieve.
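A compact implementation sketch of the 5×2cv paired t-test; the function and variable names are mine, but the statistic follows the recipe above (first-fold, first-replication difference over the averaged per-replication variances, 5 degrees of freedom).

```python
import numpy as np
from scipy import stats
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold


def five_by_two_cv_ttest(model_a, model_b, X, y, random_state=0):
    """Dietterich's 5x2cv paired t-test for comparing two classifiers."""
    variances, diffs = [], []
    for rep in range(5):
        cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=random_state + rep)
        p = []  # score differences on the two folds of this replication
        for train_idx, test_idx in cv.split(X, y):
            a = clone(model_a).fit(X[train_idx], y[train_idx])
            b = clone(model_b).fit(X[train_idx], y[train_idx])
            p.append(a.score(X[test_idx], y[test_idx]) -
                     b.score(X[test_idx], y[test_idx]))
        p_bar = (p[0] + p[1]) / 2
        variances.append((p[0] - p_bar) ** 2 + (p[1] - p_bar) ** 2)
        diffs.append(p)
    t = diffs[0][0] / np.sqrt(np.mean(variances))  # first-fold, first-replication difference
    p_value = 2 * stats.t.sf(abs(t), df=5)         # two-sided, 5 degrees of freedom
    return t, p_value


X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, random_state=42)
t, p = five_by_two_cv_ttest(RandomForestClassifier(n_estimators=100, random_state=42),
                            GradientBoostingClassifier(n_estimators=100, random_state=42),
                            X, y)
print(f"5x2cv paired t-test: t={t:.3f}, p={p:.4f}")
```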
The Corrected Resampled Paired t-test (Nadeau & Bengio, 2003):
For standard k-fold (or repeated k-fold) CV, the fold estimates are not independent because their training sets overlap, so a naive paired t-test underestimates the variance and is overconfident. The Nadeau-Bengio correction inflates the variance term by a factor of (1/J + n_test/n_train), where J is the number of fold estimates, yielding much better calibrated p-values.
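A sketch of the corrected statistic, again with illustrative names; the fold scores at the bottom are made-up numbers purely to show the call.

```python
import numpy as np
from scipy import stats


def corrected_paired_ttest(scores_a, scores_b, n_train, n_test):
    """Nadeau & Bengio (2003) corrected resampled paired t-test on per-fold scores."""
    d = np.asarray(scores_a) - np.asarray(scores_b)
    J = len(d)
    mean_d = d.mean()
    var_d = d.var(ddof=1)
    # Correction: fold estimates are not independent, so inflate the variance term
    corrected_var = (1 / J + n_test / n_train) * var_d
    t = mean_d / np.sqrt(corrected_var)
    p_value = 2 * stats.t.sf(abs(t), df=J - 1)
    return t, p_value


# Hypothetical per-fold accuracies from a 10-fold comparison on n=1000
scores_a = [0.91, 0.93, 0.90, 0.92, 0.94, 0.91, 0.93, 0.92, 0.90, 0.93]
scores_b = [0.90, 0.92, 0.91, 0.91, 0.92, 0.90, 0.92, 0.91, 0.89, 0.92]
t, p = corrected_paired_ttest(scores_a, scores_b, n_train=900, n_test=100)
print(f"Corrected paired t-test: t={t:.3f}, p={p:.4f}")
```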
```python
import numpy as np
from sklearn.model_selection import (
    cross_val_score, RepeatedStratifiedKFold, GridSearchCV
)
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from scipy import stats


def goal_oriented_cv_demo(X, y):
    """
    Demonstrate different CV strategies for different goals.
    """
    # Goal 1: Model Selection (which model is best?)
    print("=" * 70)
    print("GOAL 1: Model Selection")
    print("=" * 70)
    print("Using 5-fold CV (bias cancels in comparisons)")
    print()

    models = [
        (LogisticRegression(max_iter=1000), "Logistic Regression"),
        (RandomForestClassifier(n_estimators=50), "Random Forest"),
        (GradientBoostingClassifier(n_estimators=50), "Gradient Boosting"),
    ]

    for model, name in models:
        scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
        print(f"{name:25s}: {scores.mean():.4f} ± {scores.std():.4f}")

    # Goal 2: Hyperparameter Tuning (fast iteration)
    print("\n" + "=" * 70)
    print("GOAL 2: Hyperparameter Tuning")
    print("=" * 70)
    print("Using 3-fold CV for speed during grid search")
    print()

    param_grid = {
        'n_estimators': [50, 100, 200],
        'max_depth': [3, 5, 10, None]
    }

    # Fast CV for grid search
    grid_search = GridSearchCV(
        RandomForestClassifier(random_state=42),
        param_grid,
        cv=3,  # Fast iteration
        scoring='accuracy',
        n_jobs=-1
    )
    grid_search.fit(X, y)

    print(f"Best params: {grid_search.best_params_}")
    print(f"Best CV score (3-fold): {grid_search.best_score_:.4f}")

    # Thorough evaluation of best params
    best_model = RandomForestClassifier(**grid_search.best_params_, random_state=42)
    thorough_scores = cross_val_score(best_model, X, y, cv=10)
    print(f"Best model (10-fold): {thorough_scores.mean():.4f} ± {thorough_scores.std():.4f}")

    # Goal 3: Absolute Performance Estimation
    print("\n" + "=" * 70)
    print("GOAL 3: Absolute Performance Estimation")
    print("=" * 70)
    print("Using 10-fold repeated CV for low bias and variance")
    print()

    repeated_cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=42)
    scores = cross_val_score(
        RandomForestClassifier(n_estimators=100, random_state=42),
        X, y, cv=repeated_cv, scoring='accuracy'
    )

    print(f"Mean accuracy: {scores.mean():.4f}")
    print(f"Std: {scores.std():.4f}")
    print(f"95% CI: [{scores.mean() - 1.96*scores.std():.4f}, "
          f"{scores.mean() + 1.96*scores.std():.4f}]")

    # Goal 4: Statistical Comparison
    print("\n" + "=" * 70)
    print("GOAL 4: Statistical Comparison (Is A better than B?)")
    print("=" * 70)
    print("Using paired tests on CV results")
    print()

    # Run same CV splits for both models
    from sklearn.model_selection import StratifiedKFold
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

    model_a = RandomForestClassifier(n_estimators=100, random_state=42)
    model_b = GradientBoostingClassifier(n_estimators=100, random_state=42)

    scores_a = []
    scores_b = []
    for train_idx, val_idx in cv.split(X, y):
        X_train, X_val = X[train_idx], X[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]

        model_a.fit(X_train, y_train)
        model_b.fit(X_train, y_train)

        scores_a.append(model_a.score(X_val, y_val))
        scores_b.append(model_b.score(X_val, y_val))

    scores_a = np.array(scores_a)
    scores_b = np.array(scores_b)

    # Paired t-test (naive - for illustration)
    t_stat, p_value = stats.ttest_rel(scores_a, scores_b)

    print(f"RF mean: {scores_a.mean():.4f}, GB mean: {scores_b.mean():.4f}")
    print(f"Paired t-test: t={t_stat:.3f}, p={p_value:.4f}")
    print(f"Significant at α=0.05: {'Yes' if p_value < 0.05 else 'No'}")
    print("\nNote: For rigorous testing, use corrected t-test or 5x2 CV test")


# Generate data and run demo
np.random.seed(42)
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=42)
goal_oriented_cv_demo(X, y)
```

Let's synthesize everything into a practical decision tree for selecting k.
When you're unsure: (1) k=5 for exploratory work and model iteration, (2) k=10 for final evaluation and reporting, (3) k=10 with 5 repeats for publication-quality results. These choices are rarely wrong.
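The decision logic from this page can be condensed into a small helper; the thresholds below are the ones discussed above, and the function name and argument names are illustrative.

```python
def choose_k(n, goal="estimate", model_unstable=False, compute_budget="medium"):
    """Condense the page's guidance into a (k, n_repeats) recommendation."""
    if n < 50:
        return n, 1                      # very small dataset: LOOCV
    if goal in ("tuning", "model_selection") or compute_budget == "low":
        return 5, 1                      # bias cancels in comparisons; keep it cheap
    if goal == "statistical_test":
        return 2, 5                      # 5x2cv-style setup for paired testing
    k = 10                               # default for absolute performance estimates
    repeats = 5 if (model_unstable or goal == "publication") else 1
    return k, repeats


for scenario in [dict(n=40), dict(n=800, goal="tuning"),
                 dict(n=800, goal="publication", model_unstable=True),
                 dict(n=5000, goal="statistical_test")]:
    k, r = choose_k(**scenario)
    print(f"{scenario}: k={k}, repeats={r}")
```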
Finally, avoid the common pitfalls that undermine k-fold cross-validation in practice; most of them come down to reproducibility and reporting.
Many published ML results are not reproducible partly because of inadequate CV practices. Single-run CV with unfixed random seeds, inappropriate k selection, and failure to report variance all contribute. Always set random states and report full results.
```python
import numpy as np
from sklearn.model_selection import (
    cross_val_score, StratifiedKFold, RepeatedStratifiedKFold
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification


def reproducible_cv_evaluation(X, y, model_factory, n_splits=10,
                               n_repeats=5, random_state=42):
    """
    Best practice CV evaluation with reproducibility and proper reporting.
    """
    # Always use stratified CV for classification
    if n_repeats > 1:
        cv = RepeatedStratifiedKFold(
            n_splits=n_splits, n_repeats=n_repeats, random_state=random_state
        )
        cv_name = f"Repeated Stratified {n_splits}-Fold (×{n_repeats})"
    else:
        cv = StratifiedKFold(
            n_splits=n_splits, shuffle=True, random_state=random_state
        )
        cv_name = f"Stratified {n_splits}-Fold"

    # Run CV
    scores = cross_val_score(model_factory(), X, y, cv=cv, scoring='accuracy')

    # Compute statistics
    mean_score = scores.mean()
    std_score = scores.std()
    se_score = std_score / np.sqrt(len(scores))  # Standard error

    # 95% confidence interval (approximate)
    ci_low = mean_score - 1.96 * se_score
    ci_high = mean_score + 1.96 * se_score

    # Report
    print(f"CV Strategy: {cv_name}")
    print(f"Random State: {random_state}")
    print(f"Total Evaluations: {len(scores)}")
    print(f"\nResults:")
    print(f"  Mean Accuracy: {mean_score:.4f}")
    print(f"  Standard Deviation: {std_score:.4f}")
    print(f"  Standard Error: {se_score:.4f}")
    print(f"  95% CI: [{ci_low:.4f}, {ci_high:.4f}]")
    print(f"\nAll scores: {', '.join(f'{s:.4f}' for s in scores[:10])}" +
          ("..." if len(scores) > 10 else ""))

    return {
        'mean': mean_score,
        'std': std_score,
        'se': se_score,
        'ci_95': (ci_low, ci_high),
        'all_scores': scores
    }


# Demo
np.random.seed(42)
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=10, random_state=42)

print("=" * 70)
print("Best Practice: Reproducible CV Evaluation")
print("=" * 70)

result = reproducible_cv_evaluation(
    X, y,
    lambda: RandomForestClassifier(n_estimators=100, random_state=42),
    n_splits=10, n_repeats=5, random_state=42
)

print("\n" + "=" * 70)
print("Reproducibility Check: Same parameters should give same results")
print("=" * 70)

result2 = reproducible_cv_evaluation(
    X, y,
    lambda: RandomForestClassifier(n_estimators=100, random_state=42),
    n_splits=10, n_repeats=5, random_state=42  # Same random state
)

print(f"\nResults match: {np.allclose(result['all_scores'], result2['all_scores'])}")
```

We've developed a comprehensive framework for selecting k. The key guidance: default to 5-fold for everyday work and model selection, use 10-fold (repeated if possible) for final or published results, reserve LOOCV for very small datasets, and adjust for model stability, computational budget, and your evaluation goal.
What's Next:
We now understand how to select k for standard k-fold cross-validation. The next page explores repeated cross-validation, which addresses the variance problem by running multiple CV iterations with different random partitions—providing more stable estimates without compromising on the bias-variance tradeoff.
You now have a complete decision framework for selecting k in cross-validation. You can confidently choose k based on sample size, computational constraints, model stability, and your specific goals—and defend that choice with theoretical and empirical reasoning.