Armed with a theoretical understanding of the bias-variance tradeoff, we now face the practical question: what value of k should I use for my specific problem?
This isn't a question with a single universal answer. The optimal k depends on your dataset size, computational budget, model stability, and ultimate goal. This page synthesizes theory, empirical research, and practical experience into actionable guidance.
By the end, you'll have a decision framework that accounts for all relevant factors and confidence in defending your k selection.
This page covers: (1) Standard k values and their tradeoffs, (2) Sample size considerations, (3) Computational budgeting, (4) Model-specific recommendations, (5) Goal-dependent selection, and (6) A practical decision tree for any situation.
While k can theoretically be any integer from 2 to n, certain values have become standard due to their favorable properties. Let's examine each in detail.
| K | Train % | Bias (relative) | Variance | Iterations | Primary Use Case |
|---|---|---|---|---|---|
| 2 | 50% | Very High | Low (independent) | 2 | Statistical tests, theoretical analysis |
| 3 | 67% | High | Low-Moderate | 3 | Quick prototyping, expensive models |
| 5 | 80% | Moderate | Moderate | 5 | General purpose, most common default |
| 10 | 90% | Low | Moderate-High | 10 | Publication quality, comprehensive eval |
| 20 | 95% | Very Low | High (correlation) | 20 | Small datasets, low-bias requirement |
| n (LOOCV) | ≈100% | Minimal | Variable (can be high) | n | Very small datasets, analytical shortcuts |
K = 2 (2-Fold Cross-Validation):
Training on only 50% of data creates substantial bias. However, 2-fold has a unique property: the two training sets are completely disjoint. This independence is valuable for statistical tests like the 5×2 CV paired t-test (introduced by Dietterich, 1998).
Advantages:
- Completely independent fold estimates
- Valid for paired statistical tests
- Only 2× training cost
Disadvantages:
- High pessimistic bias
- Only 2 estimates → high variance
- Poor for absolute performance estimation
K = 5 (5-Fold Cross-Validation):
Widely considered the best default choice. Trains on 80% of data (acceptable bias) with 5 iterations (manageable computation). Research by Breiman and Friedman found 5-fold performs as well as 10-fold for model selection in most cases.
5-fold strikes an excellent balance: 80% training data keeps bias reasonable, 5 iterations provide sufficient variance reduction, and computational cost is manageable. When in doubt about k, 5-fold is rarely a bad choice.
K = 10 (10-Fold Cross-Validation):
The gold standard for thorough evaluation. Training on 90% of data minimizes bias, and scikit-learn, R's caret, and most ML frameworks use k=10 as default. Ron Kohavi's seminal 1995 study found 10-fold optimal across many scenarios.
Advantages:
- Low pessimistic bias (90% training data)
- 10 estimates → good variance reduction
- Industry and academic standard
- Sufficient for publication-quality results
Disadvantages:
- 10× training cost (can be expensive)
- High correlation between folds
- May be overkill for quick iterations
K = n (Leave-One-Out Cross-Validation):
The extreme case, covered in detail previously. Use when n is small and computational shortcuts exist.
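To make the "analytical shortcuts" concrete, here is a minimal sketch of the classic closed-form LOOCV result for ordinary least squares: the leave-one-out residual for sample i equals the ordinary residual divided by (1 − h_ii), where h_ii is the i-th diagonal of the hat matrix. The helper name below is illustrative, not a library function, and the explicit LOOCV run is included only as a sanity check.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score


def loocv_mse_linear(X, y):
    """Exact LOOCV mean squared error for OLS via the hat-matrix shortcut."""
    X1 = np.column_stack([np.ones(len(X)), X])        # add intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)     # single OLS fit on all data
    residuals = y - X1 @ beta
    H = X1 @ np.linalg.solve(X1.T @ X1, X1.T)         # hat matrix
    leverages = np.diag(H)
    loo_residuals = residuals / (1 - leverages)       # exact leave-one-out residuals
    return np.mean(loo_residuals ** 2)


X, y = make_regression(n_samples=50, n_features=5, noise=10.0, random_state=0)
print(f"LOOCV MSE (closed form, 1 fit):  {loocv_mse_linear(X, y):.3f}")

# Explicit LOOCV (n fits) should agree with the closed-form value
explicit = -cross_val_score(LinearRegression(), X, y,
                            cv=LeaveOneOut(),
                            scoring='neg_mean_squared_error').mean()
print(f"LOOCV MSE (explicit, n fits):    {explicit:.3f}")
```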
```python
import numpy as np
import time
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification


def comprehensive_k_comparison(X, y, model_factory, k_values=[2, 3, 5, 10, 20]):
    """
    Comprehensively compare different k values.
    Measures: score, variance, time, and estimate stability.
    """
    n = len(X)
    results = []

    print(f"{'K':>4} {'Train%':>7} {'Mean':>8} {'Std':>8} {'Time':>8} {'Stable?':>8}")
    print("-" * 60)

    for k in k_values:
        if k > n:
            continue

        train_frac = (k - 1) / k * 100

        # Time the CV process
        start = time.time()
        scores = cross_val_score(
            model_factory(), X, y,
            cv=KFold(n_splits=k, shuffle=True, random_state=42),
            scoring='accuracy'
        )
        elapsed = time.time() - start

        # Stability: run with different random seeds
        cv_means = []
        for seed in range(10):
            s = cross_val_score(
                model_factory(), X, y,
                cv=KFold(n_splits=k, shuffle=True, random_state=seed),
                scoring='accuracy'
            ).mean()
            cv_means.append(s)

        stability = np.std(cv_means)  # Lower = more stable across partitions
        is_stable = "Yes" if stability < 0.01 else "No"

        print(f"{k:>4} {train_frac:>6.0f}% {scores.mean():>8.4f} "
              f"{scores.std():>8.4f} {elapsed:>7.2f}s {is_stable:>8}")

        results.append({
            'k': k,
            'train_fraction': train_frac,
            'mean_score': scores.mean(),
            'std_score': scores.std(),
            'time': elapsed,
            'stability': stability
        })

    return results


# Generate test data
np.random.seed(42)
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=42)

print("=" * 60)
print("K-Value Comparison: Random Forest on 1000 samples")
print("=" * 60)

results = comprehensive_k_comparison(
    X, y,
    lambda: RandomForestClassifier(n_estimators=100, random_state=42)
)

# Also test with smaller dataset to see bias effect
print("\n" + "=" * 60)
print("K-Value Comparison: Random Forest on 100 samples")
print("=" * 60)

X_small, y_small = X[:100], y[:100]
results_small = comprehensive_k_comparison(
    X_small, y_small,
    lambda: RandomForestClassifier(n_estimators=100, random_state=42),
    k_values=[2, 3, 5, 10, 20, 50]
)
```

Sample size profoundly affects the optimal choice of k. The key insight: the smaller your dataset, the more each sample matters—both for training quality and validation reliability.
The Fundamental Tradeoffs by Sample Size: with a small dataset, every withheld sample weakens training noticeably (more pessimistic bias) and validation folds become tiny (noisier estimates), which pushes you toward larger k or LOOCV. With a large dataset, even 80% of the data is plenty for training and every fold is large, so smaller k (typically 5) is sufficient.
The Validation Set Size Perspective:
Another way to think about k: each validation fold should be large enough to give reliable performance estimates. With k-fold, each validation set has n/k samples.
| n | k=5 val size | k=10 val size | k=20 val size |
|---|---|---|---|
| 50 | 10 | 5 | 2-3 |
| 100 | 20 | 10 | 5 |
| 500 | 100 | 50 | 25 |
| 1000 | 200 | 100 | 50 |
| 5000 | 1000 | 500 | 250 |
For classification with p classes, you generally want at least 10-20 samples per class per validation set for reliable estimates. If n/k/p < 10, consider smaller k or stratification.
With 5 samples in a validation fold, each fold's classification accuracy can only take values in multiples of 20% (0%, 20%, 40%, ..., 100%). This discretization adds noise to per-fold estimates. LOOCV takes this to the extreme (each fold is a single right/wrong prediction), but the aggregate estimate, averaged over all n folds, remains fine-grained.
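A quick sketch of both rules of thumb, the per-class count and the score granularity; the helper below is illustrative (not a library function) and assumes roughly balanced, stratified folds.

```python
def validation_fold_diagnostics(n, k, n_classes=2):
    """Report validation-fold size, per-class count, and accuracy granularity."""
    val_size = n // k                  # samples per validation fold
    per_class = val_size // n_classes  # rough per-class count (assumes balance + stratification)
    granularity = 1.0 / val_size       # smallest possible change in a fold's accuracy
    verdict = "OK" if per_class >= 10 else "consider smaller k"
    print(f"n={n:5d}, k={k:3d}: fold size={val_size:4d}, "
          f"~{per_class:3d}/class, accuracy steps of {granularity:.1%} -> {verdict}")


for n, k in [(50, 10), (100, 5), (100, 20), (500, 10), (1000, 10)]:
    validation_fold_diagnostics(n, k)
```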
```python
import numpy as np
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification


def analyze_k_by_sample_size(sample_sizes=[30, 50, 100, 200, 500, 1000],
                             k_values=[3, 5, 10, 20],
                             n_repeats=50):
    """
    Analyze how optimal k varies with sample size.
    """
    print(f"{'n':>6} | ", end="")
    for k in k_values:
        print(f"k={k:2d} (mean±std) |", end=" ")
    print()
    print("-" * (6 + len(k_values) * 22))

    all_results = {}

    for n in sample_sizes:
        line = f"{n:>6} | "
        all_results[n] = {}

        for k in k_values:
            if k > n:
                line += f"{'N/A':^18} |"
                continue

            # Run CV multiple times with different data samples
            cv_means = []
            for rep in range(n_repeats):
                X, y = make_classification(
                    n_samples=n, n_features=10, n_informative=5,
                    random_state=rep
                )
                scores = cross_val_score(
                    LogisticRegression(max_iter=1000), X, y,
                    cv=KFold(n_splits=k, shuffle=True, random_state=rep)
                )
                cv_means.append(scores.mean())

            mean_cv = np.mean(cv_means)
            std_cv = np.std(cv_means)
            all_results[n][k] = {'mean': mean_cv, 'std': std_cv}
            line += f"{mean_cv:.3f}±{std_cv:.3f} |"

        print(line)

    # Analysis
    print("\nObservations:")
    print("-" * 60)

    # Compare variance across k for each n
    for n in sample_sizes:
        if n >= 50:
            stds = [(k, all_results[n][k]['std'])
                    for k in k_values if k in all_results[n]]
            min_std_k = min(stds, key=lambda x: x[1])
            print(f"n={n:4d}: Lowest variance at k={min_std_k[0]} "
                  f"(std={min_std_k[1]:.4f})")

    return all_results


# Run analysis
print("=" * 80)
print("How Sample Size Affects Optimal K (Logistic Regression)")
print("=" * 80)
print()

results = analyze_k_by_sample_size()


# Recommendation function
def recommend_k_for_sample_size(n, n_classes=2, compute_budget='medium'):
    """
    Recommend k based on sample size.
    """
    min_val_per_class = 10  # Minimum samples per class in validation

    # Calculate the maximum k that still gives a sufficient validation size
    max_k_by_val = n // (min_val_per_class * n_classes)

    # Base recommendation
    if n < 50:
        rec = n  # LOOCV
        reason = "Very small sample: LOOCV recommended"
    elif n < 100:
        rec = min(n // 10, 10)  # At least 10 per fold
        reason = f"Small sample: k={rec} ensures {n//rec} per fold"
    elif n < 500:
        rec = 10 if compute_budget != 'low' else 5
        reason = f"Medium sample: k={rec} balances bias/variance"
    else:
        rec = 5 if compute_budget == 'low' else 10
        reason = f"Large sample: k={rec} is sufficient"

    # Ensure valid k
    rec = max(2, min(rec, max_k_by_val, n))

    return rec, reason


# Examples
print("\n" + "=" * 80)
print("Recommended K by Sample Size")
print("=" * 80)
for n in [30, 75, 150, 500, 2000, 10000]:
    k, reason = recommend_k_for_sample_size(n)
    print(f"n={n:5d}: recommended k={k:3d} - {reason}")
```

The choice of k directly impacts computational cost: k-fold CV requires training k models. For expensive algorithms, this can be prohibitive.
Time Complexity Analysis:
Let T(n) be the training time for n samples. K-fold CV takes approximately:
$$\text{CV Time} \approx k \cdot T\left(n \cdot \frac{k-1}{k}\right)$$
For many algorithms, T(n) is superlinear:
| Algorithm | T(n) | k=10 vs k=5 total time (approx.) |
|---|---|---|
| Linear Regression | O(n·p²) | ≈2.3× |
| SVM (RBF kernel) | O(n²) - O(n³) | ≈2.5× - 2.9× |
| Random Forest | O(trees·n·log(n)·p) | ≈2.3× |
| Gradient Boosting | O(trees·n·log(n)·p) | ≈2.3× |
| Neural Network | O(epochs·n·layers) | ≈2.3× |
| k-NN (prediction only) | O(n) per query | ≈1.1× |
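The ratios in the table follow directly from the formula above. A small sketch, assuming training time scales as n^a for a few representative exponents:

```python
def cv_time_ratio(a, k_small=5, k_large=10):
    """Ratio of total CV time for k_large vs k_small when T(n) is proportional to n**a."""
    t = lambda frac: frac ** a  # relative training time on a fraction of the data
    return (k_large * t((k_large - 1) / k_large)) / (k_small * t((k_small - 1) / k_small))


for a, label in [(1.0, "linear, O(n)"), (2.0, "quadratic, O(n^2)"), (3.0, "cubic, O(n^3)")]:
    print(f"T(n) ~ {label:18s}: 10-fold / 5-fold time ≈ {cv_time_ratio(a):.2f}x")
# Prints roughly 2.25x, 2.53x, 2.85x: superlinear algorithms widen the gap beyond 2x,
# because 10-fold also trains on larger subsets (90% vs 80%).
```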
10-fold typically costs a bit more than twice as much as 5-fold (10 fits on larger training sets versus 5 fits on smaller ones). For large models or datasets, this difference matters. Use 5-fold for development and iteration, 10-fold for final evaluation.
Strategies for Expensive Models:
Staged evaluation: Use 3-fold for initial experiments, 5-fold for promising models, 10-fold for final reports.
Parallelization: Each fold is independent—train them in parallel on multiple cores/machines.
Subsampling: For very large datasets, subsample before CV (carefully! handle stratification).
Warm starts: Some algorithms support warm starting from a previous solution, reducing per-fold training time.
Early stopping: Use validation performance to stop training early, especially for iterative algorithms.
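As an example of the early-stopping strategy, scikit-learn's gradient boosting can stop adding trees once an internal validation split stops improving, which shortens each of the k fits. A minimal sketch (the specific tree counts and patience value are illustrative):

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

# Fixed number of trees vs. early stopping on a 10% internal validation split
full = GradientBoostingClassifier(n_estimators=500, random_state=42)
early = GradientBoostingClassifier(n_estimators=500, validation_fraction=0.1,
                                   n_iter_no_change=10, random_state=42)

for name, model in [("500 trees, no early stopping", full),
                    ("early stopping (patience=10)", early)]:
    start = time.time()
    scores = cross_val_score(model, X, y, cv=5, n_jobs=-1)
    print(f"{name:30s}: {scores.mean():.4f} ± {scores.std():.4f} "
          f"({time.time() - start:.1f}s)")
```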
```python
import numpy as np
import time
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from joblib import Parallel, delayed


def time_cv_strategies(X, y, model_factory, k_values=[3, 5, 10]):
    """
    Compare timing of different k values and parallelization.
    """
    results = []

    for k in k_values:
        # Sequential
        start = time.time()
        scores = cross_val_score(
            model_factory(), X, y,
            cv=KFold(n_splits=k, shuffle=True, random_state=42),
            n_jobs=1
        )
        seq_time = time.time() - start

        # Parallel (all cores)
        start = time.time()
        scores_par = cross_val_score(
            model_factory(), X, y,
            cv=KFold(n_splits=k, shuffle=True, random_state=42),
            n_jobs=-1  # Use all cores
        )
        par_time = time.time() - start

        speedup = seq_time / par_time if par_time > 0 else 0

        print(f"k={k:2d}: Sequential={seq_time:6.2f}s, "
              f"Parallel={par_time:6.2f}s, "
              f"Speedup={speedup:.1f}x, "
              f"Score={scores.mean():.4f}")

        results.append({
            'k': k,
            'sequential_time': seq_time,
            'parallel_time': par_time,
            'speedup': speedup,
            'score': scores.mean()
        })

    return results


# Generate test data
np.random.seed(42)
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

print("=" * 70)
print("CV Timing: Gradient Boosting (100 trees)")
print("=" * 70)

results = time_cv_strategies(
    X, y,
    lambda: GradientBoostingClassifier(n_estimators=100, random_state=42)
)

# Staged evaluation strategy
print("\n" + "=" * 70)
print("Staged Evaluation Strategy Demo")
print("=" * 70)


def staged_cv_evaluation(X, y, model_factory, stage_ks=[3, 5, 10], n_candidates=5):
    """
    Demonstrate staged CV: quick evaluation of many models,
    thorough evaluation of promising ones.
    """
    print(f"\nStage 1: Quick 3-fold CV on {n_candidates} candidate models")
    print("-" * 50)

    # Simulate multiple candidate models
    candidates = []
    for i in range(n_candidates):
        start = time.time()
        scores = cross_val_score(model_factory(), X, y, cv=3, n_jobs=-1)
        elapsed = time.time() - start
        candidates.append({'id': i, 'score': scores.mean(), 'time': elapsed})
        print(f"  Model {i}: CV3 score = {scores.mean():.4f} ({elapsed:.2f}s)")

    total_stage1 = sum(c['time'] for c in candidates)
    print(f"  Stage 1 total time: {total_stage1:.2f}s")

    # Select top 2
    top_candidates = sorted(candidates, key=lambda x: x['score'], reverse=True)[:2]

    print(f"\nStage 2: Thorough 10-fold CV on top 2 models")
    print("-" * 50)

    for cand in top_candidates:
        start = time.time()
        scores = cross_val_score(model_factory(), X, y, cv=10, n_jobs=-1)
        elapsed = time.time() - start
        print(f"  Model {cand['id']}: CV10 score = {scores.mean():.4f} "
              f"± {scores.std():.4f} ({elapsed:.2f}s)")

    print(f"\nTotal time with staged approach: ~{total_stage1 + elapsed*2:.2f}s")
    print(f"Time if we did 10-fold on all {n_candidates} models: "
          f"~{elapsed * n_candidates:.2f}s")
    print(f"Savings: {(1 - (total_stage1 + elapsed*2)/(elapsed*n_candidates))*100:.0f}%")


staged_cv_evaluation(X, y,
                     lambda: GradientBoostingClassifier(n_estimators=50, random_state=42))
```

| Budget | Iterations (k) | Best For | Tradeoff |
|---|---|---|---|
| Very Low | 3 | Quick experiments, expensive models | Higher bias, faster iteration |
| Low | 5 | Development, model selection | Good balance, half the time of k=10 |
| Medium | 10 | Standard evaluation, publication | Industry standard, thorough |
| High | 10 + 5 repeats | Rigorous evaluation, competitions | Low variance estimates |
| Very High | 10 + 10 repeats or LOOCV | Formal research, small samples | Maximum reliability |
Different model types have different characteristics that affect optimal k selection. Stability, training speed, and learning curve shape all matter.
| Model Type | Recommended K | Rationale | Special Considerations |
|---|---|---|---|
| Linear Models | 5 or 10 | Stable, fast training, minor variance | Can use LOOCV with GCV shortcut |
| Decision Trees | 10 + repeats | Unstable, high variance between partitions | Multiple repeats essential |
| Random Forests | 5 or 10 | Ensemble averaging adds stability | Consider OOB error as alternative |
| Gradient Boosting | 5 | Moderate stability, expensive training | Use early stopping to reduce time |
| Neural Networks | 3-5 | Very expensive, good internal validation | Use validation set for early stopping |
| SVM | 5 or 10 | Moderate stability, can be expensive | RBF kernel more stable than poly |
| k-NN | 10 or LOOCV | No training step; only prediction cost | LOOCV adds little extra cost since there is no model to refit |
| Naive Bayes | 5 or 10 | Very stable and fast | Almost any k works well |
Random Forests provide out-of-bag (OOB) error estimates 'for free' as a byproduct of bootstrap sampling. OOB error approximates LOOCV without additional computation. Consider using OOB error instead of CV for RF models, especially for computational savings.
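A short sketch comparing the OOB estimate with a 5-fold CV estimate on the same data; the two numbers typically land close to each other, and the OOB figure comes from a single fit.

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=42)

# OOB estimate: one fit; each tree is scored on the samples it never saw
start = time.time()
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X, y)
print(f"OOB accuracy:    {rf.oob_score_:.4f} ({time.time() - start:.1f}s, single fit)")

# 5-fold CV estimate: five fits
start = time.time()
scores = cross_val_score(RandomForestClassifier(n_estimators=200, random_state=42),
                         X, y, cv=5)
print(f"5-fold accuracy: {scores.mean():.4f} ± {scores.std():.4f} "
      f"({time.time() - start:.1f}s, five fits)")
```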
Model Stability and K:
Recall that model stability affects the correlation between fold estimates. Unstable models (high-variance learners such as unpruned decision trees) benefit from larger k and, above all, from repeated CV: averaging over several random partitions dampens the partition-to-partition variance that instability creates.
Neural Network Considerations:
Deep learning models are expensive to retrain k times, and they typically carry their own internal validation split for early stopping and learning-rate scheduling. Common practice for neural networks is therefore 3-5 fold CV, or even a single held-out validation set when each training run is long.
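A sketch of that "internal validation" pattern using scikit-learn's MLPClassifier as a stand-in for a larger network: a small k for the outer CV, with early stopping on a held-out fraction inside each fit. The architecture and patience values here are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=10, random_state=42)

# Each of the 3 outer folds trains a network that monitors its own 10%
# validation split and stops once that score stops improving.
net = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64, 32), early_stopping=True,
                  validation_fraction=0.1, n_iter_no_change=10,
                  max_iter=500, random_state=42)
)
scores = cross_val_score(net, X, y, cv=3, n_jobs=-1)
print(f"3-fold CV with internal early stopping: {scores.mean():.4f} ± {scores.std():.4f}")
```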
```python
import numpy as np
from sklearn.model_selection import cross_val_score, KFold, RepeatedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification


def measure_cv_stability(X, y, model_factory, model_name, k=5, n_repeats=20):
    """
    Measure stability of CV estimates across different random partitions.
    """
    cv_means = []
    for seed in range(n_repeats):
        kfold = KFold(n_splits=k, shuffle=True, random_state=seed)
        scores = cross_val_score(model_factory(), X, y, cv=kfold)
        cv_means.append(scores.mean())

    cv_means = np.array(cv_means)
    return {
        'model': model_name,
        'mean': cv_means.mean(),
        'std': cv_means.std(),
        'range': cv_means.max() - cv_means.min(),
        'coefficient_of_variation': cv_means.std() / cv_means.mean()
    }


# Generate data
np.random.seed(42)
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=10, random_state=42)

models = [
    (lambda: LogisticRegression(max_iter=1000), "Logistic Regression"),
    (lambda: DecisionTreeClassifier(max_depth=None), "Decision Tree (unpruned)"),
    (lambda: DecisionTreeClassifier(max_depth=5), "Decision Tree (depth=5)"),
    (lambda: RandomForestClassifier(n_estimators=100), "Random Forest (100 trees)"),
    (lambda: KNeighborsClassifier(n_neighbors=5), "k-NN (k=5)"),
]

print("=" * 70)
print("CV Stability Comparison Across Models (5-fold, 20 repeats)")
print("=" * 70)
print(f"{'Model':<30} {'Mean':>8} {'Std':>8} {'Range':>8} {'CoV':>8}")
print("-" * 70)

stability_results = []
for model_factory, name in models:
    result = measure_cv_stability(X, y, model_factory, name)
    stability_results.append(result)
    print(f"{name:<30} {result['mean']:>8.4f} {result['std']:>8.4f} "
          f"{result['range']:>8.4f} {result['coefficient_of_variation']:>8.4f}")

# Recommend based on stability
print("\n" + "=" * 70)
print("Recommendations Based on Stability")
print("=" * 70)

for result in stability_results:
    cov = result['coefficient_of_variation']
    if cov < 0.01:
        rec = "5-fold sufficient; very stable"
    elif cov < 0.03:
        rec = "5-10 fold recommended; moderately stable"
    else:
        rec = "10-fold with repeats needed; unstable"
    print(f"{result['model']:<30}: {rec}")
```

Different use cases for cross-validation have different requirements. Your goal should influence k selection.
When comparing models A and B, both experience similar pessimistic bias from training on (k-1)/k of data. The bias largely cancels in the comparison: if A beats B with 5-fold, A will very likely beat B with 10-fold too. This is why smaller k is often acceptable for model selection.
Statistical Testing Considerations:
When you need statistical tests to determine if one model is significantly better than another, k selection interacts with the test validity.
The 5×2 CV paired t-test (Dietterich, 1998):
For comparing two algorithms, run 2-fold CV five times, each time with a different random split. In each replication, record the difference between the two algorithms' scores on each of the two folds; the test statistic divides one of these differences by the square root of the averaged per-replication variances and is referred to a t distribution with 5 degrees of freedom.
This test is designed to have correct Type I error rates (not too many false positives), which standard CV-based t-tests can fail to achieve.
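A compact implementation sketch of the 5×2cv paired t-test; the function and variable names are mine, but the statistic follows the recipe above (first-fold, first-replication difference over the averaged per-replication variances, 5 degrees of freedom).

```python
import numpy as np
from scipy import stats
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold


def five_by_two_cv_ttest(model_a, model_b, X, y, random_state=0):
    """Dietterich's 5x2cv paired t-test for comparing two classifiers."""
    variances, diffs = [], []
    for rep in range(5):
        cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=random_state + rep)
        p = []  # score differences on the two folds of this replication
        for train_idx, test_idx in cv.split(X, y):
            a = clone(model_a).fit(X[train_idx], y[train_idx])
            b = clone(model_b).fit(X[train_idx], y[train_idx])
            p.append(a.score(X[test_idx], y[test_idx]) -
                     b.score(X[test_idx], y[test_idx]))
        p_bar = (p[0] + p[1]) / 2
        variances.append((p[0] - p_bar) ** 2 + (p[1] - p_bar) ** 2)
        diffs.append(p)
    t = diffs[0][0] / np.sqrt(np.mean(variances))  # first-fold, first-replication difference
    p_value = 2 * stats.t.sf(abs(t), df=5)         # two-sided, 5 degrees of freedom
    return t, p_value


X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, random_state=42)
t, p = five_by_two_cv_ttest(RandomForestClassifier(n_estimators=100, random_state=42),
                            GradientBoostingClassifier(n_estimators=100, random_state=42),
                            X, y)
print(f"5x2cv paired t-test: t={t:.3f}, p={p:.4f}")
```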
The Corrected Resampled Paired t-test (Nadeau & Bengio, 2003):
For standard k-fold (or repeated k-fold) CV, the fold estimates are not independent because their training sets overlap, so a naive paired t-test underestimates the variance and is overconfident. The Nadeau-Bengio correction inflates the variance term by a factor of (1/J + n_test/n_train), where J is the number of fold estimates, yielding much better calibrated p-values.
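A sketch of the corrected statistic, again with illustrative names; the fold scores at the bottom are made-up numbers purely to show the call.

```python
import numpy as np
from scipy import stats


def corrected_paired_ttest(scores_a, scores_b, n_train, n_test):
    """Nadeau & Bengio (2003) corrected resampled paired t-test on per-fold scores."""
    d = np.asarray(scores_a) - np.asarray(scores_b)
    J = len(d)
    mean_d = d.mean()
    var_d = d.var(ddof=1)
    # Correction: fold estimates are not independent, so inflate the variance term
    corrected_var = (1 / J + n_test / n_train) * var_d
    t = mean_d / np.sqrt(corrected_var)
    p_value = 2 * stats.t.sf(abs(t), df=J - 1)
    return t, p_value


# Hypothetical per-fold accuracies from a 10-fold comparison on n=1000
scores_a = [0.91, 0.93, 0.90, 0.92, 0.94, 0.91, 0.93, 0.92, 0.90, 0.93]
scores_b = [0.90, 0.92, 0.91, 0.91, 0.92, 0.90, 0.92, 0.91, 0.89, 0.92]
t, p = corrected_paired_ttest(scores_a, scores_b, n_train=900, n_test=100)
print(f"Corrected paired t-test: t={t:.3f}, p={p:.4f}")
```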
```python
import numpy as np
from sklearn.model_selection import (
    cross_val_score, RepeatedStratifiedKFold, GridSearchCV
)
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from scipy import stats


def goal_oriented_cv_demo(X, y):
    """
    Demonstrate different CV strategies for different goals.
    """
    # Goal 1: Model Selection (which model is best?)
    print("=" * 70)
    print("GOAL 1: Model Selection")
    print("=" * 70)
    print("Using 5-fold CV (bias cancels in comparisons)")
    print()

    models = [
        (LogisticRegression(max_iter=1000), "Logistic Regression"),
        (RandomForestClassifier(n_estimators=50), "Random Forest"),
        (GradientBoostingClassifier(n_estimators=50), "Gradient Boosting"),
    ]

    for model, name in models:
        scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
        print(f"{name:25s}: {scores.mean():.4f} ± {scores.std():.4f}")

    # Goal 2: Hyperparameter Tuning (fast iteration)
    print("\n" + "=" * 70)
    print("GOAL 2: Hyperparameter Tuning")
    print("=" * 70)
    print("Using 3-fold CV for speed during grid search")
    print()

    param_grid = {
        'n_estimators': [50, 100, 200],
        'max_depth': [3, 5, 10, None]
    }

    # Fast CV for grid search
    grid_search = GridSearchCV(
        RandomForestClassifier(random_state=42),
        param_grid,
        cv=3,  # Fast iteration
        scoring='accuracy',
        n_jobs=-1
    )
    grid_search.fit(X, y)

    print(f"Best params: {grid_search.best_params_}")
    print(f"Best CV score (3-fold): {grid_search.best_score_:.4f}")

    # Thorough evaluation of best params
    best_model = RandomForestClassifier(**grid_search.best_params_, random_state=42)
    thorough_scores = cross_val_score(best_model, X, y, cv=10)
    print(f"Best model (10-fold): {thorough_scores.mean():.4f} ± {thorough_scores.std():.4f}")

    # Goal 3: Absolute Performance Estimation
    print("\n" + "=" * 70)
    print("GOAL 3: Absolute Performance Estimation")
    print("=" * 70)
    print("Using 10-fold repeated CV for low bias and variance")
    print()

    repeated_cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=42)
    scores = cross_val_score(
        RandomForestClassifier(n_estimators=100, random_state=42),
        X, y, cv=repeated_cv, scoring='accuracy'
    )

    print(f"Mean accuracy: {scores.mean():.4f}")
    print(f"Std: {scores.std():.4f}")
    print(f"95% CI: [{scores.mean() - 1.96*scores.std():.4f}, "
          f"{scores.mean() + 1.96*scores.std():.4f}]")

    # Goal 4: Statistical Comparison
    print("\n" + "=" * 70)
    print("GOAL 4: Statistical Comparison (Is A better than B?)")
    print("=" * 70)
    print("Using paired tests on CV results")
    print()

    # Run same CV splits for both models
    from sklearn.model_selection import StratifiedKFold
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

    model_a = RandomForestClassifier(n_estimators=100, random_state=42)
    model_b = GradientBoostingClassifier(n_estimators=100, random_state=42)

    scores_a = []
    scores_b = []
    for train_idx, val_idx in cv.split(X, y):
        X_train, X_val = X[train_idx], X[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]

        model_a.fit(X_train, y_train)
        model_b.fit(X_train, y_train)

        scores_a.append(model_a.score(X_val, y_val))
        scores_b.append(model_b.score(X_val, y_val))

    scores_a = np.array(scores_a)
    scores_b = np.array(scores_b)

    # Paired t-test (naive - for illustration)
    t_stat, p_value = stats.ttest_rel(scores_a, scores_b)

    print(f"RF mean: {scores_a.mean():.4f}, GB mean: {scores_b.mean():.4f}")
    print(f"Paired t-test: t={t_stat:.3f}, p={p_value:.4f}")
    print(f"Significant at α=0.05: {'Yes' if p_value < 0.05 else 'No'}")
    print("\nNote: For rigorous testing, use corrected t-test or 5x2 CV test")


# Generate data and run demo
np.random.seed(42)
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=42)
goal_oriented_cv_demo(X, y)
```

Let's synthesize everything into a practical decision tree for selecting k.
When you're unsure: (1) k=5 for exploratory work and model iteration, (2) k=10 for final evaluation and reporting, (3) k=10 with 5 repeats for publication-quality results. These choices are rarely wrong.
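The decision logic from this page can be condensed into a small helper; the thresholds below are the ones discussed above, and the function name and argument names are illustrative.

```python
def choose_k(n, goal="estimate", model_unstable=False, compute_budget="medium"):
    """Condense the page's guidance into a (k, n_repeats) recommendation."""
    if n < 50:
        return n, 1                      # very small dataset: LOOCV
    if goal in ("tuning", "model_selection") or compute_budget == "low":
        return 5, 1                      # bias cancels in comparisons; keep it cheap
    if goal == "statistical_test":
        return 2, 5                      # 5x2cv-style setup for paired testing
    k = 10                               # default for absolute performance estimates
    repeats = 5 if (model_unstable or goal == "publication") else 1
    return k, repeats


for scenario in [dict(n=40), dict(n=800, goal="tuning"),
                 dict(n=800, goal="publication", model_unstable=True),
                 dict(n=5000, goal="statistical_test")]:
    k, r = choose_k(**scenario)
    print(f"{scenario}: k={k}, repeats={r}")
```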
Finally, avoid the common pitfalls that undermine k-fold cross-validation in practice; most of them come down to reproducibility and reporting.
Many published ML results are not reproducible partly because of inadequate CV practices. Single-run CV with unfixed random seeds, inappropriate k selection, and failure to report variance all contribute. Always set random states and report full results.
```python
import numpy as np
from sklearn.model_selection import (
    cross_val_score, StratifiedKFold, RepeatedStratifiedKFold
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification


def reproducible_cv_evaluation(X, y, model_factory, n_splits=10,
                               n_repeats=5, random_state=42):
    """
    Best practice CV evaluation with reproducibility and proper reporting.
    """
    # Always use stratified CV for classification
    if n_repeats > 1:
        cv = RepeatedStratifiedKFold(
            n_splits=n_splits, n_repeats=n_repeats, random_state=random_state
        )
        cv_name = f"Repeated Stratified {n_splits}-Fold (×{n_repeats})"
    else:
        cv = StratifiedKFold(
            n_splits=n_splits, shuffle=True, random_state=random_state
        )
        cv_name = f"Stratified {n_splits}-Fold"

    # Run CV
    scores = cross_val_score(model_factory(), X, y, cv=cv, scoring='accuracy')

    # Compute statistics
    mean_score = scores.mean()
    std_score = scores.std()
    se_score = std_score / np.sqrt(len(scores))  # Standard error

    # 95% confidence interval (approximate)
    ci_low = mean_score - 1.96 * se_score
    ci_high = mean_score + 1.96 * se_score

    # Report
    print(f"CV Strategy: {cv_name}")
    print(f"Random State: {random_state}")
    print(f"Total Evaluations: {len(scores)}")
    print(f"\nResults:")
    print(f"  Mean Accuracy: {mean_score:.4f}")
    print(f"  Standard Deviation: {std_score:.4f}")
    print(f"  Standard Error: {se_score:.4f}")
    print(f"  95% CI: [{ci_low:.4f}, {ci_high:.4f}]")
    print(f"\nAll scores: {', '.join(f'{s:.4f}' for s in scores[:10])}" +
          ("..." if len(scores) > 10 else ""))

    return {
        'mean': mean_score,
        'std': std_score,
        'se': se_score,
        'ci_95': (ci_low, ci_high),
        'all_scores': scores
    }


# Demo
np.random.seed(42)
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=10, random_state=42)

print("=" * 70)
print("Best Practice: Reproducible CV Evaluation")
print("=" * 70)

result = reproducible_cv_evaluation(
    X, y,
    lambda: RandomForestClassifier(n_estimators=100, random_state=42),
    n_splits=10, n_repeats=5, random_state=42
)

print("\n" + "=" * 70)
print("Reproducibility Check: Same parameters should give same results")
print("=" * 70)

result2 = reproducible_cv_evaluation(
    X, y,
    lambda: RandomForestClassifier(n_estimators=100, random_state=42),
    n_splits=10, n_repeats=5, random_state=42  # Same random state
)

print(f"\nResults match: {np.allclose(result['all_scores'], result2['all_scores'])}")
```

We've developed a comprehensive framework for selecting k. The key guidance: default to 5-fold for everyday work and model selection, use 10-fold (repeated if possible) for final or published results, reserve LOOCV for very small datasets, and adjust for model stability, computational budget, and your evaluation goal.
What's Next:
We now understand how to select k for standard k-fold cross-validation. The next page explores repeated cross-validation, which addresses the variance problem by running multiple CV iterations with different random partitions—providing more stable estimates without compromising on the bias-variance tradeoff.
You now have a complete decision framework for selecting k in cross-validation. You can confidently choose k based on sample size, computational constraints, model stability, and your specific goals—and defend that choice with theoretical and empirical reasoning.