The previous page established k-fold cross-validation as the standard method for estimating model performance. But a critical question remains: what value of k should we use?
This isn't merely a computational convenience question. The choice of k fundamentally affects the statistical properties of our estimate—its bias, its variance, and consequently, its reliability. Understanding this tradeoff transforms k from an arbitrary hyperparameter into a principled design choice.
This page develops a deep theoretical understanding of how k influences the quality of cross-validation estimates, preparing you to make informed decisions for your specific use cases.
By the end of this page, you will understand why larger k reduces bias but can increase variance, why the fold correlation problem complicates variance analysis, how pessimistic bias arises from training on less data, and when to prioritize low bias versus low variance in your estimates.
Before analyzing bias and variance, we must be precise about what quantity cross-validation estimates.
The Target Quantity:
Let $\text{Err}(n)$ denote the true generalization error of a model trained on $n$ samples. This is the expected prediction error on a new, unseen sample from the same distribution. If we could train our model on our full dataset of $n$ samples and then somehow measure performance on infinite new data, we'd get $\text{Err}(n)$.
What K-Fold Actually Measures:
In k-fold CV, each fold trains on approximately $n_{\text{train}} = n \cdot (k-1)/k$ samples. The CV estimate is thus estimating $\text{Err}(n_{\text{train}})$, not $\text{Err}(n)$.
This distinction is subtle but crucial:
| k | Training set size | Estimates |
|---|---|---|
| 2 | 50% of n | $\text{Err}(0.5n)$ |
| 5 | 80% of n | $\text{Err}(0.8n)$ |
| 10 | 90% of n | $\text{Err}(0.9n)$ |
| n (LOOCV) | n-1 ≈ n | $\text{Err}(n-1) \approx \text{Err}(n)$ |
Models generally improve with more training data—this is the learning curve. Since smaller k means training on less data, k-fold CV with small k estimates a worse performance than what the full-data model would achieve. This is the source of pessimistic bias.
The Bias Decomposition:
The bias of the CV estimate can be written as:
$$\text{Bias}[\text{CV}(k)] = E[\text{CV}(k)] - \text{Err}(n)$$
Since CV estimates $\text{Err}(n \cdot (k-1)/k)$ and learning curves are typically decreasing (more data → lower error), we have:
$$\text{Err}(n \cdot (k-1)/k) > \text{Err}(n)$$
Therefore, CV is typically pessimistically biased—it overestimates the error of the full-data model.
The bias magnitude depends on:
- How steep the learning curve still is at size $n$ (how much the model would gain from additional data)
- The sample size $n$ (in small datasets, the gap between $n \cdot (k-1)/k$ and $n$ samples matters more)
- The choice of k (smaller k withholds a larger fraction of the data from each training run)
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Generate data
np.random.seed(42)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           random_state=42)

# Compute learning curve
train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X, y, train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5, scoring='accuracy', random_state=42
)

# Plot learning curve
plt.figure(figsize=(10, 6))
plt.plot(train_sizes, train_scores.mean(axis=1), 'o-', label='Training score')
plt.plot(train_sizes, val_scores.mean(axis=1), 'o-', label='Validation score')
plt.fill_between(train_sizes,
                 val_scores.mean(axis=1) - val_scores.std(axis=1),
                 val_scores.mean(axis=1) + val_scores.std(axis=1),
                 alpha=0.2)

# Mark what different k values estimate
for k in [2, 5, 10]:
    train_fraction = (k-1)/k
    train_n = int(1000 * train_fraction)
    idx = np.argmin(np.abs(train_sizes - train_n))
    plt.axvline(train_n, color='red', linestyle='--', alpha=0.5)
    plt.annotate(f'k={k}\n({train_fraction:.0%})',
                 xy=(train_n, val_scores.mean(axis=1)[idx]),
                 xytext=(train_n-80, val_scores.mean(axis=1)[idx]+0.02))

plt.xlabel('Training Set Size')
plt.ylabel('Score')
plt.title('Learning Curve: What Different k Values Estimate')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()

# Demonstrate bias empirically
print("Empirical CV estimates for different k:")
for k in [2, 3, 5, 10, 20]:
    cv_score = cross_val_score(
        RandomForestClassifier(n_estimators=50, random_state=42),
        X, y, cv=k, scoring='accuracy'
    ).mean()
    train_fraction = (k-1)/k
    print(f"k={k:2d}: CV accuracy = {cv_score:.4f} "
          f"(trains on {train_fraction:.1%} of data)")
```

Let's develop a rigorous understanding of the bias in cross-validation estimates.
Pessimistic Bias Formalization:
Define the learning curve function $L(m)$ as the expected generalization error when training on $m$ samples. For most reasonable models and sufficient data:
$$L(m) = L(\infty) + \frac{\alpha}{m^\beta}$$
where:
- $L(\infty)$ is the asymptotic (irreducible) error the model would reach with unlimited training data
- $\alpha > 0$ sets the size of the finite-sample penalty
- $\beta > 0$ controls how quickly that penalty decays with more data (values around 0.5–1 are common in practice)
The bias of k-fold CV relative to full-data training is:
$$\text{Bias} = L\left(n \cdot \frac{k-1}{k}\right) - L(n)$$
Substituting the learning curve model:
$$\text{Bias} \approx \alpha \left[ \left(\frac{k}{k-1}\right)^\beta - 1 \right] \cdot \frac{1}{n^\beta}$$
The table below evaluates the bias multiplier $(k/(k-1))^\beta$ for two representative learning-curve exponents:

| k | Train Fraction | Multiplier (β = 0.5) | Multiplier (β = 1.0) | Relative Bias |
|---|---|---|---|---|
| 2 | 50% | 1.414 | 2.000 | Very High |
| 3 | 67% | 1.225 | 1.500 | High |
| 5 | 80% | 1.118 | 1.250 | Moderate |
| 10 | 90% | 1.054 | 1.111 | Low |
| 20 | 95% | 1.026 | 1.053 | Very Low |
| n (LOOCV) | ≈100% | ≈1.001 | ≈1.001 | Negligible |
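These multipliers can be reproduced directly from the formula; a short sketch (the β values are the same illustrative exponents as in the table) prints them:

```python
# Bias multiplier (k/(k-1))^beta from the learning-curve model above.
# The beta values are illustrative; real learning curves must be estimated.
for beta in (0.5, 1.0):
    print(f"beta = {beta}")
    for k in (2, 3, 5, 10, 20, 1000):  # k=1000 stands in for k = n on a ~1000-sample dataset
        multiplier = (k / (k - 1)) ** beta
        print(f"  k={k:4d}: multiplier = {multiplier:.3f}, "
              f"excess bias factor = {multiplier - 1:.3f}")
```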
Bias is most problematic when: (1) The learning curve is steep (model learns a lot from additional data), (2) Sample size is small (large gap between 80% and 100% of data), (3) You're comparing models with different learning rates (unfair comparison). For large n and mature models, bias differences between k=5 and k=10 are often negligible.
Bias Implications for Model Selection:
Critically, pessimistic bias is usually not a problem for model selection—choosing which model is best. Why? Because all models experience similar relative bias. If model A achieves 90% accuracy with k=5 CV and model B achieves 88%, both estimates are pessimistic, but model A is still likely better.
Bias becomes problematic when:
- You need an absolute performance number (e.g., reporting expected accuracy to stakeholders or comparing against an externally reported benchmark)
- The dataset is small and the learning curve is still steep, so training on a fraction of the data meaningfully understates full-data performance
- You compare models whose learning curves differ in shape, so the pessimism is not uniform across candidates
The Surprising Case of 2-Fold CV:
2-fold CV deserves special attention. With only 50% of data for training, bias can be substantial. However, 2-fold has a unique property: the two estimates are completely independent (non-overlapping training sets). This independence makes variance analysis simpler and can be valuable for certain statistical tests.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification


def estimate_true_error(X, y, X_pool, y_pool, model_class,
                        n_test=10000, n_trials=50):
    """
    Estimate true generalization error by training on all of (X, y)
    and evaluating on random subsets of a large held-out pool drawn
    from the same data-generating process.
    """
    # Train once on all available data
    model = model_class()
    model.fit(X, y)

    rng = np.random.default_rng(0)
    errors = []
    for _ in range(n_trials):
        # Evaluate on a fresh random subset of the held-out pool
        idx = rng.choice(len(X_pool), size=n_test, replace=False)
        errors.append(1 - model.score(X_pool[idx], y_pool[idx]))

    return np.mean(errors), np.std(errors)


def compute_cv_bias(X, y, X_pool, y_pool, model_class,
                    k_values=[2, 5, 10, 20]):
    """
    Compute CV estimates and compare them to the true error.
    """
    # Estimate true error (training on all n samples)
    true_error, true_std = estimate_true_error(X, y, X_pool, y_pool,
                                               model_class)
    print(f"True generalization error (trained on n={len(X)}): "
          f"{true_error:.4f} ± {true_std:.4f}")
    print()

    results = []
    for k in k_values:
        cv_scores = cross_val_score(
            model_class(), X, y, cv=k, scoring='accuracy'
        )
        cv_error = 1 - cv_scores.mean()
        cv_std = cv_scores.std()
        bias = cv_error - true_error
        train_fraction = (k-1)/k

        print(f"k={k:2d}: CV error = {cv_error:.4f} ± {cv_std:.4f}, "
              f"Bias = {bias:+.4f} ({train_fraction:.0%} train)")

        results.append({
            'k': k,
            'cv_error': cv_error,
            'cv_std': cv_std,
            'bias': bias,
            'train_fraction': train_fraction
        })

    return results


# Run simulation: draw one large dataset from a fixed data-generating
# process; the first 500 samples are the "available" data, the remaining
# 20,000 form a held-out pool for estimating the true error.
np.random.seed(42)
X_all, y_all = make_classification(n_samples=500 + 20000, n_features=20,
                                   n_informative=10, random_state=42)
X, y = X_all[:500], y_all[:500]
X_pool, y_pool = X_all[500:], y_all[500:]

print("=" * 60)
print("Bias Analysis: Logistic Regression")
print("=" * 60)
results = compute_cv_bias(X, y, X_pool, y_pool, LogisticRegression)
```

Variance in the CV estimate determines how much our estimate would change if we collected a different sample from the same distribution. High variance means unreliable estimates—the same model might appear excellent or mediocre depending on the random split.
Sources of Variance:
The variance of the k-fold CV estimate has multiple sources:
- Partition randomness: which samples happen to land in which fold
- Training-set variability: unstable models change noticeably when their training data changes
- Validation noise: each fold's error is measured on a finite number of held-out samples
- Sampling of the dataset itself: a different sample from the same population would yield different estimates
The k fold estimates are NOT independent. In 10-fold CV, any two training sets share 8 of the 10 folds: 80% of the full dataset, or roughly 89% of each training set. This positive correlation means the variance reduction from averaging is less than 1/k. Standard formulas that assume independence give overly optimistic confidence intervals.
Variance Decomposition:
For a single k-fold CV run, let $E_1, E_2, ..., E_k$ be the k fold error estimates. The CV estimate is $\bar{E} = \frac{1}{k}\sum_i E_i$.
If the fold estimates were independent with variance $\sigma^2$, we'd have $\text{Var}[\bar{E}] = \sigma^2/k$. But they're correlated with pairwise correlation $\rho > 0$:
$$\text{Var}[\bar{E}] = \frac{\sigma^2}{k} + \frac{k-1}{k}\rho\sigma^2 = \frac{\sigma^2}{k}(1 + (k-1)\rho)$$
For large k and substantial $\rho$, the second term dominates! If $\rho = 0.8$ and $k = 10$:
$$\text{Var}[\bar{E}] = \frac{\sigma^2}{10}(1 + 9 \times 0.8) = 0.82\sigma^2$$
The variance is barely reduced from a single estimate!
The table below shows the resulting relative variance $\text{Var}[\bar{E}]/\sigma^2 = \frac{1 + (k-1)\rho}{k}$ for several values of k and ρ:

| k | ρ = 0 (ideal) | ρ = 0.3 | ρ = 0.5 | ρ = 0.8 |
|---|---|---|---|---|
| 2 | 0.50 | 0.65 | 0.75 | 0.90 |
| 5 | 0.20 | 0.44 | 0.60 | 0.84 |
| 10 | 0.10 | 0.37 | 0.55 | 0.82 |
| 20 | 0.05 | 0.34 | 0.53 | 0.81 |
| ∞ (LOOCV) | →0 | 0.30 | 0.50 | 0.80 |
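These entries follow directly from the formula above; a few lines of Python reproduce them (the ρ values are illustrative):

```python
# Relative variance Var[E_bar] / sigma^2 = (1 + (k-1)*rho) / k
# for correlated fold estimates; the rho values here are illustrative.
for k in (2, 5, 10, 20, 1000):  # k=1000 approximates the k -> infinity limit
    row = [(1 + (k - 1) * rho) / k for rho in (0.0, 0.3, 0.5, 0.8)]
    print(f"k={k:4d}: " + "  ".join(f"{v:.2f}" for v in row))
```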
Critical Insight:
When correlation is high (ρ → 1), increasing k beyond a certain point provides diminishing returns for variance reduction. The correlated variance component $\frac{k-1}{k}\rho\sigma^2$ approaches $\rho\sigma^2$ as k → ∞.
This explains a counterintuitive phenomenon: LOOCV (k=n) can have higher variance than 10-fold CV despite using almost all data for training. The extreme training set overlap (n-1 shared samples between any two iterations) creates high correlation, which inflates variance.
What Determines ρ?
The correlation ρ between fold estimates depends on:
- Algorithm stability: unstable learners (e.g., deep trees) amplify the influence of the training samples shared across folds, pushing their error estimates to move together; for stable learners, most of each fold's error is independent validation noise
- Training-set overlap: larger k means any two training sets share a greater fraction of their samples
- Dataset size and noise: in small, noisy datasets, a few influential shared samples can dominate every fold's estimate
```python
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification


def analyze_cv_variance(X, y, model_class, k_values=[2, 5, 10, 20],
                        n_repeats=100, random_seed=42):
    """
    Analyze variance of CV estimates across different k values.

    We run CV many times with different random partitions to estimate
    the full variance of the CV procedure.
    """
    results = {}

    for k in k_values:
        cv_means = []
        fold_scores_all = []

        for rep in range(n_repeats):
            # Different random partition each time (shuffle with a new seed);
            # passing an integer cv would reuse the same deterministic split
            cv = StratifiedKFold(n_splits=k, shuffle=True,
                                 random_state=random_seed + rep)
            scores = cross_val_score(
                model_class(), X, y, cv=cv, scoring='accuracy'
            )
            cv_means.append(scores.mean())
            fold_scores_all.append(scores)

        cv_means = np.array(cv_means)
        fold_scores_all = np.array(fold_scores_all)  # (n_repeats, k)

        # Mean fold variance (within a single CV run)
        within_cv_var = np.mean([s.var() for s in fold_scores_all])

        # Variance of the CV estimate across repeated partitions.
        # Note: the pairwise correlation between fold estimates cannot be
        # read off a single CV run; it would require paired fold estimates
        # across many independently drawn datasets.
        cv_estimate_var = cv_means.var()

        results[k] = {
            'cv_estimate_mean': cv_means.mean(),
            'cv_estimate_std': cv_means.std(),
            'cv_estimate_var': cv_estimate_var,
            'within_cv_std': np.sqrt(within_cv_var),
            'coefficient_of_variation': cv_means.std() / cv_means.mean()
        }

        print(f"k={k:2d}: CV estimate = {cv_means.mean():.4f} ± {cv_means.std():.4f}")
        print(f"      Within-run fold std = {np.sqrt(within_cv_var):.4f}")
        print(f"      Coefficient of variation = "
              f"{results[k]['coefficient_of_variation']:.4f}")
        print()

    return results


# Compare models with different stability
np.random.seed(42)
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           random_state=42)

print("=" * 60)
print("Variance Analysis: Logistic Regression (stable model)")
print("=" * 60)
lr_results = analyze_cv_variance(X, y, LogisticRegression)

print("=" * 60)
print("Variance Analysis: Random Forest (less stable)")
print("=" * 60)
rf_results = analyze_cv_variance(
    X, y, lambda: RandomForestClassifier(n_estimators=50, random_state=None)
)

# Compare fold overlap
print("=" * 60)
print("Training Set Overlap Analysis")
print("=" * 60)
n = len(X)
for k in [2, 5, 10, 20]:
    train_size = n * (k-1) / k
    # Fraction of one training set shared with another training set
    overlap = (k-2) / (k-1) if k > 1 else 0
    print(f"k={k:2d}: Training size = {train_size:.0f} ({(k-1)/k:.0%}), "
          f"Pairwise training overlap = {overlap:.1%}")
```

We've established that smaller k leads to higher bias (training on less data) while larger k leads to higher variance (due to correlation). Let's synthesize these into a complete picture.
The Mean Squared Error Decomposition:
The quality of an estimator is often measured by Mean Squared Error (MSE):
$$\text{MSE} = \text{Bias}^2 + \text{Variance}$$
For k-fold CV:
- The bias term shrinks as k grows, because each training set approaches the full dataset size
- The variance term initially falls as more fold estimates are averaged, but the benefit fades as the correlation between heavily overlapping training sets takes over
The optimal k minimizes the sum.
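As an illustration only (the constants below are assumed, not estimated from any real learning curve), we can plug the learning-curve bias model and the correlated-variance formula into this decomposition and watch how the bias term collapses quickly while the variance term flattens, which is why gains beyond k ≈ 10 are usually small:

```python
import numpy as np

# Illustrative, assumed constants -- not fitted to any real learning curve
n, alpha, beta = 1000, 5.0, 0.7     # learning-curve parameters
sigma2, rho0 = 0.01, 0.4            # per-fold variance and a base correlation

print(" k   bias^2      variance    MSE")
for k in (2, 3, 5, 10, 20, 50):
    # Pessimistic bias from training on n*(k-1)/k samples
    bias = alpha * ((k / (k - 1)) ** beta - 1) / n ** beta
    # Assume correlation grows with training-set overlap (k-2)/(k-1)
    rho = rho0 * (k - 2) / (k - 1) if k > 2 else 0.0
    variance = sigma2 / k * (1 + (k - 1) * rho)
    print(f"{k:3d}  {bias**2:.2e}   {variance:.2e}   {bias**2 + variance:.2e}")
```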
The Empirical Reality:
Studies by Kohavi (1995), Hastie et al. (2009), and others have empirically examined this tradeoff:
For most practical purposes, k=5 or k=10 works well. The bias at k=5 (only 20% bias multiplier) is small enough for most comparisons, and the variance reduction from averaging 5+ estimates is substantial.
2-fold CV is generally too biased unless you specifically need independent estimates for statistical tests.
LOOCV is not always better despite minimal bias. Its high variance can make it less reliable than 10-fold for model selection.
The "right" k depends on:
The widespread default of k=10 represents a practical sweet spot: 90% of data for training keeps bias low, 10 estimates provide good variance reduction, and the computational cost (10× training) is usually acceptable. When in doubt, k=10 is a reasonable choice.
Leave-One-Out Cross-Validation (LOOCV) is the extreme case of k-fold where k=n. Each "fold" contains exactly one sample for validation, with the remaining n-1 samples for training. Let's analyze its unique properties.
LOOCV Properties:
| Property | LOOCV | 10-Fold CV |
|---|---|---|
| Training set size | n-1 | ≈0.9n |
| Number of iterations | n | 10 |
| Validation set size | 1 | ≈n/10 |
| Training set overlap (fraction of each training set) | (n-2)/(n-1) ≈ 100% | 8/9 ≈ 89% |
| Estimate determinism | Deterministic | Depends on partition |
Advantages of LOOCV:
Minimal bias: Training on n-1 samples makes the estimate nearly unbiased for true n-sample performance.
Deterministic: No randomness in the procedure—run it twice, get identical results.
Maximum data utilization: Uses the absolute maximum training data possible.
Disadvantages of LOOCV:
Computational cost: Requires n model fits. For large n or expensive models, this can be prohibitive.
High variance: Despite averaging n estimates, the extreme correlation between them can inflate variance beyond k-fold with smaller k.
Unstable for some models: For models with high instability (e.g., deep decision trees), the correlation-induced variance can be severe.
Single-sample validation noise: Each fold's estimate is based on just one sample—inherently noisy.
Intuitively, LOOCV should have the lowest variance (most estimates, most training data). In reality, the extreme training set overlap creates such high correlation that overall variance can exceed 10-fold CV. This is especially pronounced for high-variance models like unpruned decision trees.
```python
import numpy as np
import time
from sklearn.model_selection import LeaveOneOut, KFold, cross_val_score
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression

# Generate regression data
np.random.seed(42)
X, y = make_regression(n_samples=200, n_features=10, noise=10, random_state=42)


def compare_cv_strategies(X, y, model_class, model_name):
    """Compare LOOCV and k-fold CV.

    Uses mean squared error rather than R² because R² is undefined on the
    single-sample validation sets used by LOOCV.
    """
    print(f"\n{model_name}")
    print("-" * 40)

    # LOOCV
    loo = LeaveOneOut()
    start = time.time()
    loo_scores = -cross_val_score(model_class(), X, y, cv=loo,
                                  scoring='neg_mean_squared_error')
    loo_time = time.time() - start
    print(f"LOOCV (k={len(X)}): MSE = {loo_scores.mean():.2f} ± {loo_scores.std():.2f}")
    print(f"  Time: {loo_time:.2f}s, Iterations: {len(loo_scores)}")

    # Various k-fold
    for k in [5, 10, 20, 50]:
        kfold = KFold(n_splits=k, shuffle=True, random_state=42)
        start = time.time()
        kf_scores = -cross_val_score(model_class(), X, y, cv=kfold,
                                     scoring='neg_mean_squared_error')
        kf_time = time.time() - start
        print(f"{k:2d}-Fold: MSE = {kf_scores.mean():.2f} ± {kf_scores.std():.2f}")
        print(f"  Time: {kf_time:.2f}s, Iterations: {k}")

    return loo_scores, loo_time


# Compare stable vs unstable models
print("=" * 60)
print("LOOCV vs K-Fold Comparison")
print("=" * 60)

# Stable model (Ridge regression)
compare_cv_strategies(X, y, lambda: Ridge(alpha=1.0),
                      "Ridge Regression (Stable)")

# Unstable model (Deep decision tree)
compare_cv_strategies(X, y,
                      lambda: DecisionTreeRegressor(max_depth=None,
                                                    random_state=None),
                      "Decision Tree (Unstable)")

# Demonstrate variance across different random partitions
print("\n" + "=" * 60)
print("Variance Across Random Partitions (10-fold)")
print("=" * 60)

cv_means = []
for seed in range(50):
    kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
    scores = cross_val_score(
        DecisionTreeRegressor(max_depth=3), X, y, cv=kfold, scoring='r2'
    )
    cv_means.append(scores.mean())

print(f"Mean of CV estimates: {np.mean(cv_means):.4f}")
print(f"Std of CV estimates:  {np.std(cv_means):.4f}")
print(f"Range: [{np.min(cv_means):.4f}, {np.max(cv_means):.4f}]")
print("\nNote: LOOCV would give the same estimate every time (deterministic)")
```

When to Use LOOCV:
Very small samples (n < 50): When every data point counts for training quality.
Computational shortcuts exist: For linear models, LOOCV can be computed from a single fit using the hat-matrix (leverage) identity, which follows from the Sherman-Morrison formula, at essentially the cost of a single fit (see the shortcut below).
Determinism required: When you need identical results across runs for reproducibility.
Low-variance models: Stable models (high bias) suffer less from the correlation problem.
The LOOCV/GCV Shortcut for Linear Models:
For linear regression and related linear smoothers, the LOOCV error can be computed exactly from a single model fit:
$$\text{LOOCV} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - h_{ii}} \right)^2$$
where $\hat{y}_i$ is the fitted value from the full-data model and $h_{ii}$ is the $i$-th diagonal element of the hat matrix $H = X(X^TX)^{-1}X^T$. Generalized Cross-Validation (GCV) is the closely related approximation that replaces each $h_{ii}$ with the average leverage $\text{tr}(H)/n$. Either way, only one model fit is required!
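As a sanity check, here is a minimal sketch (plain NumPy plus scikit-learn's `LinearRegression`, on synthetic data generated for this example) comparing the hat-matrix shortcut, an explicit leave-one-out loop, and the GCV approximation for ordinary least squares:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n)

# --- Shortcut: one fit plus the hat-matrix diagonal ---
model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Augment with an intercept column so H matches the fitted model
Xd = np.hstack([np.ones((n, 1)), X])
H = Xd @ np.linalg.solve(Xd.T @ Xd, Xd.T)
h = np.diag(H)
loocv_shortcut = np.mean((residuals / (1 - h)) ** 2)

# --- Explicit LOOCV: n separate fits ---
errors = []
for train_idx, test_idx in LeaveOneOut().split(X):
    m = LinearRegression().fit(X[train_idx], y[train_idx])
    errors.append((y[test_idx][0] - m.predict(X[test_idx])[0]) ** 2)
loocv_explicit = np.mean(errors)

print(f"LOOCV via hat-matrix shortcut: {loocv_shortcut:.6f}")
print(f"LOOCV via explicit n fits:     {loocv_explicit:.6f}")

# GCV replaces each leverage h_ii with the average leverage tr(H)/n
gcv = np.mean(residuals ** 2) / (1 - h.mean()) ** 2
print(f"GCV approximation:             {gcv:.6f}")
```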
Statistical learning theory provides rigorous bounds on the behavior of cross-validation estimates. Understanding these bounds informs our confidence in CV results.
The Fundamental Question:
Can we guarantee that the CV estimate is close to the true generalization error? More formally, can we bound:
$$P\left( |\text{CV}(k) - \text{Err}(n)| > \epsilon \right)$$
The answer is nuanced and depends on properties of both the model class and the data distribution.
For the nearest neighbor classifier, LOOCV provides an asymptotically unbiased estimate of the expected error. Specifically, E[LOOCV] → E[Err(n)] as n → ∞. Similar results hold for other bounded-complexity model classes.
Stability-Based Analysis (Kearns & Ron, 1999):
A crucial theoretical framework connects CV reliability to algorithm stability. An algorithm is stable if small changes in the training data produce small changes in predictions.
Definition (Uniform Stability): An algorithm A is β-uniformly stable if for all datasets D and D' differing in one sample:
$$\sup_z |\ell(A(D), z) - \ell(A(D'), z)| \leq \beta$$
where $\ell$ is the loss function.
Theorem (Stability → CV Reliability): If A is β-uniformly stable, then:
$$|E[\text{LOOCV}] - E[\text{Err}(n)]| \leq \beta$$
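A rough way to see stability empirically (a heuristic sketch, not the formal supremum in the definition; the helper `empirical_instability` below is our own construction) is to replace one training sample and measure how far the predictions move. A regularized linear model should move much less than an unpruned tree:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
X_eval = rng.normal(size=(500, 10))  # fixed points z at which to compare predictions


def empirical_instability(make_model, X, y, X_eval, n_perturbations=20):
    """Largest prediction change on a fixed evaluation grid when one
    training sample is replaced -- a crude stand-in for the uniform
    stability constant beta."""
    base = make_model().fit(X, y).predict(X_eval)
    worst = 0.0
    for _ in range(n_perturbations):
        i, j = rng.choice(len(X), size=2, replace=False)
        X2, y2 = X.copy(), y.copy()
        X2[i], y2[i] = X[j], y[j]          # D' differs from D in one sample
        perturbed = make_model().fit(X2, y2).predict(X_eval)
        worst = max(worst, np.max(np.abs(perturbed - base)))
    return worst


print("Ridge (alpha=1.0):",
      f"{empirical_instability(lambda: Ridge(alpha=1.0), X, y, X_eval):.3f}")
print("Unpruned tree:    ",
      f"{empirical_instability(lambda: DecisionTreeRegressor(random_state=0), X, y, X_eval):.3f}")
```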
Implications:
| Algorithm | Stability | CV Reliability | Notes |
|---|---|---|---|
| Ridge Regression | High (β = O(1/λn)) | Excellent | Stability improves with regularization |
| Lasso | Moderate | Good | Depends on regularization strength |
| SVM (RBF kernel) | Moderate-High | Good | Soft margin provides stability |
| Decision Tree (unpruned) | Low | Poor | Small data changes → different split points |
| k-NN (k=1) | Low | Poor | Sensitive to individual points |
| k-NN (k > 1) | Moderate | Reasonable | Averaging provides some stability |
| Random Forest | Moderate-High | Good | Ensemble averaging stabilizes |
| Gradient Boosting | Moderate | Good | Early stopping crucial for stability |
Variance Bounds (Blum et al., 1999):
For bounded loss functions $\ell \in [0, 1]$, the variance of LOOCV can be bounded:
$$\text{Var}[\text{LOOCV}] \leq \frac{1}{n} + \frac{4}{n^2} \sum_{i<j} \text{Cov}(L_i, L_j)$$
where $L_i$ is the loss on sample i. The covariance terms capture the correlation effect we discussed earlier. For stable algorithms, these covariances are small, keeping variance under control.
Practical Implications:
Trust CV more for stable algorithms: The theoretical guarantees are stronger.
Regularization helps CV reliability: By increasing stability.
Cross-validate the regularization: But watch out for model selection bias (covered later).
Large samples help: Both bias and variance decrease with n.
Theory provides principles; empirical studies calibrate practical recommendations. Let's synthesize decades of research and practice into actionable guidelines.
Decision Framework:
Use this framework to select k for your specific situation:
```python
def recommend_cv_strategy(n_samples, n_features, model_type, goal,
                          compute_budget='medium'):
    """
    Recommend cross-validation strategy based on dataset and goals.

    Parameters
    ----------
    n_samples : int
        Number of training samples
    n_features : int
        Number of features
    model_type : str
        'linear', 'tree', 'ensemble', 'neural_net', 'stable', 'unstable'
    goal : str
        'model_selection', 'performance_estimation', 'feature_selection'
    compute_budget : str
        'low', 'medium', 'high'

    Returns
    -------
    dict with recommended strategy
    """
    recommendation = {
        'k': None,
        'n_repeats': 1,
        'reasoning': []
    }

    # Base k selection by sample size
    if n_samples < 50:
        recommendation['k'] = n_samples  # LOOCV
        recommendation['reasoning'].append(
            f"Very small dataset (n={n_samples}): Use LOOCV to maximize training data"
        )
    elif n_samples < 200:
        recommendation['k'] = 10
        recommendation['reasoning'].append(
            f"Small dataset (n={n_samples}): Use 10-fold to balance bias/variance"
        )
    elif n_samples < 5000:
        recommendation['k'] = 5 if compute_budget == 'low' else 10
        recommendation['reasoning'].append(
            f"Medium dataset (n={n_samples}): k=5 or 10 both reasonable"
        )
    else:
        recommendation['k'] = 5
        recommendation['reasoning'].append(
            f"Large dataset (n={n_samples}): k=5 sufficient, bias minimal"
        )

    # Adjust for model stability
    unstable_models = ['tree', 'neural_net', 'unstable']
    if model_type in unstable_models:
        recommendation['n_repeats'] = max(3, recommendation['n_repeats'])
        recommendation['reasoning'].append(
            f"Unstable model ({model_type}): Use repeated CV to reduce variance"
        )

    # Adjust for goal
    if goal == 'performance_estimation':
        if recommendation['k'] < 10:
            recommendation['k'] = min(10, n_samples // 10)
            recommendation['reasoning'].append(
                "Performance estimation: Prefer larger k to reduce pessimistic bias"
            )
    elif goal == 'feature_selection':
        recommendation['n_repeats'] = max(5, recommendation['n_repeats'])
        recommendation['reasoning'].append(
            "Feature selection: Multiple repeats for stable importance rankings"
        )

    # Compute budget constraints
    if compute_budget == 'low':
        recommendation['k'] = min(recommendation['k'], 5)
        recommendation['n_repeats'] = 1
        recommendation['reasoning'].append(
            "Low compute budget: Limiting to k=5, single repeat"
        )
    elif compute_budget == 'high':
        recommendation['n_repeats'] = max(5, recommendation['n_repeats'])
        recommendation['reasoning'].append(
            "High compute budget: Using multiple repeats for robustness"
        )

    return recommendation


# Example usage
scenarios = [
    {'n_samples': 50, 'n_features': 10, 'model_type': 'linear',
     'goal': 'model_selection'},
    {'n_samples': 500, 'n_features': 50, 'model_type': 'ensemble',
     'goal': 'performance_estimation'},
    {'n_samples': 200, 'n_features': 100, 'model_type': 'tree',
     'goal': 'feature_selection'},
    {'n_samples': 10000, 'n_features': 20, 'model_type': 'neural_net',
     'goal': 'model_selection', 'compute_budget': 'low'},
]

for scenario in scenarios:
    print("Scenario:", scenario)
    rec = recommend_cv_strategy(**scenario)
    print(f"  Recommended: {rec['k']}-fold with {rec['n_repeats']} repeats")
    for reason in rec['reasoning']:
        print(f"    - {reason}")
    print()
```

If you're unsure which k to use: (1) For quick exploratory work: 5-fold, (2) For publication-quality results: 10-fold with 5-10 repeats, (3) For rigorous statistical testing: 5x2 CV (5 repetitions of 2-fold) or repeated 10-fold with proper variance estimation.
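As a concrete way to run the schemes mentioned in the tip above, here is a brief sketch using scikit-learn's `RepeatedStratifiedKFold` (the dataset is synthetic and the settings are illustrative):

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           random_state=42)

# 5x2 CV: five repetitions of 2-fold (10 scores, two independent halves per repeat)
cv_5x2 = RepeatedStratifiedKFold(n_splits=2, n_repeats=5, random_state=0)
scores_5x2 = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                             cv=cv_5x2, scoring='accuracy')
print(f"5x2 CV:           {scores_5x2.mean():.4f} ± {scores_5x2.std():.4f} "
      f"({len(scores_5x2)} fits)")

# Repeated 10-fold: five repetitions of 10-fold (50 scores)
cv_r10 = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
scores_r10 = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                             cv=cv_r10, scoring='accuracy')
print(f"Repeated 10-fold: {scores_r10.mean():.4f} ± {scores_r10.std():.4f} "
      f"({len(scores_r10)} fits)")
```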
We've developed a rigorous understanding of how k affects cross-validation quality. Here are the essential insights:
- K-fold CV estimates $\text{Err}(n \cdot (k-1)/k)$ rather than $\text{Err}(n)$, so it is pessimistically biased; the bias shrinks as k grows
- The k fold estimates are positively correlated because their training sets overlap, so averaging reduces variance by less than 1/k, and very large k (including LOOCV) can end up with surprisingly high variance
- LOOCV is nearly unbiased and deterministic, and for linear models it costs no more than a single fit, but in general it is expensive and not always lower-variance than 10-fold CV
- Algorithm stability governs how much to trust CV estimates: regularized, stable learners enjoy stronger theoretical guarantees than unstable ones
- k = 5 or k = 10 is a sound default; favor larger k (or LOOCV) when low-bias absolute estimates matter, and repeated CV when variance is the main concern
What's Next:
Now that we understand how the choice of k affects CV quality, the next page addresses practical guidance for choosing k in specific situations—considering sample size, model complexity, computational constraints, and use case requirements.
You now have a deep understanding of the bias-variance tradeoff in cross-validation—the fundamental tension between training on more data (lower bias) and maintaining estimate independence (lower variance). This understanding enables principled selection of k for your specific problems.