In machine learning, the difference between a model that appears to work and one that actually generalizes to unseen data often comes down to how rigorously we evaluate it. K-fold cross-validation stands as the most widely adopted technique for obtaining reliable performance estimates—a method so fundamental that it's the default evaluation strategy in virtually every machine learning library, research paper, and production ML pipeline.
But k-fold is more than just a recipe to follow. Understanding its mechanics deeply reveals core principles about statistical estimation, computational tradeoffs, and the nature of generalization itself. This page provides a comprehensive exploration of the k-fold procedure, establishing the foundation upon which more advanced validation techniques are built.
By the end of this page, you will understand the complete k-fold cross-validation procedure from first principles. You'll know exactly how data is partitioned, why each sample appears in both training and validation roles, how fold predictions combine into a single estimate, and what guarantees this provides about generalization performance.
Before diving into mechanics, let's crystallize the problem cross-validation solves.
The Fundamental Challenge:
We want to estimate how well our trained model will perform on future, unseen data. This quantity—generalization performance—is the true measure of a model's value. Yet by definition, we cannot directly observe performance on data we haven't seen.
A naive approach trains on all available data and evaluates on that same data. This yields training error, which is systematically optimistic because the model has seen these exact examples. The gap between training error and true generalization error can be substantial, especially for flexible models that can memorize training data.
A decision tree grown to maximum depth achieves 0% training error on essentially any classification task (provided no identical inputs carry conflicting labels): it simply memorizes every training example. Yet its generalization error might be 40% or worse. Training error tells us almost nothing about real-world performance.
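A minimal sketch of this gap, using an assumed synthetic dataset and scikit-learn's DecisionTreeClassifier: training accuracy is near perfect while the cross-validated estimate is far lower. Exact numbers depend on the data; the pattern is the point.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data, so memorization clearly fails to generalize
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)

tree = DecisionTreeClassifier(random_state=0)  # grows to full depth by default
tree.fit(X, y)

train_acc = tree.score(X, y)                       # evaluated on the same data it memorized
cv_acc = cross_val_score(tree, X, y, cv=5).mean()  # evaluated on held-out folds

print(f"Training accuracy:  {train_acc:.3f}")  # typically ~1.0
print(f"5-fold CV accuracy: {cv_acc:.3f}")     # substantially lower
```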
The Holdout Solution and Its Limitations:
The simplest fix is the holdout method: reserve some data (say, 20%) that the model never sees during training, then measure performance on this held-out set. This estimate is unbiased—the model truly hasn't seen this data.
But holdout has critical weaknesses:
High variance: With only 20% of data for testing, measurements are noisy. A particularly easy or hard test set can dramatically skew results.
Data waste: We train on only 80% of our data. With limited samples, this means worse models than we could otherwise achieve.
Single estimate: We get one measurement. We have no sense of its reliability or confidence.
Split sensitivity: Different random splits yield different estimates. Which split's result should we trust?
K-fold cross-validation addresses all these limitations simultaneously.
| Method | Training Data Used | Estimate Variance | Computational Cost | Result Stability |
|---|---|---|---|---|
| Training Error | 100% | Very Low (but biased) | 1× training | Stable but useless |
| Single Holdout (80/20) | 80% | High | 1× training | Split-dependent |
| 5-Fold CV | 80% (average) | Moderate | 5× training | Much more stable |
| 10-Fold CV | 90% (average) | Lower | 10× training | Highly stable |
| Leave-One-Out CV | ≈100% | Can be high (but lowest bias) | n× training | Perfectly deterministic |
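To see the "split sensitivity" and variance rows of this table concretely, the sketch below (an illustration with an assumed logistic regression model and synthetic data) repeats a single 80/20 holdout under different random splits and compares the spread of those estimates with the spread of repeated 5-fold CV means.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)
model = LogisticRegression(max_iter=1000)

# Single 80/20 holdout, repeated with different random splits
holdout_scores = []
for seed in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              random_state=seed)
    holdout_scores.append(model.fit(X_tr, y_tr).score(X_te, y_te))

# 5-fold CV means, repeated with different random fold partitions
cv_means = []
for seed in range(20):
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    cv_means.append(cross_val_score(model, X, y, cv=kf).mean())

print(f"Holdout estimates: std = {np.std(holdout_scores):.4f}")
print(f"5-fold CV means:   std = {np.std(cv_means):.4f}")  # typically smaller
```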
K-fold cross-validation is a systematic procedure that partitions the data into k approximately equal-sized subsets (folds), then iteratively uses each fold as a validation set while training on the remaining k-1 folds. This elegant design ensures that every sample serves in the validation role exactly once and in the training role exactly k-1 times, and that every evaluation is made on data the corresponding model never saw during training.
Let's formalize this procedure precisely.
Given a dataset D = {(x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)} with n samples and a learning algorithm A that produces a model h = A(D') when trained on any subset D' of the data, we define k-fold cross-validation as follows:
Step-by-Step Walkthrough:
Step 1: Partition the Data
The first step is dividing the dataset into k approximately equal-sized folds. If n is divisible by k, each fold contains exactly n/k samples. Otherwise, some folds receive ⌊n/k⌋ samples and others receive ⌈n/k⌉ samples.
The partition is typically random—samples are shuffled before assignment to folds. This randomization prevents any systematic bias in fold composition (e.g., if data arrived in temporal or sorted order).
```python
import numpy as np
from typing import List, Tuple


def create_k_folds(n_samples: int, k: int, shuffle: bool = True,
                   random_state: int = None) -> List[np.ndarray]:
    """
    Create k-fold indices for cross-validation.

    Parameters:
    -----------
    n_samples : int
        Total number of samples in the dataset
    k : int
        Number of folds to create
    shuffle : bool
        Whether to shuffle indices before partitioning
    random_state : int
        Random seed for reproducibility

    Returns:
    --------
    List of k numpy arrays, each containing the indices for one fold
    """
    # Create array of all indices
    indices = np.arange(n_samples)

    # Shuffle if requested (almost always should be True)
    if shuffle:
        rng = np.random.default_rng(random_state)
        rng.shuffle(indices)

    # Calculate fold sizes
    # If n=100 and k=3: fold_sizes = [34, 33, 33]
    fold_sizes = np.full(k, n_samples // k, dtype=int)
    fold_sizes[:n_samples % k] += 1  # Distribute remainder

    # Split indices into folds
    folds = []
    current_idx = 0
    for fold_size in fold_sizes:
        folds.append(indices[current_idx:current_idx + fold_size])
        current_idx += fold_size

    return folds


# Example: 10 samples into 3 folds
folds = create_k_folds(10, 3, shuffle=True, random_state=42)
for i, fold in enumerate(folds):
    print(f"Fold {i+1}: {fold}")
# Output (example):
# Fold 1: [8 1 5 7]
# Fold 2: [9 3 0]
# Fold 3: [2 4 6]
```

Step 2: Iterative Training and Validation
For each fold i, we construct a training set by combining all folds except fold i, train our model on this training set, and evaluate on fold i. This loop executes k times, producing k separate performance measurements.
```python
def k_fold_cross_validation(X, y, model_fn, k=5, metric_fn=None, random_state=42):
    """
    Perform k-fold cross-validation.

    Parameters:
    -----------
    X : array-like of shape (n_samples, n_features)
        Feature matrix
    y : array-like of shape (n_samples,)
        Target vector
    model_fn : callable
        Function that returns a fresh, unfitted model instance
    k : int
        Number of folds
    metric_fn : callable
        Function(y_true, y_pred) -> score. Default: accuracy

    Returns:
    --------
    scores : list of float
        Performance score for each fold
    predictions : array
        Out-of-fold predictions for all samples
    """
    import numpy as np
    from sklearn.metrics import accuracy_score

    if metric_fn is None:
        metric_fn = accuracy_score

    n_samples = len(X)
    folds = create_k_folds(n_samples, k, shuffle=True, random_state=random_state)

    # Store results
    scores = []
    predictions = np.empty(n_samples)
    predictions[:] = np.nan  # Initialize with NaN

    for fold_idx in range(k):
        # Identify validation indices (current fold)
        val_indices = folds[fold_idx]

        # Training indices = all other folds combined
        train_indices = np.concatenate([
            folds[i] for i in range(k) if i != fold_idx
        ])

        # Split data
        X_train, X_val = X[train_indices], X[val_indices]
        y_train, y_val = y[train_indices], y[val_indices]

        # Train model (get fresh instance each iteration!)
        model = model_fn()
        model.fit(X_train, y_train)

        # Predict on validation fold
        y_pred = model.predict(X_val)

        # Store predictions at original indices
        predictions[val_indices] = y_pred

        # Compute and store fold score
        fold_score = metric_fn(y_val, y_pred)
        scores.append(fold_score)

        print(f"Fold {fold_idx + 1}/{k}: "
              f"Train size = {len(train_indices)}, "
              f"Val size = {len(val_indices)}, "
              f"Score = {fold_score:.4f}")

    return scores, predictions


# Example usage
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Generate synthetic data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Define model factory
def model_fn():
    return RandomForestClassifier(n_estimators=100, random_state=42)

# Run 5-fold CV
scores, predictions = k_fold_cross_validation(
    X, y, model_fn, k=5, random_state=42
)

print(f"\nMean CV Score: {np.mean(scores):.4f} ± {np.std(scores):.4f}")
```

Step 3: Aggregate the Results
The final cross-validation estimate is typically the mean of all fold scores:
$$\text{CV}(k) = \frac{1}{k} \sum_{i=1}^{k} E_i$$

where E_i denotes the performance measured on fold i by the model trained on the other k-1 folds.
But aggregation can take other forms depending on the metric and use case.
For some metrics like AUC-ROC, there's a subtle difference between (1) computing AUC on each fold and averaging, and (2) pooling all out-of-fold predictions together and computing one AUC. Pooled computation is often preferred because it uses all threshold information simultaneously. For accuracy, mean and pooled approaches are equivalent.
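This difference is easy to check directly: compute the AUC within each fold and average, then compute one AUC on the pooled out-of-fold probabilities. A minimal sketch, assuming a random forest on synthetic data; the two numbers are typically close but not identical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# (1) AUC computed within each fold, then averaged
per_fold_auc = cross_val_score(model, X, y, cv=kfold, scoring='roc_auc')

# (2) All out-of-fold probabilities pooled, then one AUC computed
oof_proba = cross_val_predict(model, X, y, cv=kfold,
                              method='predict_proba')[:, 1]
pooled_auc = roc_auc_score(y, oof_proba)

print(f"Mean of per-fold AUCs: {per_fold_auc.mean():.4f}")
print(f"Pooled OOF AUC:        {pooled_auc:.4f}")
```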
A concrete layout helps solidify understanding of how data flows through the k-fold procedure. Consider a dataset with 15 samples organized into k=5 folds; the table below shows which samples serve for training and validation in each iteration.
Key Observations:
Each fold rotates through the validation role: Fold 1 is validation in iteration 1, fold 2 in iteration 2, etc.
Training sets overlap significantly: The training set for iteration 1 shares 9 samples (folds 3,4,5) with the training set for iteration 2. This overlap means the k models are not independent—an important consideration for variance estimation.
Every sample gets exactly one prediction: Sample 1 is predicted only when it's in the validation set (iteration 1). This ensures the final CV predictions are truly out-of-sample.
Model count equals k: We train k separate models. For k=10 on a large dataset, this can be computationally expensive.
| Iteration | Training Samples | Validation Samples | Training Size | Val Size |
|---|---|---|---|---|
| 1 | 4,5,6,7,8,9,10,11,12,13,14,15 | 1,2,3 | 12 | 3 |
| 2 | 1,2,3,7,8,9,10,11,12,13,14,15 | 4,5,6 | 12 | 3 |
| 3 | 1,2,3,4,5,6,10,11,12,13,14,15 | 7,8,9 | 12 | 3 |
| 4 | 1,2,3,4,5,6,7,8,9,13,14,15 | 10,11,12 | 12 | 3 |
| 5 | 1,2,3,4,5,6,7,8,9,10,11,12 | 13,14,15 | 12 | 3 |
In k-fold CV, each training set contains (k-1)/k of the total data. For 5-fold: 80%. For 10-fold: 90%. This means 10-fold trains on data more similar in size to the full dataset, potentially producing better estimates of full-data performance.
Understanding the statistical properties of k-fold CV helps us use it appropriately and interpret results correctly.
What K-Fold CV Estimates:
K-fold CV estimates the expected performance of a model trained on (k-1)/k × n samples. This is a subtle but important point: if k=5, we're estimating performance of a model trained on 80% of our data, not 100%.
Let T denote the true generalization error of a model trained on all n samples. The expected CV estimate differs from T:
$$E[\text{CV}(k)] \approx T + \text{bias}(k, n)$$
The bias arises because models trained on less data generally perform worse. This bias is typically positive (CV overestimates error) but small for large k.
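One way to see this bias is with a learning curve: cross-validated performance usually keeps improving as the training fraction approaches the full dataset, so an estimate based on (k-1)/k of the data mildly understates full-data performance. A sketch using scikit-learn's learning_curve on an assumed synthetic dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           random_state=0)

# Cross-validated score as a function of training-set size
train_sizes, _, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5
)

for size, scores in zip(train_sizes, test_scores):
    print(f"train size {size:4d}: CV score = {scores.mean():.4f}")
# Scores usually still rise near the largest sizes, which is why CV(k)
# mildly underestimates the performance of a model trained on all n samples.
```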
Variance of the CV Estimate:
The variance of the k-fold CV estimate is more complex than simply σ²/k because the k estimates are correlated. A useful decomposition:
$$\text{Var}[\text{CV}(k)] = \frac{\sigma^2}{k} + \frac{k-1}{k} \cdot \rho \cdot \sigma^2$$
where σ² is the variance of a single fold's estimate and ρ is the correlation between the estimates from different folds.
Because ρ > 0 (training sets overlap), the variance reduction from k-fold is less than we'd get from k truly independent estimates. This is why repeated cross-validation (multiple runs with different partitions) is valuable—it introduces additional independent variation.
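Repeated cross-validation is available out of the box in scikit-learn via RepeatedKFold. A minimal sketch, with the model and data chosen purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# 5-fold CV repeated 3 times with different random partitions -> 15 fold scores
rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X, y, cv=rkf
)

print(f"{len(scores)} fold scores collected")
print(f"Repeated 5-fold CV: {scores.mean():.4f} ± {scores.std():.4f}")
```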
In 5-fold CV, each pair of training sets shares 60% of samples (3 out of 5 folds). In 10-fold CV, this rises to 80% (8 out of 10 folds). Higher k means more training data per fold but also higher correlation between estimates. This tradeoff underlies the choice of k.
Bias-Variance Tradeoff in K Selection:
The choice of k involves a bias-variance tradeoff:
| Small k (e.g., 2-3) | Large k (e.g., 10+) |
|---|---|
| Higher bias: trains on less data | Lower bias: trains on more data |
| Lower variance: less correlation between folds | Higher variance (somewhat): more correlation |
| Computationally cheap | Computationally expensive |
| Less stable estimates | More stable estimates |
The extreme case is Leave-One-Out CV (k=n), which minimizes bias but maximizes correlation. We'll explore this tradeoff deeply in later sections.
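A small experiment makes the tradeoff tangible: run CV at several values of k, up to leave-one-out, on the same data and compare the estimates, the number of models trained, and the runtime. A sketch under the assumption of a small synthetic dataset so that LOOCV stays affordable:

```python
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)
model = LogisticRegression(max_iter=1000)

for name, cv in [("2-fold", KFold(2, shuffle=True, random_state=0)),
                 ("5-fold", KFold(5, shuffle=True, random_state=0)),
                 ("10-fold", KFold(10, shuffle=True, random_state=0)),
                 ("LOO", LeaveOneOut())]:
    start = time.time()
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"{name:>7}: mean={scores.mean():.4f}, "
          f"models trained={len(scores)}, time={time.time() - start:.2f}s")
```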
A robust k-fold CV implementation must handle numerous practical considerations. Let's build a production-quality version step by step.
```python
import numpy as np
from typing import List, Dict, Callable, Any, Optional, Tuple
from dataclasses import dataclass
import time


@dataclass
class FoldResult:
    """Results from a single fold."""
    fold_idx: int
    train_size: int
    val_size: int
    train_score: float
    val_score: float
    predictions: np.ndarray
    train_time: float
    predict_time: float


@dataclass
class CVResult:
    """Aggregated cross-validation results."""
    fold_results: List[FoldResult]
    mean_train_score: float
    mean_val_score: float
    std_val_score: float
    all_predictions: np.ndarray
    total_time: float

    def summary(self) -> str:
        return (f"CV Results: {self.mean_val_score:.4f} "
                f"± {self.std_val_score:.4f} "
                f"(train: {self.mean_train_score:.4f})")


class RobustKFold:
    """
    Production-quality k-fold cross-validation implementation.

    Features:
    - Reproducible random partitioning
    - Automatic handling of uneven fold sizes
    - Training and validation score tracking
    - Timing information for performance profiling
    - Out-of-fold prediction collection
    - Support for sample weights
    - Proper handling of edge cases
    """

    def __init__(self, k: int = 5, shuffle: bool = True,
                 random_state: Optional[int] = None):
        self.k = k
        self.shuffle = shuffle
        self.random_state = random_state
        self._validate_params()

    def _validate_params(self):
        if self.k < 2:
            raise ValueError(f"k must be >= 2, got {self.k}")
        if not isinstance(self.k, int):
            raise TypeError(f"k must be integer, got {type(self.k)}")

    def split(self, n_samples: int) -> List[Tuple[np.ndarray, np.ndarray]]:
        """
        Generate train/validation index pairs for each fold.

        Returns list of (train_indices, val_indices) tuples.
        """
        if n_samples < self.k:
            raise ValueError(
                f"Cannot create {self.k} folds with only {n_samples} samples"
            )

        indices = np.arange(n_samples)
        if self.shuffle:
            rng = np.random.default_rng(self.random_state)
            rng.shuffle(indices)

        # Calculate fold boundaries
        fold_sizes = np.full(self.k, n_samples // self.k, dtype=int)
        fold_sizes[:n_samples % self.k] += 1

        splits = []
        current = 0
        for fold_size in fold_sizes:
            val_indices = indices[current:current + fold_size]
            train_indices = np.concatenate([
                indices[:current],
                indices[current + fold_size:]
            ])
            splits.append((train_indices, val_indices))
            current += fold_size

        return splits

    def cross_validate(
        self,
        X: np.ndarray,
        y: np.ndarray,
        model_factory: Callable[[], Any],
        scorer: Callable[[np.ndarray, np.ndarray], float],
        sample_weight: Optional[np.ndarray] = None,
        return_predictions: bool = True,
        verbose: bool = True
    ) -> CVResult:
        """
        Perform k-fold cross-validation.

        Parameters:
        -----------
        X : array of shape (n_samples, n_features)
        y : array of shape (n_samples,)
        model_factory : callable returning fresh model instance
        scorer : callable(y_true, y_pred) -> score
        sample_weight : optional weights for training
        return_predictions : whether to store all OOF predictions
        verbose : print progress
        """
        n_samples = len(X)
        splits = self.split(n_samples)

        fold_results = []
        all_predictions = np.full(n_samples, np.nan) if return_predictions else None

        total_start = time.time()

        for fold_idx, (train_idx, val_idx) in enumerate(splits):
            # Extract fold data
            X_train, X_val = X[train_idx], X[val_idx]
            y_train, y_val = y[train_idx], y[val_idx]

            # Handle sample weights
            if sample_weight is not None:
                fold_weights = sample_weight[train_idx]
            else:
                fold_weights = None

            # Train model
            model = model_factory()
            train_start = time.time()
            if fold_weights is not None:
                model.fit(X_train, y_train, sample_weight=fold_weights)
            else:
                model.fit(X_train, y_train)
            train_time = time.time() - train_start

            # Predictions
            predict_start = time.time()
            y_train_pred = model.predict(X_train)
            y_val_pred = model.predict(X_val)
            predict_time = time.time() - predict_start

            # Scores
            train_score = scorer(y_train, y_train_pred)
            val_score = scorer(y_val, y_val_pred)

            # Store OOF predictions
            if return_predictions:
                all_predictions[val_idx] = y_val_pred

            # Record results
            fold_result = FoldResult(
                fold_idx=fold_idx,
                train_size=len(train_idx),
                val_size=len(val_idx),
                train_score=train_score,
                val_score=val_score,
                predictions=y_val_pred,
                train_time=train_time,
                predict_time=predict_time
            )
            fold_results.append(fold_result)

            if verbose:
                print(f"Fold {fold_idx + 1}/{self.k}: "
                      f"train={train_score:.4f}, val={val_score:.4f}, "
                      f"time={train_time:.2f}s")

        total_time = time.time() - total_start

        # Aggregate
        val_scores = [r.val_score for r in fold_results]
        train_scores = [r.train_score for r in fold_results]

        return CVResult(
            fold_results=fold_results,
            mean_train_score=np.mean(train_scores),
            mean_val_score=np.mean(val_scores),
            std_val_score=np.std(val_scores),
            all_predictions=all_predictions,
            total_time=total_time
        )
```

Critical Implementation Considerations:
Fresh Model Each Fold: Never reuse a fitted model between folds. The model_factory pattern ensures each fold gets an unfitted model.
Random State Management: Same random_state produces identical partitions—essential for reproducible experiments.
Edge Case Handling: What if n_samples < k? The implementation should fail gracefully with a clear error.
Memory Efficiency: For large datasets, consider yielding fold splits instead of materializing all at once.
Timing Information: Tracking train/predict time per fold helps identify performance bottlenecks.
The most common CV bugs: (1) Data leakage by fitting preprocessing on full data before splitting, (2) Reusing the same model object across folds (sklearn models accumulate state), (3) Inconsistent random states causing irreproducible results, (4) Computing metrics incorrectly (e.g., accuracy on imbalanced folds without stratification).
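The first of these bugs deserves a concrete illustration because the leaky code runs without error and simply reports optimistic numbers. One common safeguard, sketched below with an assumed scaler-plus-logistic-regression setup, is to wrap preprocessing in a scikit-learn Pipeline so it is re-fit on each fold's training data only. (For plain standardization the numeric gap is often tiny; the pattern matters most for aggressive preprocessing such as feature selection.)

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Leaky pattern: the scaler sees validation rows before any split happens
X_scaled = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(max_iter=1000),
                               X_scaled, y, cv=5)

# Correct pattern: the scaler is fit inside each training fold only
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clean_scores = cross_val_score(pipe, X, y, cv=5)

print(f"Scaler fit on all data:  {leaky_scores.mean():.4f}")
print(f"Scaler fit inside folds: {clean_scores.mean():.4f}")
```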
While understanding the mechanics is crucial, in practice you'll often use scikit-learn's robust implementation. Let's explore the key tools:
```python
from sklearn.model_selection import (
    KFold, cross_val_score, cross_val_predict, cross_validate
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, make_scorer
import numpy as np

# Generate sample data
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10, random_state=42
)

# Method 1: Using cross_val_score (simplest)
# Returns array of k scores
model = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(
    model, X, y,
    cv=5,               # 5-fold CV
    scoring='accuracy'
)
print(f"cross_val_score: {scores.mean():.4f} ± {scores.std():.4f}")
print(f"Individual folds: {scores}")

# Method 2: Using KFold explicitly (more control)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

scores_manual = []
for train_idx, val_idx in kfold.split(X):
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X[train_idx], y[train_idx])
    score = model.score(X[val_idx], y[val_idx])
    scores_manual.append(score)

print(f"Manual KFold: {np.mean(scores_manual):.4f}")

# Method 3: cross_val_predict (get all OOF predictions)
predictions = cross_val_predict(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X, y, cv=5
)
print(f"OOF Accuracy: {accuracy_score(y, predictions):.4f}")
print(f"Predictions shape: {predictions.shape}")  # Same as y

# Method 4: cross_validate (comprehensive output)
cv_results = cross_validate(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X, y,
    cv=5,
    scoring=['accuracy', 'f1', 'roc_auc'],
    return_train_score=True,
    return_estimator=True  # Keep the trained models
)

print("\ncross_validate results:")
print(f"Test Accuracy: {cv_results['test_accuracy'].mean():.4f}")
print(f"Test F1: {cv_results['test_f1'].mean():.4f}")
print(f"Test AUC: {cv_results['test_roc_auc'].mean():.4f}")
print(f"Train Accuracy: {cv_results['train_accuracy'].mean():.4f}")
print(f"Fit times: {cv_results['fit_time']}")
print(f"Number of estimators: {len(cv_results['estimator'])}")
```

| Function | Returns | Best For | Key Parameters |
|---|---|---|---|
| cross_val_score | Array of k scores | Quick evaluation | cv, scoring |
| cross_val_predict | Array of n predictions | OOF predictions for stacking | cv, method ('predict', 'predict_proba') |
| cross_validate | Dict with scores, times, estimators | Comprehensive analysis | scoring (list), return_train_score, return_estimator |
| KFold.split() | Generator of (train, val) indices | Custom loops, preprocessing | n_splits, shuffle, random_state |
Use cross_val_score for quick sanity checks, cross_validate for thorough analysis including train scores and timing, cross_val_predict when you need OOF predictions (e.g., for model stacking or feature engineering), and explicit KFold when you need custom preprocessing inside the loop.
One of the most powerful byproducts of k-fold CV is the generation of out-of-fold (OOF) predictions—predictions for every sample in the dataset, where each prediction comes from a model that never saw that sample during training.
This is conceptually profound: we can generate unbiased predictions for our entire training set, not just a held-out portion.
Key Properties of OOF Predictions:
Every sample receives exactly one prediction, made by a model that never saw that sample during training.
Collectively, OOF predictions behave like predictions on a genuine held-out set, so they support unbiased downstream analysis such as calibration checks, threshold tuning, and stacking.
They come from k different models trained on (k-1)/k of the data rather than from the final full-data model, so they slightly understate full-data performance.
```python
import numpy as np
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, log_loss
import matplotlib.pyplot as plt

# Generate data
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10, random_state=42
)

# Application 1: Reliable Feature Engineering Evaluation
# --------------------------------------------------------
# Generate OOF predictions to use as features (meta-features)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# OOF predictions from multiple models
oof_rf = cross_val_predict(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X, y, cv=kfold, method='predict_proba'
)[:, 1]

oof_lr = cross_val_predict(
    LogisticRegression(max_iter=1000, random_state=42),
    X, y, cv=kfold, method='predict_proba'
)[:, 1]

print(f"RF OOF Log Loss: {log_loss(y, oof_rf):.4f}")
print(f"LR OOF Log Loss: {log_loss(y, oof_lr):.4f}")

# Application 2: Model Stacking with OOF
# ----------------------------------------
# Use OOF predictions as features for a meta-learner

meta_features = np.column_stack([oof_rf, oof_lr])
print(f"Meta-features shape: {meta_features.shape}")

# Train stacking model with CV on meta-features
meta_model = LogisticRegression()
stacked_score = cross_val_predict(
    meta_model, meta_features, y, cv=kfold, method='predict_proba'
)[:, 1]

print(f"Stacked OOF Log Loss: {log_loss(y, stacked_score):.4f}")

# Application 3: Calibration Analysis with OOF
# ----------------------------------------------
# OOF probabilities let us assess calibration on the full dataset

def plot_calibration_curve(y_true, y_prob, n_bins=10, title=""):
    """Plot calibration curve from OOF predictions."""
    bin_means = []
    true_frequencies = []
    bins = np.linspace(0, 1, n_bins + 1)

    for i in range(n_bins):
        mask = (y_prob >= bins[i]) & (y_prob < bins[i+1])
        if mask.sum() > 0:
            bin_means.append(y_prob[mask].mean())
            true_frequencies.append(y_true[mask].mean())

    plt.figure(figsize=(8, 6))
    plt.plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
    plt.plot(bin_means, true_frequencies, 'o-', label='Model')
    plt.xlabel('Mean predicted probability')
    plt.ylabel('Fraction of positives')
    plt.title(f'Calibration Curve: {title}')
    plt.legend()
    plt.grid(True, alpha=0.3)
    return plt.gcf()

# Application 4: Threshold Selection with OOF
# ---------------------------------------------
# Find optimal decision threshold using OOF predictions

from sklearn.metrics import precision_recall_curve, f1_score

precision, recall, thresholds = precision_recall_curve(y, oof_rf)
f1_scores = 2 * (precision * recall) / (precision + recall + 1e-10)
best_threshold = thresholds[np.argmax(f1_scores[:-1])]

print(f"\nOptimal threshold for F1: {best_threshold:.4f}")
print(f"F1 at default (0.5): {f1_score(y, oof_rf > 0.5):.4f}")
print(f"F1 at optimal: {f1_score(y, oof_rf > best_threshold):.4f}")
```

OOF predictions for training data are analogous to predictions on a genuine test set. This means any analysis you can do on test predictions—calibration, threshold tuning, error analysis—can be done on OOF predictions for the training set. This is incredibly valuable when the test set is unavailable or too small for reliable analysis.
We've thoroughly explored the mechanics and properties of k-fold cross-validation. Let's crystallize the essential takeaways:
What's Next:
Now that we understand the k-fold procedure, we'll explore the critical question of how to choose k. The next page examines the bias-variance tradeoff in cross-validation, showing how different values of k affect the quality and reliability of our performance estimates.
You now have a complete understanding of the k-fold cross-validation procedure—from high-level motivation through mathematical properties to production-quality implementation. This foundational knowledge prepares you to make informed decisions about evaluation strategies in your own projects.