Imagine you're building a fraud detection model with 99% legitimate transactions and 1% fraudulent ones. You split your data into 5 folds for cross-validation, but by pure chance, one fold ends up with 0.3% fraud cases while another has 1.7%. Your model's performance varies wildly across folds—not because of model instability, but because each fold represents a fundamentally different problem.
This isn't a hypothetical scenario. Unstratified cross-validation on imbalanced data is one of the most common sources of misleading model evaluation in production machine learning. The variance in your performance estimates becomes a function of sampling luck rather than true model quality.
Stratified k-fold cross-validation solves this problem by ensuring every fold mirrors the original class distribution. This seemingly simple modification has profound implications for the reliability of your performance estimates and the validity of your model selection decisions.
By the end of this page, you will deeply understand the mathematical foundations of stratified sampling, why it dramatically improves cross-validation reliability for classification problems, how to implement it correctly for binary and multi-class scenarios, and when stratification is essential versus optional.
Standard k-fold cross-validation randomly partitions data into k equal-sized folds. While this works well for balanced datasets, it can produce severely biased folds when class proportions are unequal.
The Mathematics of Random Sampling Variance
Consider a dataset with n samples and a minority class proportion p. When we randomly sample n/k examples for a fold, the number of minority class examples follows a binomial distribution:
$$X \sim \text{Binomial}(n/k, p)$$
The expected count is $\mu = np/k$, and the standard deviation is:
$$\sigma = \sqrt{\frac{np(1-p)}{k}}$$
For small p (imbalanced classes), this variance relative to the expected count becomes significant:
$$\text{CV} = \frac{\sigma}{\mu} = \sqrt{\frac{(1-p)k}{np}}$$
The coefficient of variation (CV) explodes as p decreases, meaning fold-to-fold variation in class proportions becomes extreme for rare classes.
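To see the effect concretely, here is a quick numeric check of these formulas (a sketch; the sample sizes n = 10,000 and k = 5 are illustrative, not taken from a real dataset):

```python
import math

def fold_count_stats(n: int, k: int, p: float):
    """Mean, std dev, and coefficient of variation of the minority-class
    count in one fold of size n/k, under Binomial(n/k, p) sampling."""
    m = n / k                      # fold size
    mu = m * p                     # expected minority count: np/k
    sigma = math.sqrt(m * p * (1 - p))
    return mu, sigma, sigma / mu   # CV = sqrt((1-p)k/(np))

for p in (0.30, 0.10, 0.01, 0.001):
    mu, sigma, cv = fold_count_stats(n=10_000, k=5, p=p)
    print(f"p={p:5.3f}  expected={mu:7.1f}  std={sigma:6.2f}  CV={cv:.3f}")
```

At p = 0.30 the relative spread is tiny, but at p = 0.001 the standard deviation is comparable to the expected count itself, so some folds can easily end up with zero minority examples.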
When class proportions vary across folds, you're not just measuring variance in model performance—you're measuring variance in the difficulty of each fold's classification problem. This conflates two sources of uncertainty and inflates your confidence intervals.
Why This Matters for Model Selection
High variance in performance estimates due to fold imbalance causes several problems:
Unstable Rankings: The best model on one fold split may be the worst on another, not due to true model differences but sampling artifacts.
Incorrect Confidence Intervals: Standard formulas for cross-validation confidence intervals assume i.i.d. fold errors. Systematic fold imbalance violates this assumption.
Overfitting to Fold Composition: If you tune hyperparameters using unstratified CV, you may inadvertently select models that happen to perform well on your specific (lucky or unlucky) fold splits.
Misleading Generalization Estimates: Production data will have the true class distribution, but your evaluation saw distorted distributions in each fold.
Stratified sampling is a technique from survey statistics that partitions a population into homogeneous subgroups (strata) and samples proportionally from each. Applied to cross-validation, we treat class labels as strata and ensure each fold maintains the original class proportions.
Formal Definition
Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ be a dataset with $C$ classes. Define class subsets:
$$\mathcal{D}_c = \{(x_i, y_i) \in \mathcal{D} : y_i = c\}$$
with sizes $n_c = |\mathcal{D}_c|$ and proportions $p_c = n_c / n$.
Stratified k-fold partitions each class separately into k groups:
$$\mathcal{D}_c = \mathcal{D}_c^{(1)} \cup \mathcal{D}_c^{(2)} \cup \cdots \cup \mathcal{D}_c^{(k)}$$
where $|\mathcal{D}_c^{(j)}| \approx n_c / k$ for all folds j.
The final fold j is formed by union across classes:
$$\mathcal{F}^{(j)} = \bigcup_{c=1}^{C} \mathcal{D}_c^{(j)}$$
This guarantees that each fold has approximately $n_c / k$ examples of class c, preserving the original proportions.
When n_c is not divisible by k, some folds will have one more example of class c than others. This is unavoidable but minimal. For a class with 17 examples and k=5, three folds get 3 examples and two folds get 4. The maximum proportional deviation is (4-3)/3.4 ≈ 29% for that class, but in absolute terms, it's just one example.
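The allocation arithmetic above can be checked with a few lines (a minimal sketch of the remainder-handling rule):

```python
def per_fold_counts(n_c: int, k: int) -> list:
    """Samples of a class assigned to each of k folds: the first
    n_c mod k folds receive one extra example."""
    base, rem = divmod(n_c, k)
    return [base + 1 if j < rem else base for j in range(k)]

counts = per_fold_counts(17, 5)
print(counts)                     # two folds get 4, three get 3
avg = 17 / 5                      # 3.4 expected per fold
print((max(counts) - min(counts)) / avg)  # (4-3)/3.4, about 0.29
```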
Variance Reduction Properties
The key insight is that stratified sampling eliminates between-stratum variance from the estimator. In standard random sampling, the variance of the mean estimator is:
$$\text{Var}_{\text{SRS}}(\bar{y}) = \frac{S^2}{n}$$
where $S^2$ is the population variance.
With stratified sampling:
$$\text{Var}_{\text{strat}}(\bar{y}) = \sum_{c=1}^{C} \left(\frac{n_c}{n}\right)^2 \frac{S_c^2}{n_c} = \frac{1}{n} \sum_{c=1}^{C} \frac{n_c}{n} S_c^2$$
where $S_c^2$ is the within-stratum variance for class c.
The variance reduction is:
$$\text{Var}_{\text{SRS}} - \text{Var}_{\text{strat}} = \frac{1}{n} \sum_{c=1}^{C} \frac{n_c}{n} (\mu_c - \mu)^2$$
This equals zero only when all class means are identical. For classification with distinct class-conditional distributions, stratification always reduces variance.
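This decomposition is just the law of total variance, and it can be verified numerically (a sketch with made-up class-conditional values; population variances with ddof=0 are used so the identity holds exactly):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two strata with different means, so stratification should help
y0 = rng.normal(loc=0.0, scale=1.0, size=900)   # class 0 values
y1 = rng.normal(loc=3.0, scale=1.0, size=100)   # class 1 values
y = np.concatenate([y0, y1])
n = len(y)

var_srs = y.var() / n  # S^2 / n (population variance, ddof=0)

# Stratified variance: (1/n) * sum_c (n_c/n) * S_c^2
var_strat = (len(y0) / n * y0.var() + len(y1) / n * y1.var()) / n

# Between-stratum term: (1/n) * sum_c (n_c/n) * (mu_c - mu)^2
mu = y.mean()
between = (len(y0) / n * (y0.mean() - mu) ** 2
           + len(y1) / n * (y1.mean() - mu) ** 2) / n

print(var_srs - var_strat, between)  # the two quantities agree
assert np.isclose(var_srs - var_strat, between)
```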
| Component | Simple Random Sampling | Stratified Sampling |
|---|---|---|
| Within-stratum variance | Included | Included |
| Between-stratum variance | Included | Eliminated |
| Allocation overhead | None | Minor (proportional assignment) |
| Fold size guarantee | Approximate | Exact per stratum |
The stratified k-fold algorithm combines proportional allocation with randomization within strata. Here's the complete procedure:
Algorithm: Stratified K-Fold Partitioning
Input: Dataset D of n samples, labels y, number of folds k
Output: k fold assignments {F_1, F_2, ..., F_k}
1. Group indices by class:
For each class c in unique(y):
I_c ← indices where y = c
Shuffle I_c randomly
2. Initialize k empty folds:
For j = 1 to k:
F_j ← []
3. Allocate each class proportionally:
For each class c:
n_c ← length(I_c)
base_per_fold ← floor(n_c / k)
remainder ← n_c mod k
idx ← 0
For j = 1 to k:
count ← base_per_fold + (1 if j ≤ remainder else 0)
F_j ← F_j ∪ I_c[idx : idx + count]
idx ← idx + count
4. Shuffle within each fold (optional, for iteration randomization)
Return {F_1, F_2, ..., F_k}
The shuffle in Step 1 ensures different runs produce different stratified splits. The allocation in Step 3 handles remainders by giving one extra sample to the first remainder folds.
```python
import numpy as np
from collections import defaultdict
from typing import List, Tuple


def stratified_kfold_split(
    X: np.ndarray,
    y: np.ndarray,
    k: int = 5,
    shuffle: bool = True,
    random_state: int = None
) -> List[Tuple[np.ndarray, np.ndarray]]:
    """
    Perform stratified k-fold splitting from scratch.

    Parameters
    ----------
    X : np.ndarray
        Feature matrix of shape (n_samples, n_features)
    y : np.ndarray
        Label array of shape (n_samples,)
    k : int
        Number of folds
    shuffle : bool
        Whether to shuffle within each class before splitting
    random_state : int
        Random seed for reproducibility

    Returns
    -------
    List of (train_indices, test_indices) tuples for each fold
    """
    if random_state is not None:
        np.random.seed(random_state)

    n_samples = len(y)

    # Step 1: Group indices by class label
    class_indices = defaultdict(list)
    for idx, label in enumerate(y):
        class_indices[label].append(idx)

    # Convert to arrays and optionally shuffle
    for label in class_indices:
        class_indices[label] = np.array(class_indices[label])
        if shuffle:
            np.random.shuffle(class_indices[label])

    # Step 2: Initialize fold assignments
    # fold_assignments[i] = fold number for sample i
    fold_assignments = np.zeros(n_samples, dtype=int)

    # Step 3: Allocate each class proportionally to folds
    for label, indices in class_indices.items():
        n_class = len(indices)
        base_per_fold = n_class // k
        remainder = n_class % k

        current_idx = 0
        for fold in range(k):
            # First 'remainder' folds get one extra sample
            count = base_per_fold + (1 if fold < remainder else 0)
            fold_indices = indices[current_idx : current_idx + count]
            fold_assignments[fold_indices] = fold
            current_idx += count

    # Step 4: Generate train/test splits for each fold
    splits = []
    all_indices = np.arange(n_samples)
    for fold in range(k):
        test_mask = fold_assignments == fold
        train_indices = all_indices[~test_mask]
        test_indices = all_indices[test_mask]
        if shuffle:
            np.random.shuffle(train_indices)
        splits.append((train_indices, test_indices))

    return splits


def verify_stratification(y: np.ndarray, splits: List[Tuple]) -> None:
    """Verify that splits maintain class proportions."""
    original_dist = {}
    for label in np.unique(y):
        original_dist[label] = np.mean(y == label)
    print(f"Original distribution: {original_dist}")
    print()

    for fold_idx, (train_idx, test_idx) in enumerate(splits):
        y_train, y_test = y[train_idx], y[test_idx]
        train_dist = {lbl: np.mean(y_train == lbl) for lbl in np.unique(y)}
        test_dist = {lbl: np.mean(y_test == lbl) for lbl in np.unique(y)}
        print(f"Fold {fold_idx + 1}:")
        print(f"  Train size: {len(train_idx)}, Test size: {len(test_idx)}")
        print(f"  Train dist: {train_dist}")
        print(f"  Test dist: {test_dist}")
        # Check deviation from original
        max_deviation = max(
            abs(test_dist[lbl] - original_dist[lbl]) for lbl in original_dist
        )
        print(f"  Max deviation from original: {max_deviation:.4f}")
        print()


# Demonstration
if __name__ == "__main__":
    # Create imbalanced dataset
    np.random.seed(42)
    n_majority = 900
    n_minority = 100
    X = np.random.randn(n_majority + n_minority, 10)
    y = np.array([0] * n_majority + [1] * n_minority)

    # Stratified split
    print("=" * 50)
    print("STRATIFIED K-FOLD SPLIT")
    print("=" * 50)
    splits = stratified_kfold_split(X, y, k=5, random_state=42)
    verify_stratification(y, splits)

    # Compare with random split (non-stratified)
    print("=" * 50)
    print("RANDOM (NON-STRATIFIED) SPLIT FOR COMPARISON")
    print("=" * 50)
    indices = np.random.permutation(len(y))
    fold_size = len(y) // 5
    random_splits = [
        (np.concatenate([indices[:i*fold_size], indices[(i+1)*fold_size:]]),
         indices[i*fold_size:(i+1)*fold_size])
        for i in range(5)
    ]
    verify_stratification(y, random_splits)
```

The remainder-handling logic (giving extra samples to the first few folds) is deterministic. When combined with shuffling, this ensures both reproducibility with the same seed and randomness across different seeds. Scikit-learn's implementation uses the same approach.
For multi-class problems with C classes, stratified k-fold extends naturally—we simply stratify on all C classes simultaneously. However, several nuances arise:
Challenge 1: Rare Classes with Fewer Samples Than Folds
If a class has fewer than k samples, we cannot place at least one sample in each fold. Options include reducing k, merging the rare class with a semantically related class, oversampling the rare class before splitting (taking care that duplicates do not leak across folds), or excluding it from stratification. Note that scikit-learn's StratifiedKFold raises an error when any class has fewer members than n_splits.
Challenge 2: Class Label Ties
When multiple classes have exactly the same number of samples and that number isn't divisible by k, the assignment of "extra" samples to folds can create subtle imbalances. Most implementations randomize this assignment.
Multi-Label Stratification
Multi-label problems, where each sample can belong to multiple classes simultaneously, present a harder stratification challenge. Simple per-label stratification fails because label correlations matter.
Iterative Stratification Algorithm (Sechidis et al., 2011):
This algorithm greedily assigns samples to folds while trying to balance all label proportions simultaneously: it computes a desired count of each label per fold, repeatedly selects the label with the fewest remaining unassigned examples, and places each of those examples into the fold whose current count of that label falls furthest short of its target (breaking ties by remaining fold capacity).
The algorithm prioritizes rare labels, ensuring they're well-distributed before common labels (which have more slack).
```python
import numpy as np
from typing import List


def iterative_stratification(
    y: np.ndarray,
    k: int = 5,
    random_state: int = None
) -> List[np.ndarray]:
    """
    Iterative stratification for multi-label data.

    Based on: Sechidis et al., "On the Stratification of
    Multi-Label Data" (2011)

    Parameters
    ----------
    y : np.ndarray
        Binary label matrix of shape (n_samples, n_labels)
    k : int
        Number of folds
    random_state : int
        Random seed

    Returns
    -------
    List of index arrays, one per fold
    """
    if random_state is not None:
        np.random.seed(random_state)

    n_samples, n_labels = y.shape

    # Desired samples per fold
    samples_per_fold = np.full(k, n_samples // k)
    samples_per_fold[:n_samples % k] += 1

    # Desired label counts per fold (proportional allocation)
    label_counts = y.sum(axis=0)  # Total per label
    desired_per_fold = np.zeros((k, n_labels))
    for label in range(n_labels):
        # Distribute label proportionally (cast to int so slicing works)
        base = int(label_counts[label]) // k
        remainder = int(label_counts[label]) % k
        desired_per_fold[:, label] = base
        desired_per_fold[:remainder, label] += 1

    # Track current allocations
    current_counts = np.zeros((k, n_labels))
    fold_sizes = np.zeros(k)

    # Initialize folds
    folds = [[] for _ in range(k)]
    unassigned = set(range(n_samples))

    # Process labels from rarest to most common
    while unassigned:
        # Find label with minimum total remaining samples
        remaining_per_label = {}
        for label in range(n_labels):
            count = sum(y[i, label] for i in unassigned)
            if count > 0:
                remaining_per_label[label] = count

        if not remaining_per_label:
            # Assign remaining samples (no labels) arbitrarily
            for idx in list(unassigned):
                fold = int(np.argmin(fold_sizes))
                folds[fold].append(idx)
                fold_sizes[fold] += 1
                unassigned.remove(idx)
            break

        # Select rarest label
        rarest_label = min(remaining_per_label, key=remaining_per_label.get)

        # Get samples with this label that are unassigned
        candidates = [i for i in unassigned if y[i, rarest_label] == 1]
        np.random.shuffle(candidates)

        for idx in candidates:
            # Find fold with greatest "need" for this label
            # Need = desired - current, but also respect fold size limits
            needs = desired_per_fold[:, rarest_label] - current_counts[:, rarest_label]

            # Among folds with positive need, prefer those with more space
            space = samples_per_fold - fold_sizes

            # Score combines label need and space
            scores = needs + 0.01 * space  # Small tiebreaker for space

            # Among valid folds (with space), pick highest score
            valid_folds = np.where(space > 0)[0]
            if len(valid_folds) == 0:
                valid_folds = np.arange(k)  # Fallback

            best_fold = valid_folds[np.argmax(scores[valid_folds])]

            # Assign sample to fold
            folds[best_fold].append(idx)
            fold_sizes[best_fold] += 1
            current_counts[best_fold] += y[idx]
            unassigned.remove(idx)

    return [np.array(fold) for fold in folds]


# Demonstration
if __name__ == "__main__":
    # Create multi-label dataset
    np.random.seed(42)
    n_samples = 1000
    n_labels = 5

    # Create imbalanced multi-label data
    # Label frequencies: 50%, 30%, 10%, 5%, 2%
    frequencies = [0.50, 0.30, 0.10, 0.05, 0.02]
    y = np.zeros((n_samples, n_labels))
    for label, freq in enumerate(frequencies):
        y[:, label] = np.random.random(n_samples) < freq

    print("Original label distribution:")
    print(f"  Total samples: {n_samples}")
    for label in range(n_labels):
        count = y[:, label].sum()
        print(f"  Label {label}: {count} ({count/n_samples:.1%})")
    print()

    # Apply iterative stratification
    folds = iterative_stratification(y, k=5, random_state=42)

    print("Fold-wise label distribution:")
    for fold_idx, fold_indices in enumerate(folds):
        y_fold = y[fold_indices]
        print(f"  Fold {fold_idx + 1} (n={len(fold_indices)}):")
        for label in range(n_labels):
            count = y_fold[:, label].sum()
            pct = count / len(fold_indices)
            orig_pct = frequencies[label]
            diff = abs(pct - orig_pct)
            print(f"    Label {label}: {int(count):3d} ({pct:.1%}) "
                  f"[deviation: {diff:.1%}]")
```

Stratification isn't always necessary. Understanding when it provides significant benefit helps you make informed decisions.
Essential: High-Impact Scenarios
Stratification matters most when classes are imbalanced, when the dataset is small, when rare classes would otherwise vanish from some folds entirely, or when you report metrics that are sensitive to class proportions (minority-class recall, F1, PR-AUC).
Optional: Lower-Impact Scenarios
With large, roughly balanced datasets, random folds already approximate the overall class distribution closely, so stratification changes little (though it still does no harm).
In practice, always use stratified k-fold for classification problems unless you have a specific reason not to. The computational overhead is negligible, and the benefits for imbalanced data are substantial. Scikit-learn's StratifiedKFold should be your default cross-validation strategy.
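As a concrete illustration of that advice (a sketch on synthetic 90/10 data), plain KFold lets the minority count vary across folds, while StratifiedKFold pins it at exactly one fifth of the minority class per fold:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

y = np.array([0] * 900 + [1] * 100)   # 90/10 imbalance
X = np.zeros((1000, 1))               # features don't affect the split

kfold_counts = [int(y[test].sum())
                for _, test in KFold(5, shuffle=True, random_state=0).split(X, y)]
strat_counts = [int(y[test].sum())
                for _, test in StratifiedKFold(5, shuffle=True, random_state=0).split(X, y)]

print("KFold minority per fold:          ", kfold_counts)
print("StratifiedKFold minority per fold:", strat_counts)  # exactly 20 each
```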
Stratification for Regression: Target Binning
While regression lacks discrete classes, you can create pseudo-classes by binning the target variable:
```python
from sklearn.model_selection import StratifiedKFold
import pandas as pd

# Bin continuous target into quintiles
y_binned = pd.qcut(y, q=5, labels=False)

# Use binned labels for stratification
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y_binned):
    # Use original y for training/evaluation
    X_train, y_train = X[train_idx], y[train_idx]  # Original y!
    X_test, y_test = X[test_idx], y[test_idx]
```
This ensures each fold has a similar distribution of target values, which is especially useful for skewed regression targets.
Let's examine production-quality patterns for using stratified k-fold in real ML pipelines.
Pattern 1: Standard Classification Pipeline
```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import make_scorer, f1_score, recall_score, precision_score
from typing import Dict, Any


def create_evaluation_pipeline(
    model,
    X: np.ndarray,
    y: np.ndarray,
    n_splits: int = 5,
    random_state: int = 42
) -> Dict[str, Any]:
    """
    Comprehensive stratified CV evaluation pipeline.

    Returns multiple metrics with confidence intervals.
    """
    # Define stratified cross-validator
    cv = StratifiedKFold(
        n_splits=n_splits,
        shuffle=True,
        random_state=random_state
    )

    # Define multiple scoring metrics
    scoring = {
        'accuracy': 'accuracy',
        'f1': make_scorer(f1_score, average='binary'),
        'recall': make_scorer(recall_score, average='binary'),
        'precision': make_scorer(precision_score, average='binary'),
        'roc_auc': 'roc_auc',
    }

    # Create pipeline with preprocessing
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('model', model)
    ])

    # Perform cross-validation
    cv_results = cross_validate(
        pipeline, X, y,
        cv=cv,
        scoring=scoring,
        return_train_score=True,
        n_jobs=-1  # Parallelize across folds
    )

    # Compute statistics
    results = {}
    for metric in scoring.keys():
        test_scores = cv_results[f'test_{metric}']
        train_scores = cv_results[f'train_{metric}']
        results[metric] = {
            'test_mean': np.mean(test_scores),
            'test_std': np.std(test_scores),
            'test_ci_95': (
                np.mean(test_scores) - 1.96 * np.std(test_scores) / np.sqrt(n_splits),
                np.mean(test_scores) + 1.96 * np.std(test_scores) / np.sqrt(n_splits)
            ),
            'train_mean': np.mean(train_scores),
            'train_std': np.std(train_scores),
            'fold_scores': test_scores.tolist(),
            'overfit_gap': np.mean(train_scores) - np.mean(test_scores)
        }

    # Add timing information
    results['timing'] = {
        'fit_time_mean': np.mean(cv_results['fit_time']),
        'score_time_mean': np.mean(cv_results['score_time']),
    }

    return results


def print_cv_report(results: Dict[str, Any]) -> None:
    """Print formatted cross-validation report."""
    print("=" * 60)
    print("STRATIFIED K-FOLD CROSS-VALIDATION REPORT")
    print("=" * 60)
    for metric in ['accuracy', 'f1', 'recall', 'precision', 'roc_auc']:
        if metric not in results:
            continue
        r = results[metric]
        print(f"{metric.upper()}:")
        print(f"  Test: {r['test_mean']:.4f} ± {r['test_std']:.4f}")
        print(f"  95% CI: [{r['test_ci_95'][0]:.4f}, {r['test_ci_95'][1]:.4f}]")
        print(f"  Fold scores: {[f'{s:.4f}' for s in r['fold_scores']]}")
        print(f"  Overfit gap: {r['overfit_gap']:.4f}")
    print("TIMING:")
    print(f"  Avg fit time: {results['timing']['fit_time_mean']:.2f}s")
    print(f"  Avg score time: {results['timing']['score_time_mean']:.2f}s")


# Example usage
if __name__ == "__main__":
    from sklearn.datasets import make_classification

    # Create imbalanced dataset
    X, y = make_classification(
        n_samples=2000,
        n_features=20,
        n_informative=10,
        n_redundant=5,
        n_classes=2,
        weights=[0.9, 0.1],  # 90/10 class imbalance
        random_state=42
    )

    print(f"Dataset shape: {X.shape}")
    print(f"Class distribution: {np.bincount(y)}")
    print()

    model = RandomForestClassifier(
        n_estimators=100,
        max_depth=10,
        random_state=42
    )

    results = create_evaluation_pipeline(model, X, y)
    print_cv_report(results)
```

Pattern 2: Repeated Stratified K-Fold for Reduced Variance
Single k-fold CV depends on one random split. Repeated stratified k-fold runs the process multiple times with different random seeds, then averages results for more stable estimates:
```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import RepeatedStratifiedKFold


def repeated_stratified_evaluation(
    model, X, y,
    n_splits: int = 5,
    n_repeats: int = 10,
    random_state: int = 42
):
    """
    Repeated stratified k-fold for robust performance estimation.

    Total evaluations: n_splits × n_repeats
    """
    cv = RepeatedStratifiedKFold(
        n_splits=n_splits,
        n_repeats=n_repeats,
        random_state=random_state
    )

    # Track scores across all repetitions
    all_scores = []
    repeat_scores = []   # Scores grouped by repetition
    current_repeat = []
    fold_count = 0

    for train_idx, test_idx in cv.split(X, y):
        model_clone = clone(model)
        model_clone.fit(X[train_idx], y[train_idx])
        score = model_clone.score(X[test_idx], y[test_idx])

        all_scores.append(score)
        current_repeat.append(score)
        fold_count += 1

        if fold_count % n_splits == 0:
            repeat_scores.append(np.mean(current_repeat))
            current_repeat = []

    return {
        'overall_mean': np.mean(all_scores),
        'overall_std': np.std(all_scores),
        'repeat_means': repeat_scores,                # Mean per repetition
        'between_repeat_std': np.std(repeat_scores),  # Variability across repeats
        'within_repeat_std': np.mean([
            np.std(all_scores[i*n_splits:(i+1)*n_splits])
            for i in range(n_repeats)
        ]),                                           # Avg variability within repeats
        'se_of_mean': np.std(repeat_scores) / np.sqrt(n_repeats)  # Standard error
    }
```

Repeated CV lets you decompose total variance into 'between-repeat' variance (how much results change with different splits) and 'within-repeat' variance (fold-to-fold variation). High between-repeat variance suggests your single CV estimate is unreliable—the model's apparent performance is sensitive to which samples happen to be in which fold.
Even with stratified k-fold, several subtle errors can undermine your evaluation.
Forgetting to Shuffle: Without shuffle=True, consecutive samples from the same class end up in the same fold if your data is sorted by class. Always shuffle.

We've now covered the complete theory and practice of stratified k-fold cross-validation.
You now understand how to preserve class distributions across folds. But what if your samples aren't independent? Medical images from the same patient, transactions from the same user, or measurements from the same sensor violate the i.i.d. assumption in ways that stratification alone cannot address. The next page covers Group K-Fold, which handles correlated samples by ensuring no group appears in both training and test sets.