Imagine you're building a fraud detection model with 99% legitimate transactions and 1% fraudulent ones. You split your data into 5 folds for cross-validation, but by pure chance, one fold ends up with 0.3% fraud cases while another has 1.7%. Your model's performance varies wildly across folds—not because of model instability, but because each fold represents a fundamentally different problem.
This isn't a hypothetical scenario. Unstratified cross-validation on imbalanced data is one of the most common sources of misleading model evaluation in production machine learning. The variance in your performance estimates becomes a function of sampling luck rather than true model quality.
Stratified k-fold cross-validation solves this problem by ensuring every fold mirrors the original class distribution. This seemingly simple modification has profound implications for the reliability of your performance estimates and the validity of your model selection decisions.
By the end of this page, you will deeply understand the mathematical foundations of stratified sampling, why it dramatically improves cross-validation reliability for classification problems, how to implement it correctly for binary and multi-class scenarios, and when stratification is essential versus optional.
Standard k-fold cross-validation randomly partitions data into k equal-sized folds. While this works well for balanced datasets, it can produce severely biased folds when class proportions are unequal.
The Mathematics of Random Sampling Variance
Consider a dataset with n samples and a minority class proportion p. When we randomly sample n/k examples for a fold, the number of minority class examples follows a binomial distribution:
$$X \sim \text{Binomial}(n/k, p)$$
The expected count is $\mu = np/k$, and the standard deviation is:
$$\sigma = \sqrt{\frac{np(1-p)}{k}}$$
For small p (imbalanced classes), this variance relative to the expected count becomes significant:
$$\text{CV} = \frac{\sigma}{\mu} = \sqrt{\frac{(1-p)k}{np}}$$
The coefficient of variation (CV) explodes as p decreases, meaning fold-to-fold variation in class proportions becomes extreme for rare classes.
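To see the effect concretely, here is a quick numeric check of these formulas (a sketch; the sample sizes n = 10,000 and k = 5 are illustrative, not taken from a real dataset):

```python
import math

def fold_count_stats(n: int, k: int, p: float):
    """Mean, std dev, and coefficient of variation of the minority-class
    count in one fold of size n/k, under Binomial(n/k, p) sampling."""
    m = n / k                      # fold size
    mu = m * p                     # expected minority count: np/k
    sigma = math.sqrt(m * p * (1 - p))
    return mu, sigma, sigma / mu   # CV = sqrt((1-p)k/(np))

for p in (0.30, 0.10, 0.01, 0.001):
    mu, sigma, cv = fold_count_stats(n=10_000, k=5, p=p)
    print(f"p={p:5.3f}  expected={mu:7.1f}  std={sigma:6.2f}  CV={cv:.3f}")
```

At p = 0.30 the relative spread is tiny, but at p = 0.001 the standard deviation is comparable to the expected count itself, so some folds can easily end up with zero minority examples.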
When class proportions vary across folds, you're not just measuring variance in model performance—you're measuring variance in the difficulty of each fold's classification problem. This conflates two sources of uncertainty and inflates your confidence intervals.
Why This Matters for Model Selection
High variance in performance estimates due to fold imbalance causes several problems:
Unstable Rankings: The best model on one fold split may be the worst on another, not due to true model differences but sampling artifacts.
Incorrect Confidence Intervals: Standard formulas for cross-validation confidence intervals assume i.i.d. fold errors. Systematic fold imbalance violates this assumption.
Overfitting to Fold Composition: If you tune hyperparameters using unstratified CV, you may inadvertently select models that happen to perform well on your specific (lucky or unlucky) fold splits.
Misleading Generalization Estimates: Production data will have the true class distribution, but your evaluation saw distorted distributions in each fold.
Stratified sampling is a technique from survey statistics that partitions a population into homogeneous subgroups (strata) and samples proportionally from each. Applied to cross-validation, we treat class labels as strata and ensure each fold maintains the original class proportions.
Formal Definition
Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ be a dataset with $C$ classes. Define class subsets:
$$\mathcal{D}_c = \{(x_i, y_i) \in \mathcal{D} : y_i = c\}$$
with sizes $n_c = |\mathcal{D}_c|$ and proportions $p_c = n_c / n$.
Stratified k-fold partitions each class separately into k groups:
$$\mathcal{D}_c = \mathcal{D}_c^{(1)} \cup \mathcal{D}_c^{(2)} \cup \cdots \cup \mathcal{D}_c^{(k)}$$
where $|\mathcal{D}_c^{(j)}| \approx n_c / k$ for all folds j.
The final fold j is formed by union across classes:
$$\mathcal{F}^{(j)} = \bigcup_{c=1}^{C} \mathcal{D}_c^{(j)}$$
This guarantees that each fold has approximately $n_c / k$ examples of class c, preserving the original proportions.
When n_c is not divisible by k, some folds will have one more example of class c than others. This is unavoidable but minimal. For a class with 17 examples and k=5, three folds get 3 examples and two folds get 4. The maximum proportional deviation is (4-3)/3.4 ≈ 29% for that class, but in absolute terms, it's just one example.
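The allocation arithmetic above can be checked with a few lines (a minimal sketch of the remainder-handling rule):

```python
def per_fold_counts(n_c: int, k: int) -> list:
    """Samples of a class assigned to each of k folds: the first
    n_c mod k folds receive one extra example."""
    base, rem = divmod(n_c, k)
    return [base + 1 if j < rem else base for j in range(k)]

counts = per_fold_counts(17, 5)
print(counts)                     # two folds get 4, three get 3
avg = 17 / 5                      # 3.4 expected per fold
print((max(counts) - min(counts)) / avg)  # (4-3)/3.4, about 0.29
```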
Variance Reduction Properties
The key insight is that stratified sampling eliminates between-stratum variance from the estimator. In standard random sampling, the variance of the mean estimator is:
$$\text{Var}_{\text{SRS}}(\bar{y}) = \frac{S^2}{n}$$
where $S^2$ is the population variance.
With stratified sampling:
$$\text{Var}_{\text{strat}}(\bar{y}) = \sum_{c=1}^{C} \left(\frac{n_c}{n}\right)^2 \frac{S_c^2}{n_c} = \frac{1}{n} \sum_{c=1}^{C} \frac{n_c}{n} S_c^2$$
where $S_c^2$ is the within-stratum variance for class c.
The variance reduction is:
$$\text{Var}_{\text{SRS}} - \text{Var}_{\text{strat}} = \frac{1}{n} \sum_{c=1}^{C} \frac{n_c}{n} (\mu_c - \mu)^2$$
This equals zero only when all class means are identical. For classification with distinct class-conditional distributions, stratification always reduces variance.
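This decomposition is just the law of total variance, and it can be verified numerically (a sketch with made-up class-conditional values; population variances with ddof=0 are used so the identity holds exactly):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two strata with different means, so stratification should help
y0 = rng.normal(loc=0.0, scale=1.0, size=900)   # class 0 values
y1 = rng.normal(loc=3.0, scale=1.0, size=100)   # class 1 values
y = np.concatenate([y0, y1])
n = len(y)

var_srs = y.var() / n  # S^2 / n (population variance, ddof=0)

# Stratified variance: (1/n) * sum_c (n_c/n) * S_c^2
var_strat = (len(y0) / n * y0.var() + len(y1) / n * y1.var()) / n

# Between-stratum term: (1/n) * sum_c (n_c/n) * (mu_c - mu)^2
mu = y.mean()
between = (len(y0) / n * (y0.mean() - mu) ** 2
           + len(y1) / n * (y1.mean() - mu) ** 2) / n

print(var_srs - var_strat, between)  # the two quantities agree
assert np.isclose(var_srs - var_strat, between)
```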
| Component | Simple Random Sampling | Stratified Sampling |
|---|---|---|
| Within-stratum variance | Included | Included |
| Between-stratum variance | Included | Eliminated |
| Allocation overhead | None | Minor (proportional assignment) |
| Fold size guarantee | Approximate | Exact per stratum |
The stratified k-fold algorithm combines proportional allocation with randomization within strata. Here's the complete procedure:
Algorithm: Stratified K-Fold Partitioning
Input: Dataset D of n samples, labels y, number of folds k
Output: k fold assignments {F_1, F_2, ..., F_k}
1. Group indices by class:
For each class c in unique(y):
I_c ← indices where y = c
Shuffle I_c randomly
2. Initialize k empty folds:
For j = 1 to k:
F_j ← []
3. Allocate each class proportionally:
For each class c:
n_c ← length(I_c)
base_per_fold ← floor(n_c / k)
remainder ← n_c mod k
idx ← 0
For j = 1 to k:
count ← base_per_fold + (1 if j ≤ remainder else 0)
F_j ← F_j ∪ I_c[idx : idx + count]
idx ← idx + count
4. Shuffle within each fold (optional, for iteration randomization)
Return {F_1, F_2, ..., F_k}
The shuffle in Step 1 ensures different runs produce different stratified splits. The allocation in Step 3 handles remainders by giving one extra sample to the first remainder folds.
```python
import numpy as np
from collections import defaultdict
from typing import List, Tuple


def stratified_kfold_split(
    X: np.ndarray,
    y: np.ndarray,
    k: int = 5,
    shuffle: bool = True,
    random_state: int = None
) -> List[Tuple[np.ndarray, np.ndarray]]:
    """
    Perform stratified k-fold splitting from scratch.

    Parameters
    ----------
    X : np.ndarray
        Feature matrix of shape (n_samples, n_features)
    y : np.ndarray
        Label array of shape (n_samples,)
    k : int
        Number of folds
    shuffle : bool
        Whether to shuffle within each class before splitting
    random_state : int
        Random seed for reproducibility

    Returns
    -------
    List of (train_indices, test_indices) tuples for each fold
    """
    if random_state is not None:
        np.random.seed(random_state)

    n_samples = len(y)

    # Step 1: Group indices by class label
    class_indices = defaultdict(list)
    for idx, label in enumerate(y):
        class_indices[label].append(idx)

    # Convert to arrays and optionally shuffle
    for label in class_indices:
        class_indices[label] = np.array(class_indices[label])
        if shuffle:
            np.random.shuffle(class_indices[label])

    # Step 2: Initialize fold assignments
    # fold_assignments[i] = fold number for sample i
    fold_assignments = np.zeros(n_samples, dtype=int)

    # Step 3: Allocate each class proportionally to folds
    for label, indices in class_indices.items():
        n_class = len(indices)
        base_per_fold = n_class // k
        remainder = n_class % k

        current_idx = 0
        for fold in range(k):
            # First 'remainder' folds get one extra sample
            count = base_per_fold + (1 if fold < remainder else 0)
            fold_indices = indices[current_idx : current_idx + count]
            fold_assignments[fold_indices] = fold
            current_idx += count

    # Step 4: Generate train/test splits for each fold
    splits = []
    all_indices = np.arange(n_samples)
    for fold in range(k):
        test_mask = fold_assignments == fold
        train_indices = all_indices[~test_mask]
        test_indices = all_indices[test_mask]
        if shuffle:
            np.random.shuffle(train_indices)
        splits.append((train_indices, test_indices))

    return splits


def verify_stratification(y: np.ndarray, splits: List[Tuple]) -> None:
    """Verify that splits maintain class proportions."""
    original_dist = {}
    for label in np.unique(y):
        original_dist[label] = np.mean(y == label)
    print(f"Original distribution: {original_dist}")
    print()

    for fold_idx, (train_idx, test_idx) in enumerate(splits):
        y_train, y_test = y[train_idx], y[test_idx]
        train_dist = {lbl: np.mean(y_train == lbl) for lbl in np.unique(y)}
        test_dist = {lbl: np.mean(y_test == lbl) for lbl in np.unique(y)}
        print(f"Fold {fold_idx + 1}:")
        print(f"  Train size: {len(train_idx)}, Test size: {len(test_idx)}")
        print(f"  Train dist: {train_dist}")
        print(f"  Test dist: {test_dist}")
        # Check deviation from original
        max_deviation = max(
            abs(test_dist[lbl] - original_dist[lbl]) for lbl in original_dist
        )
        print(f"  Max deviation from original: {max_deviation:.4f}")
        print()


# Demonstration
if __name__ == "__main__":
    # Create imbalanced dataset
    np.random.seed(42)
    n_majority = 900
    n_minority = 100
    X = np.random.randn(n_majority + n_minority, 10)
    y = np.array([0] * n_majority + [1] * n_minority)

    # Stratified split
    print("=" * 50)
    print("STRATIFIED K-FOLD SPLIT")
    print("=" * 50)
    splits = stratified_kfold_split(X, y, k=5, random_state=42)
    verify_stratification(y, splits)

    # Compare with random split (non-stratified)
    print("=" * 50)
    print("RANDOM (NON-STRATIFIED) SPLIT FOR COMPARISON")
    print("=" * 50)
    indices = np.random.permutation(len(y))
    fold_size = len(y) // 5
    random_splits = [
        (np.concatenate([indices[:i*fold_size], indices[(i+1)*fold_size:]]),
         indices[i*fold_size:(i+1)*fold_size])
        for i in range(5)
    ]
    verify_stratification(y, random_splits)
```

The remainder-handling logic (giving extra samples to the first few folds) is deterministic. When combined with shuffling, this ensures both reproducibility with the same seed and randomness across different seeds. Scikit-learn's implementation uses the same approach.
For multi-class problems with C classes, stratified k-fold extends naturally—we simply stratify on all C classes simultaneously. However, several nuances arise:
Challenge 1: Rare Classes with Fewer Samples Than Folds
If a class has fewer than k samples, we cannot place at least one sample in each fold. Options include reducing k, merging the rare class with a semantically related class, oversampling the rare class before splitting (taking care that duplicates do not leak across folds), or excluding it from stratification. Note that scikit-learn's StratifiedKFold raises an error when any class has fewer members than n_splits.
Challenge 2: Class Label Ties
When multiple classes have exactly the same number of samples and that number isn't divisible by k, the assignment of "extra" samples to folds can create subtle imbalances. Most implementations randomize this assignment.
Multi-Label Stratification
Multi-label problems, where each sample can belong to multiple classes simultaneously, present a harder stratification challenge. Simple per-label stratification fails because label correlations matter.
Iterative Stratification Algorithm (Sechidis et al., 2011):
This algorithm greedily assigns samples to folds while trying to balance all label proportions simultaneously: it computes a desired count of each label per fold, repeatedly selects the label with the fewest remaining unassigned examples, and places each of those examples into the fold whose current count of that label falls furthest short of its target (breaking ties by remaining fold capacity).
The algorithm prioritizes rare labels, ensuring they're well-distributed before common labels (which have more slack).
```python
import numpy as np
from typing import List


def iterative_stratification(
    y: np.ndarray,
    k: int = 5,
    random_state: int = None
) -> List[np.ndarray]:
    """
    Iterative stratification for multi-label data.

    Based on: Sechidis et al., "On the Stratification of
    Multi-Label Data" (2011)

    Parameters
    ----------
    y : np.ndarray
        Binary label matrix of shape (n_samples, n_labels)
    k : int
        Number of folds
    random_state : int
        Random seed

    Returns
    -------
    List of index arrays, one per fold
    """
    if random_state is not None:
        np.random.seed(random_state)

    n_samples, n_labels = y.shape

    # Desired samples per fold
    samples_per_fold = np.full(k, n_samples // k)
    samples_per_fold[:n_samples % k] += 1

    # Desired label counts per fold (proportional allocation)
    label_counts = y.sum(axis=0)  # Total per label
    desired_per_fold = np.zeros((k, n_labels))
    for label in range(n_labels):
        # Distribute label proportionally (cast to int so slicing works)
        base = int(label_counts[label]) // k
        remainder = int(label_counts[label]) % k
        desired_per_fold[:, label] = base
        desired_per_fold[:remainder, label] += 1

    # Track current allocations
    current_counts = np.zeros((k, n_labels))
    fold_sizes = np.zeros(k)

    # Initialize folds
    folds = [[] for _ in range(k)]
    unassigned = set(range(n_samples))

    # Process labels from rarest to most common
    while unassigned:
        # Find label with minimum total remaining samples
        remaining_per_label = {}
        for label in range(n_labels):
            count = sum(y[i, label] for i in unassigned)
            if count > 0:
                remaining_per_label[label] = count

        if not remaining_per_label:
            # Assign remaining samples (no labels) arbitrarily
            for idx in list(unassigned):
                fold = int(np.argmin(fold_sizes))
                folds[fold].append(idx)
                fold_sizes[fold] += 1
                unassigned.remove(idx)
            break

        # Select rarest label
        rarest_label = min(remaining_per_label, key=remaining_per_label.get)

        # Get samples with this label that are unassigned
        candidates = [i for i in unassigned if y[i, rarest_label] == 1]
        np.random.shuffle(candidates)

        for idx in candidates:
            # Find fold with greatest "need" for this label
            # Need = desired - current, but also respect fold size limits
            needs = desired_per_fold[:, rarest_label] - current_counts[:, rarest_label]

            # Among folds with positive need, prefer those with more space
            space = samples_per_fold - fold_sizes

            # Score combines label need and space
            scores = needs + 0.01 * space  # Small tiebreaker for space

            # Among valid folds (with space), pick highest score
            valid_folds = np.where(space > 0)[0]
            if len(valid_folds) == 0:
                valid_folds = np.arange(k)  # Fallback

            best_fold = valid_folds[np.argmax(scores[valid_folds])]

            # Assign sample to fold
            folds[best_fold].append(idx)
            fold_sizes[best_fold] += 1
            current_counts[best_fold] += y[idx]
            unassigned.remove(idx)

    return [np.array(fold) for fold in folds]


# Demonstration
if __name__ == "__main__":
    # Create multi-label dataset
    np.random.seed(42)
    n_samples = 1000
    n_labels = 5

    # Create imbalanced multi-label data
    # Label frequencies: 50%, 30%, 10%, 5%, 2%
    frequencies = [0.50, 0.30, 0.10, 0.05, 0.02]
    y = np.zeros((n_samples, n_labels))
    for label, freq in enumerate(frequencies):
        y[:, label] = np.random.random(n_samples) < freq

    print("Original label distribution:")
    print(f"  Total samples: {n_samples}")
    for label in range(n_labels):
        count = y[:, label].sum()
        print(f"  Label {label}: {count} ({count/n_samples:.1%})")
    print()

    # Apply iterative stratification
    folds = iterative_stratification(y, k=5, random_state=42)

    print("Fold-wise label distribution:")
    for fold_idx, fold_indices in enumerate(folds):
        y_fold = y[fold_indices]
        print(f"  Fold {fold_idx + 1} (n={len(fold_indices)}):")
        for label in range(n_labels):
            count = y_fold[:, label].sum()
            pct = count / len(fold_indices)
            orig_pct = frequencies[label]
            diff = abs(pct - orig_pct)
            print(f"    Label {label}: {int(count):3d} ({pct:.1%}) "
                  f"[deviation: {diff:.1%}]")
```

Stratification isn't always necessary. Understanding when it provides significant benefit helps you make informed decisions.
Essential: High-Impact Scenarios
Stratification matters most when classes are imbalanced, when the dataset is small, when rare classes would otherwise vanish from some folds entirely, or when you report metrics that are sensitive to class proportions (minority-class recall, F1, PR-AUC).
Optional: Lower-Impact Scenarios
With large, roughly balanced datasets, random folds already approximate the overall class distribution closely, so stratification changes little (though it still does no harm).
In practice, always use stratified k-fold for classification problems unless you have a specific reason not to. The computational overhead is negligible, and the benefits for imbalanced data are substantial. Scikit-learn's StratifiedKFold should be your default cross-validation strategy.
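As a concrete illustration of that advice (a sketch on synthetic 90/10 data), plain KFold lets the minority count vary across folds, while StratifiedKFold pins it at exactly one fifth of the minority class per fold:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

y = np.array([0] * 900 + [1] * 100)   # 90/10 imbalance
X = np.zeros((1000, 1))               # features don't affect the split

kfold_counts = [int(y[test].sum())
                for _, test in KFold(5, shuffle=True, random_state=0).split(X, y)]
strat_counts = [int(y[test].sum())
                for _, test in StratifiedKFold(5, shuffle=True, random_state=0).split(X, y)]

print("KFold minority per fold:          ", kfold_counts)
print("StratifiedKFold minority per fold:", strat_counts)  # exactly 20 each
```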
Stratification for Regression: Target Binning
While regression lacks discrete classes, you can create pseudo-classes by binning the target variable:
```python
from sklearn.model_selection import StratifiedKFold
import pandas as pd

# Bin continuous target into quintiles
y_binned = pd.qcut(y, q=5, labels=False)

# Use binned labels for stratification
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y_binned):
    # Use original y for training/evaluation
    X_train, y_train = X[train_idx], y[train_idx]  # Original y!
    X_test, y_test = X[test_idx], y[test_idx]
```
This ensures each fold has a similar distribution of target values, which is especially useful for skewed regression targets.
Let's examine production-quality patterns for using stratified k-fold in real ML pipelines.
Pattern 1: Standard Classification Pipeline
```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import make_scorer, f1_score, recall_score, precision_score
from typing import Dict, Any


def create_evaluation_pipeline(
    model,
    X: np.ndarray,
    y: np.ndarray,
    n_splits: int = 5,
    random_state: int = 42
) -> Dict[str, Any]:
    """
    Comprehensive stratified CV evaluation pipeline.

    Returns multiple metrics with confidence intervals.
    """
    # Define stratified cross-validator
    cv = StratifiedKFold(
        n_splits=n_splits,
        shuffle=True,
        random_state=random_state
    )

    # Define multiple scoring metrics
    scoring = {
        'accuracy': 'accuracy',
        'f1': make_scorer(f1_score, average='binary'),
        'recall': make_scorer(recall_score, average='binary'),
        'precision': make_scorer(precision_score, average='binary'),
        'roc_auc': 'roc_auc',
    }

    # Create pipeline with preprocessing
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('model', model)
    ])

    # Perform cross-validation
    cv_results = cross_validate(
        pipeline, X, y,
        cv=cv,
        scoring=scoring,
        return_train_score=True,
        n_jobs=-1  # Parallelize across folds
    )

    # Compute statistics
    results = {}
    for metric in scoring.keys():
        test_scores = cv_results[f'test_{metric}']
        train_scores = cv_results[f'train_{metric}']
        results[metric] = {
            'test_mean': np.mean(test_scores),
            'test_std': np.std(test_scores),
            'test_ci_95': (
                np.mean(test_scores) - 1.96 * np.std(test_scores) / np.sqrt(n_splits),
                np.mean(test_scores) + 1.96 * np.std(test_scores) / np.sqrt(n_splits)
            ),
            'train_mean': np.mean(train_scores),
            'train_std': np.std(train_scores),
            'fold_scores': test_scores.tolist(),
            'overfit_gap': np.mean(train_scores) - np.mean(test_scores)
        }

    # Add timing information
    results['timing'] = {
        'fit_time_mean': np.mean(cv_results['fit_time']),
        'score_time_mean': np.mean(cv_results['score_time']),
    }

    return results


def print_cv_report(results: Dict[str, Any]) -> None:
    """Print formatted cross-validation report."""
    print("=" * 60)
    print("STRATIFIED K-FOLD CROSS-VALIDATION REPORT")
    print("=" * 60)
    for metric in ['accuracy', 'f1', 'recall', 'precision', 'roc_auc']:
        if metric not in results:
            continue
        r = results[metric]
        print(f"{metric.upper()}:")
        print(f"  Test: {r['test_mean']:.4f} ± {r['test_std']:.4f}")
        print(f"  95% CI: [{r['test_ci_95'][0]:.4f}, {r['test_ci_95'][1]:.4f}]")
        print(f"  Fold scores: {[f'{s:.4f}' for s in r['fold_scores']]}")
        print(f"  Overfit gap: {r['overfit_gap']:.4f}")
    print("TIMING:")
    print(f"  Avg fit time: {results['timing']['fit_time_mean']:.2f}s")
    print(f"  Avg score time: {results['timing']['score_time_mean']:.2f}s")


# Example usage
if __name__ == "__main__":
    from sklearn.datasets import make_classification

    # Create imbalanced dataset
    X, y = make_classification(
        n_samples=2000,
        n_features=20,
        n_informative=10,
        n_redundant=5,
        n_classes=2,
        weights=[0.9, 0.1],  # 90/10 class imbalance
        random_state=42
    )

    print(f"Dataset shape: {X.shape}")
    print(f"Class distribution: {np.bincount(y)}")
    print()

    model = RandomForestClassifier(
        n_estimators=100,
        max_depth=10,
        random_state=42
    )

    results = create_evaluation_pipeline(model, X, y)
    print_cv_report(results)
```

Pattern 2: Repeated Stratified K-Fold for Reduced Variance
Single k-fold CV depends on one random split. Repeated stratified k-fold runs the process multiple times with different random seeds, then averages results for more stable estimates:
```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import RepeatedStratifiedKFold


def repeated_stratified_evaluation(
    model, X, y,
    n_splits: int = 5,
    n_repeats: int = 10,
    random_state: int = 42
):
    """
    Repeated stratified k-fold for robust performance estimation.

    Total evaluations: n_splits × n_repeats
    """
    cv = RepeatedStratifiedKFold(
        n_splits=n_splits,
        n_repeats=n_repeats,
        random_state=random_state
    )

    # Track scores across all repetitions
    all_scores = []
    repeat_scores = []   # Scores grouped by repetition
    current_repeat = []
    fold_count = 0

    for train_idx, test_idx in cv.split(X, y):
        model_clone = clone(model)
        model_clone.fit(X[train_idx], y[train_idx])
        score = model_clone.score(X[test_idx], y[test_idx])

        all_scores.append(score)
        current_repeat.append(score)
        fold_count += 1

        if fold_count % n_splits == 0:
            repeat_scores.append(np.mean(current_repeat))
            current_repeat = []

    return {
        'overall_mean': np.mean(all_scores),
        'overall_std': np.std(all_scores),
        'repeat_means': repeat_scores,                # Mean per repetition
        'between_repeat_std': np.std(repeat_scores),  # Variability across repeats
        'within_repeat_std': np.mean([
            np.std(all_scores[i*n_splits:(i+1)*n_splits])
            for i in range(n_repeats)
        ]),                                           # Avg variability within repeats
        'se_of_mean': np.std(repeat_scores) / np.sqrt(n_repeats)  # Standard error
    }
```

Repeated CV lets you decompose total variance into 'between-repeat' variance (how much results change with different splits) and 'within-repeat' variance (fold-to-fold variation). High between-repeat variance suggests your single CV estimate is unreliable—the model's apparent performance is sensitive to which samples happen to be in which fold.
Even with stratified k-fold, several subtle errors can undermine your evaluation.
Forgetting to Shuffle: Without shuffle=True, consecutive samples from the same class end up in the same fold if your data is sorted by class. Always shuffle.

We've now covered the complete theory and practice of stratified k-fold cross-validation.
You now understand how to preserve class distributions across folds. But what if your samples aren't independent? Medical images from the same patient, transactions from the same user, or measurements from the same sensor violate the i.i.d. assumption in ways that stratification alone cannot address. The next page covers Group K-Fold, which handles correlated samples by ensuring no group appears in both training and test sets.