You've built a medical image classifier that achieves 95% accuracy in cross-validation. Confident in your results, you deploy it—only to find it performs at 78% on new patients. What went wrong?
The problem: your training and test sets contained images from the same patients. Your model didn't learn to recognize disease; it learned to recognize specific patients. Cross-validation gave you an overly optimistic estimate because the assumption of independent, identically distributed (i.i.d.) samples was violated.
This isn't a rare edge case: grouped data is everywhere—multiple scans per patient, multiple transactions per account, multiple reviews per author, multiple readings per sensor.
Standard cross-validation, even stratified, fails catastrophically on grouped data because it randomly assigns correlated samples to different folds. The correlation leaks information between training and test sets, creating artificially optimistic performance estimates.
Group K-Fold solves this by treating groups as atomic units—all samples from a group appear together in either training or test, never both.
Group leakage is one of the most insidious bugs in machine learning. Your code runs perfectly, metrics look great, and nothing signals a problem until deployment. Yet your model has learned spurious correlations within groups rather than generalizable patterns across groups.
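The effect is easy to reproduce. In the sketch below (synthetic data; a minimal illustration, not a benchmark), each simulated patient gets a distinctive feature signature and a coin-flip label, so the only route to high accuracy on previously seen patients is memorizing signatures:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, GroupKFold, cross_val_score

rng = np.random.default_rng(0)
n_patients, scans_each = 40, 10

# Each patient has a distinctive feature "signature" and a label that is
# random per patient -- scoring well on seen patients requires memorization.
X_parts, y, groups = [], [], []
for pid in range(n_patients):
    signature = rng.normal(0.0, 2.0, size=5)
    label = int(rng.integers(0, 2))
    X_parts.append(rng.normal(0.0, 1.0, size=(scans_each, 5)) + signature)
    y.extend([label] * scans_each)
    groups.extend([pid] * scans_each)
X, y, groups = np.vstack(X_parts), np.array(y), np.array(groups)

model = RandomForestClassifier(n_estimators=100, random_state=0)
naive = cross_val_score(model, X, y, cv=KFold(5, shuffle=True, random_state=0))
grouped = cross_val_score(model, X, y, cv=GroupKFold(5), groups=groups)
print(f"Standard 5-fold accuracy: {naive.mean():.3f}")   # inflated by leakage
print(f"Group 5-fold accuracy:    {grouped.mean():.3f}")
```

On this construction, standard k-fold scores far above chance purely by recognizing patients it has already seen, while group k-fold—which withholds whole patients—hovers near 50%.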
Before diving into algorithms, let's formally define what group structure means and why it matters.
Definition: Grouped Data
A dataset $\mathcal{D} = \{(x_i, y_i, g_i)\}_{i=1}^{n}$ is grouped when:

1. Each sample $i$ carries a group identifier $g_i$ (patient, account, device, ...).
2. Samples sharing a group are statistically dependent: they are more similar to each other than to samples from other groups.
3. Group membership explains a substantial share of the variation in features and labels.
The third point is crucial. If the group identity perfectly predicts within-group similarity (e.g., patient ID explains all patient-specific variation), then group leakage allows models to exploit this rather than learning generalizable features.
The Generalization Target
Standard ML assumes we want to generalize to new samples from the same distribution. With grouped data, we typically want to generalize to new groups, not just new samples from existing groups.
This fundamentally changes what "generalization" means:
| Domain | Sample Unit | Group Unit | Why Groups Matter |
|---|---|---|---|
| Medical imaging | Individual scan | Patient | Anatomy, imaging device, disease progression are consistent within patient |
| Fraud detection | Transaction | Account | Spending patterns, legitimate behavior, fraud patterns are account-specific |
| Sentiment analysis | Review | Author | Writing style, vocabulary, sentiment expression vary by author |
| Time series | Time point | Sensor/device | Drift, calibration, noise characteristics are device-specific |
| Recommendation | Interaction | User | User preferences, browsing patterns are user-specific |
| Manufacturing | Part measurement | Batch/lot | Machine settings, material lots create batch correlations |
Quantifying Group Correlation
The intraclass correlation coefficient (ICC) measures how much variation is explained by group membership:
$$\text{ICC} = \frac{\sigma_{\text{between}}^2}{\sigma_{\text{between}}^2 + \sigma_{\text{within}}^2}$$
In practice, you can estimate ICC using ANOVA or mixed-effects models before deciding on your CV strategy.
If you're uncertain whether group effects matter, compute ICC for your target variable. If ICC > 0.1, use group-aware cross-validation. The computational cost is identical, but the validity difference can be dramatic.
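For modestly sized, roughly balanced groups, ICC can be estimated directly from a one-way ANOVA decomposition. A minimal sketch (the helper name `icc_oneway` is illustrative, and the balanced-design approximation uses the average group size):

```python
import numpy as np

def icc_oneway(values, groups):
    """ICC(1) from one-way ANOVA sums of squares (balanced-design
    approximation using the average group size)."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    uniq = np.unique(groups)
    k, n = len(uniq), len(values)
    grand_mean = values.mean()

    ss_between = sum(
        (groups == g).sum() * (values[groups == g].mean() - grand_mean) ** 2
        for g in uniq
    )
    ss_within = sum(
        ((values[groups == g] - values[groups == g].mean()) ** 2).sum()
        for g in uniq
    )
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    m_bar = n / k  # average group size
    return (ms_between - ms_within) / (ms_between + (m_bar - 1) * ms_within)

# Strong group effect: group means far apart relative to within-group noise
rng = np.random.default_rng(0)
vals = np.concatenate([rng.normal(mu, 1.0, 30) for mu in (0, 3, 6, 9)])
grps = np.repeat([0, 1, 2, 3], 30)
print(f"ICC ≈ {icc_oneway(vals, grps):.2f}")  # close to 1 by construction
```

For unbalanced designs or covariates, a mixed-effects model (e.g. via `statsmodels`) gives a more principled estimate.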
Group K-Fold ensures that all samples from the same group are in the same fold, preventing information leakage across the train/test boundary.
Algorithm: Group K-Fold Partitioning
Input: Dataset D of n samples, group labels g, number of folds k
Output: k fold assignments where each group is entirely in one fold
1. Identify unique groups:
unique_groups ← sorted(unique(g))
G ← count(unique_groups)
2. Assign groups to folds (approximately equal group counts):
groups_per_fold ← assign_groups_to_folds(unique_groups, k)
# Various strategies: round-robin, size-balanced, etc.
3. Map samples to folds based on group assignment:
fold_assignments ← empty array of length n
For each sample i:
group = g[i]
fold_assignments[i] = fold containing group
4. Generate train/test splits:
For j = 1 to k:
test_indices ← samples where fold_assignments = j
train_indices ← samples where fold_assignments ≠ j
yield (train_indices, test_indices)
The key insight is that we partition groups, not samples. Sample-level fold sizes will vary based on group sizes, but group integrity is preserved.
```python
import numpy as np
from collections import defaultdict
from typing import Any, Dict, List, Tuple


def group_kfold_split(
    X: np.ndarray,
    y: np.ndarray,
    groups: np.ndarray,
    k: int = 5,
    balance_strategy: str = "size",  # "size" or "count"
) -> List[Tuple[np.ndarray, np.ndarray]]:
    """
    Group K-Fold cross-validation splitting.

    Parameters
    ----------
    X : np.ndarray
        Feature matrix
    y : np.ndarray
        Labels
    groups : np.ndarray
        Group identifiers for each sample
    k : int
        Number of folds
    balance_strategy : str
        "count" - equal number of groups per fold
        "size"  - balanced total samples per fold (accounts for group sizes)

    Returns
    -------
    List of (train_indices, test_indices) tuples
    """
    n_samples = len(groups)
    unique_groups = np.unique(groups)
    n_groups = len(unique_groups)

    if n_groups < k:
        raise ValueError(
            f"Cannot have more folds ({k}) than groups ({n_groups})"
        )

    # Map groups to their sample indices
    group_to_indices: Dict[Any, List[int]] = defaultdict(list)
    for idx, group in enumerate(groups):
        group_to_indices[group].append(idx)

    # Compute group sizes
    group_sizes = {g: len(indices) for g, indices in group_to_indices.items()}

    # Assign groups to folds
    if balance_strategy == "count":
        # Round-robin assignment: equal group counts per fold
        group_to_fold = {}
        for i, group in enumerate(unique_groups):
            group_to_fold[group] = i % k
    elif balance_strategy == "size":
        # Greedy bin-packing: balance total samples per fold.
        # Sort groups by size (descending) for better packing.
        sorted_groups = sorted(
            unique_groups, key=lambda g: group_sizes[g], reverse=True
        )
        fold_sizes = np.zeros(k)
        group_to_fold = {}
        for group in sorted_groups:
            # Assign to fold with smallest current size
            target_fold = int(np.argmin(fold_sizes))
            group_to_fold[group] = target_fold
            fold_sizes[target_fold] += group_sizes[group]
    else:
        raise ValueError(f"Unknown balance_strategy: {balance_strategy}")

    # Create sample-level fold assignments
    sample_to_fold = np.zeros(n_samples, dtype=int)
    for group, fold in group_to_fold.items():
        for idx in group_to_indices[group]:
            sample_to_fold[idx] = fold

    # Generate splits
    splits = []
    all_indices = np.arange(n_samples)
    for fold in range(k):
        test_mask = sample_to_fold == fold
        train_indices = all_indices[~test_mask]
        test_indices = all_indices[test_mask]
        splits.append((train_indices, test_indices))

    return splits


def verify_group_separation(
    groups: np.ndarray,
    splits: List[Tuple[np.ndarray, np.ndarray]],
) -> bool:
    """Verify that no group appears in both train and test."""
    all_valid = True
    for fold_idx, (train_idx, test_idx) in enumerate(splits):
        train_groups = set(groups[train_idx])
        test_groups = set(groups[test_idx])
        overlap = train_groups.intersection(test_groups)
        if overlap:
            print(f"LEAK: Fold {fold_idx + 1} has overlapping groups: {overlap}")
            all_valid = False
        else:
            print(f"Fold {fold_idx + 1}: ✓ No overlap")
            print(f"  Train: {len(train_idx)} samples, {len(train_groups)} groups")
            print(f"  Test:  {len(test_idx)} samples, {len(test_groups)} groups")
    return all_valid


# Demonstration
if __name__ == "__main__":
    np.random.seed(42)

    # Simulate medical imaging: multiple scans per patient
    n_patients = 50
    scans_per_patient = np.random.randint(3, 15, size=n_patients)

    # Create dataset
    groups = []
    y = []
    for patient_id, n_scans in enumerate(scans_per_patient):
        groups.extend([patient_id] * n_scans)
        # Each patient has a disease label (random for demo)
        patient_label = np.random.randint(0, 2)
        y.extend([patient_label] * n_scans)

    groups = np.array(groups)
    y = np.array(y)
    X = np.random.randn(len(y), 10)  # Dummy features

    print(f"Dataset: {len(y)} samples from {n_patients} patients")
    print(f"Scans per patient: min={scans_per_patient.min()}, "
          f"max={scans_per_patient.max()}, mean={scans_per_patient.mean():.1f}")
    print()

    # Compare balancing strategies
    print("=" * 50)
    print("COUNT-BALANCED GROUP K-FOLD")
    print("=" * 50)
    splits_count = group_kfold_split(X, y, groups, k=5, balance_strategy="count")
    verify_group_separation(groups, splits_count)
    print()

    print("=" * 50)
    print("SIZE-BALANCED GROUP K-FOLD")
    print("=" * 50)
    splits_size = group_kfold_split(X, y, groups, k=5, balance_strategy="size")
    verify_group_separation(groups, splits_size)
```

Unlike standard k-fold, Group K-Fold often produces unequal fold sizes because groups have different numbers of samples. The "size" balancing strategy mitigates this but cannot eliminate it entirely. Accept this variance as the price of valid evaluation—biased test sets are worse than unequal test sizes.
How groups are assigned to folds significantly impacts fold balance and evaluation quality. Let's examine the main strategies:
Strategy 1: Count-Based Assignment (Round-Robin)
Assign groups to folds in rotation, giving each fold approximately equal numbers of groups:
Groups (sample counts): A(100), B(10), C(50), D(20), E(80)
Naive contiguous split: F1=[A,B] F2=[C] F3=[D] F4=[E] F5=[] ← empty fold, bad!
Round-robin: F1=[A] F2=[B] F3=[C] F4=[D] F5=[E] ← one group per fold, but fold sizes are 100/10/50/20/80
Strategy 2: Size-Balanced Assignment (Greedy Bin-Packing)
Assign groups greedily—largest group first, each into the fold with the fewest total samples so far—to minimize variance in fold sizes (total samples per fold).
Strategy 3: Random Assignment
Shuffle the group order randomly, then assign round-robin. Repeating with different seeds supports repeated CV and avoids systematic bias.
```python
import numpy as np
from typing import Dict


def compare_assignment_strategies(
    group_sizes: Dict[str, int],
    k: int = 5,
) -> None:
    """Compare different group assignment strategies."""
    groups = list(group_sizes.keys())
    sizes = np.array([group_sizes[g] for g in groups])
    total_samples = sizes.sum()
    ideal_fold_size = total_samples / k

    print(f"Total: {total_samples} samples, {len(groups)} groups")
    print(f"Ideal fold size: {ideal_fold_size:.1f}")
    print()

    def evaluate_assignment(name: str, assignment: Dict[str, int]) -> None:
        fold_sizes = {i: 0 for i in range(k)}
        for group, fold in assignment.items():
            fold_sizes[fold] += group_sizes[group]
        sizes_list = list(fold_sizes.values())
        cv = np.std(sizes_list) / np.mean(sizes_list)  # Coefficient of variation
        print(f"{name}:")
        print(f"  Fold sizes: {sizes_list}")
        print(f"  Range: {min(sizes_list)} - {max(sizes_list)}")
        print(f"  CV: {cv:.3f} (lower is better)")
        print()

    # Strategy 1: Count-balanced (round-robin)
    count_assignment = {g: i % k for i, g in enumerate(groups)}
    evaluate_assignment("Count-balanced (round-robin)", count_assignment)

    # Strategy 2: Size-balanced (greedy)
    sorted_groups = sorted(groups, key=lambda g: group_sizes[g], reverse=True)
    fold_totals = np.zeros(k)
    size_assignment = {}
    for group in sorted_groups:
        target = int(np.argmin(fold_totals))
        size_assignment[group] = target
        fold_totals[target] += group_sizes[group]
    evaluate_assignment("Size-balanced (greedy)", size_assignment)

    # Strategy 3: Random
    np.random.seed(42)
    shuffled = np.random.permutation(groups)
    random_assignment = {g: i % k for i, g in enumerate(shuffled)}
    evaluate_assignment("Random assignment", random_assignment)


# Example with realistic group size distribution
if __name__ == "__main__":
    # Simulate patients with varying scan counts
    np.random.seed(42)
    n_patients = 25
    # Power-law-ish distribution: some patients have many scans
    group_sizes = {}
    for i in range(n_patients):
        size = int(np.random.pareto(1.5) * 10 + 5)  # Roughly 5-100
        group_sizes[f"P{i:02d}"] = size

    print("Group sizes (sorted):")
    sorted_sizes = sorted(group_sizes.items(), key=lambda x: -x[1])
    for g, s in sorted_sizes[:5]:
        print(f"  {g}: {s}")
    print(f"  ... ({len(group_sizes) - 5} more groups)")
    print()

    compare_assignment_strategies(group_sizes, k=5)
```

| Strategy | Fold Size Variance | Complexity | Best For |
|---|---|---|---|
| Round-robin | High | O(G) | Equal group sizes, quick experiments |
| Size-balanced | Low | O(G log G) | Unequal group sizes, production evaluation |
| Random | Medium-High | O(G) | Avoiding systematic bias, repeated CV |
Understanding the statistical properties of Group K-Fold helps you interpret results and choose appropriate aggregation methods.
The Effective Sample Size Problem
With grouped data, the effective sample size for variance estimation is closer to the number of groups G than the number of samples n. This is because samples within groups are correlated and don't provide fully independent information.
The effective sample size under equicorrelated data is:
$$n_{\text{eff}} = \frac{n}{1 + (m - 1) \cdot \text{ICC}}$$
where $m$ is the average group size and ICC is the intraclass correlation coefficient.
Example: $n = 1000$ samples in groups of average size $m = 20$, with $\text{ICC} = 0.6$:
$$n_{\text{eff}} = \frac{1000}{1 + (20 - 1) \cdot 0.6} = \frac{1000}{12.4} \approx 81$$
Your effective sample size is 81, not 1000! Standard error formulas using n will dramatically underestimate uncertainty.
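The formula is trivial to encode; a quick sanity check against the worked example above (the helper name is illustrative):

```python
def effective_sample_size(n: int, mean_group_size: float, icc: float) -> float:
    """n_eff = n / (1 + (m - 1) * ICC) under an equicorrelation model."""
    return n / (1 + (mean_group_size - 1) * icc)

# The worked example: 1000 samples, groups of 20, ICC = 0.6
print(f"{effective_sample_size(1000, 20, 0.6):.0f}")  # ≈ 81
```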
Variance of Cross-Validation Estimator
The variance of the CV performance estimate under group structure has two components:
$$\text{Var}(\hat{\theta}_{\text{CV}}) = \underbrace{\text{Var}_{\text{between-group}}}_{\text{captured by group CV}} + \underbrace{\text{Var}_{\text{within-group}}}_{\text{ignored by group CV}}$$
Standard CV conflates these. Group CV isolates between-group variance, which is the relevant uncertainty for new-group generalization. However, it ignores within-group variance, potentially underestimating uncertainty if within-group samples are very different from each other.
Standard formulas like SE = std(fold_scores) / sqrt(k) assume independent folds. With group CV, test folds contain disjoint groups, but fold scores still share overlapping training data, and the effective sample size per fold varies wildly. For reliable confidence intervals, use bootstrap methods that resample at the group level.
Impact on Model Selection
When comparing models A and B, group structure affects the correlation between their performance estimates. If both models are evaluated on the same group-based folds, their performance differences are less variable than if evaluated on independent samples.
This has two implications:
Paired tests are appropriate: Use paired t-tests or Wilcoxon signed-rank tests comparing model performances fold-by-fold, not unpaired tests.
Correlation can inflate or deflate detected differences: If both models struggle on the same groups (high correlation), the variance of the difference is small, making small differences significant. If they fail on different groups (low correlation), the variance is large, requiring larger differences for significance.
```python
import numpy as np
from scipy import stats
from typing import List, Tuple


def group_bootstrap_ci(
    group_scores: List[float],
    n_bootstrap: int = 10000,
    confidence: float = 0.95,
    random_state: int = 42,
) -> Tuple[float, float, float]:
    """
    Bootstrap confidence interval resampling at group level.

    Parameters
    ----------
    group_scores : List[float]
        Performance metric for each group (or fold, if one group per fold)
    n_bootstrap : int
        Number of bootstrap samples
    confidence : float
        Confidence level
    random_state : int
        Random seed

    Returns
    -------
    (mean, ci_lower, ci_upper)
    """
    np.random.seed(random_state)
    n_groups = len(group_scores)
    scores_array = np.array(group_scores)

    # Bootstrap: resample groups with replacement
    bootstrap_means = []
    for _ in range(n_bootstrap):
        resampled_indices = np.random.randint(0, n_groups, size=n_groups)
        bootstrap_means.append(np.mean(scores_array[resampled_indices]))

    bootstrap_means = np.array(bootstrap_means)
    alpha = 1 - confidence
    ci_lower = np.percentile(bootstrap_means, 100 * alpha / 2)
    ci_upper = np.percentile(bootstrap_means, 100 * (1 - alpha / 2))

    return np.mean(scores_array), ci_lower, ci_upper


def paired_model_comparison(
    model_a_scores: List[float],
    model_b_scores: List[float],
    alpha: float = 0.05,
) -> dict:
    """
    Statistical comparison of two models evaluated on same group splits.

    Returns dictionary with test results.
    """
    differences = np.array(model_a_scores) - np.array(model_b_scores)

    # Paired t-test
    t_stat, t_pval = stats.ttest_rel(model_a_scores, model_b_scores)

    # Wilcoxon signed-rank test (nonparametric alternative)
    try:
        w_stat, w_pval = stats.wilcoxon(differences)
    except ValueError:
        # All differences are zero
        w_stat, w_pval = 0, 1.0

    # Effect size (Cohen's d for paired samples)
    cohens_d = np.mean(differences) / np.std(differences, ddof=1)

    return {
        "mean_a": np.mean(model_a_scores),
        "mean_b": np.mean(model_b_scores),
        "mean_difference": np.mean(differences),
        "std_difference": np.std(differences, ddof=1),
        "t_statistic": t_stat,
        "t_pvalue": t_pval,
        "wilcoxon_pvalue": w_pval,
        "cohens_d": cohens_d,
        "significant_t": t_pval < alpha,
        "significant_wilcoxon": w_pval < alpha,
    }


# Demonstration
if __name__ == "__main__":
    np.random.seed(42)

    # Simulate 5-fold group CV results for two models:
    # Model A: ~85% mean accuracy, Model B: ~83% mean accuracy
    true_diff = 0.02  # 2% true advantage for A
    model_a_scores = np.random.normal(0.85, 0.03, size=5)
    model_b_scores = model_a_scores - true_diff + np.random.normal(0, 0.01, size=5)

    print("Fold-wise scores:")
    for i, (a, b) in enumerate(zip(model_a_scores, model_b_scores)):
        print(f"  Fold {i+1}: A={a:.3f}, B={b:.3f}, diff={a-b:+.3f}")
    print()

    # Bootstrap CI for each model
    mean_a, ci_low, ci_high = group_bootstrap_ci(model_a_scores.tolist())
    print(f"Model A: {mean_a:.3f} (95% CI: [{ci_low:.3f}, {ci_high:.3f}])")
    mean_b, ci_low, ci_high = group_bootstrap_ci(model_b_scores.tolist())
    print(f"Model B: {mean_b:.3f} (95% CI: [{ci_low:.3f}, {ci_high:.3f}])")
    print()

    # Model comparison
    comparison = paired_model_comparison(
        model_a_scores.tolist(), model_b_scores.tolist()
    )
    print("Paired comparison:")
    print(f"  Mean difference: {comparison['mean_difference']:.3f}")
    print(f"  Cohen's d: {comparison['cohens_d']:.2f}")
    print(f"  t-test p-value: {comparison['t_pvalue']:.4f}")
    print(f"  Significant at α=0.05: {comparison['significant_t']}")
```

Scikit-learn provides `GroupKFold` for production use. Let's examine best practices for integrating it into real ML pipelines.
```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold, cross_validate
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import make_scorer, f1_score
from typing import Dict, Any, Optional


class GroupCVEvaluator:
    """
    Production-ready group cross-validation evaluator.

    Handles:
    - Multiple metrics
    - Preprocessing within folds
    - Group-level statistics
    - Proper confidence intervals
    """

    def __init__(
        self,
        n_splits: int = 5,
        metrics: Optional[Dict[str, Any]] = None,
    ):
        self.n_splits = n_splits
        self.metrics = metrics or {
            'accuracy': 'accuracy',
            'f1': make_scorer(f1_score, average='binary'),
            'roc_auc': 'roc_auc',
        }
        self.group_cv = GroupKFold(n_splits=n_splits)

    def evaluate(
        self,
        model,
        X: np.ndarray,
        y: np.ndarray,
        groups: np.ndarray,
        preprocess: bool = True,
    ) -> Dict[str, Any]:
        """
        Perform group-aware cross-validation.

        Returns detailed evaluation results.
        """
        # Create pipeline with optional preprocessing
        if preprocess:
            pipeline = Pipeline([
                ('scaler', StandardScaler()),
                ('model', model),
            ])
        else:
            pipeline = model

        # Group-level statistics
        unique_groups = np.unique(groups)
        n_groups = len(unique_groups)
        group_sizes = pd.Series(groups).value_counts()

        # Perform cross-validation
        cv_results = cross_validate(
            pipeline, X, y,
            cv=self.group_cv,
            groups=groups,
            scoring=self.metrics,
            return_train_score=True,
            return_estimator=False,
            n_jobs=-1,
        )

        # Analyze fold composition
        fold_info = []
        for fold_idx, (train_idx, test_idx) in enumerate(
            self.group_cv.split(X, y, groups)
        ):
            train_groups = np.unique(groups[train_idx])
            test_groups = np.unique(groups[test_idx])
            fold_info.append({
                'fold': fold_idx + 1,
                'n_train_samples': len(train_idx),
                'n_test_samples': len(test_idx),
                'n_train_groups': len(train_groups),
                'n_test_groups': len(test_groups),
                'test_class_dist': dict(
                    zip(*np.unique(y[test_idx], return_counts=True))
                ),
            })

        # Compile results
        results = {
            'n_samples': len(y),
            'n_groups': n_groups,
            'group_size_stats': {
                'min': int(group_sizes.min()),
                'max': int(group_sizes.max()),
                'mean': float(group_sizes.mean()),
                'std': float(group_sizes.std()),
            },
            'fold_info': fold_info,
            'metrics': {},
        }

        # Process each metric
        for metric in self.metrics.keys():
            test_scores = cv_results[f'test_{metric}']
            train_scores = cv_results[f'train_{metric}']

            # Bootstrap CI over fold scores
            n_bootstrap = 10000
            np.random.seed(42)
            bootstrap_means = [
                np.mean(test_scores[
                    np.random.randint(0, len(test_scores), len(test_scores))
                ])
                for _ in range(n_bootstrap)
            ]

            results['metrics'][metric] = {
                'test_mean': float(np.mean(test_scores)),
                'test_std': float(np.std(test_scores)),
                'test_fold_scores': test_scores.tolist(),
                'ci_95': (
                    float(np.percentile(bootstrap_means, 2.5)),
                    float(np.percentile(bootstrap_means, 97.5)),
                ),
                'train_mean': float(np.mean(train_scores)),
                'overfit_gap': float(
                    np.mean(train_scores) - np.mean(test_scores)
                ),
            }

        return results

    def print_report(self, results: Dict[str, Any]) -> None:
        """Print formatted evaluation report."""
        print("=" * 70)
        print("GROUP K-FOLD CROSS-VALIDATION REPORT")
        print("=" * 70)
        print(f"Dataset: {results['n_samples']} samples, "
              f"{results['n_groups']} groups")
        gs = results['group_size_stats']
        print(f"Group sizes: {gs['min']}-{gs['max']} "
              f"(mean: {gs['mean']:.1f} ± {gs['std']:.1f})")
        print("Fold Details:")
        for fi in results['fold_info']:
            print(f"  Fold {fi['fold']}: {fi['n_test_samples']} test samples "
                  f"({fi['n_test_groups']} groups), "
                  f"class dist: {fi['test_class_dist']}")
        print("Metrics:")
        for metric, data in results['metrics'].items():
            print(f"  {metric.upper()}:")
            print(f"    Test: {data['test_mean']:.4f} ± {data['test_std']:.4f}")
            print(f"    95% CI: [{data['ci_95'][0]:.4f}, {data['ci_95'][1]:.4f}]")
            print(f"    Fold scores: {[f'{s:.3f}' for s in data['test_fold_scores']]}")
            print(f"    Overfit gap: {data['overfit_gap']:.4f}")


# Full demonstration
if __name__ == "__main__":
    # Create grouped classification dataset
    np.random.seed(42)
    n_groups = 40
    samples_per_group = np.random.randint(10, 50, size=n_groups)

    X_list, y_list, groups_list = [], [], []
    for group_id, n_samples in enumerate(samples_per_group):
        # Create features with group-specific bias
        group_mean = np.random.randn(10) * 0.5  # Group-specific effect
        X_group = np.random.randn(n_samples, 10) + group_mean
        # Label based on group (with noise)
        group_label_prob = np.random.uniform(0.2, 0.8)
        y_group = (np.random.random(n_samples) < group_label_prob).astype(int)
        X_list.append(X_group)
        y_list.append(y_group)
        groups_list.extend([group_id] * n_samples)

    X = np.vstack(X_list)
    y = np.concatenate(y_list)
    groups = np.array(groups_list)

    print(f"Created dataset: {X.shape[0]} samples, {n_groups} groups")
    print(f"Class distribution: {np.bincount(y)}")
    print()

    # Evaluate
    evaluator = GroupCVEvaluator(n_splits=5)
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    results = evaluator.evaluate(model, X, y, groups)
    evaluator.print_report(results)
```

Always pass the `groups` parameter explicitly to `cross_validate()` and `cross_val_score()`. Current scikit-learn raises an error if `GroupKFold` is used without `groups`, but passing a plain `cv=5` instead of a `GroupKFold` instance silently falls back to ungrouped k-fold—no warning, no error. This silent failure is a common source of bugs.
Group K-Fold is straightforward in concept but tricky in practice. Here are the most common failure modes:
Passing `cv=5` instead of a `GroupKFold` instance (or forgetting `groups=`) silently falls back to ungrouped behavior. Always double-check.

Debugging Group Leakage
When you suspect group leakage, use these diagnostic checks:
```python
def check_for_leakage(groups, train_idx, test_idx):
    """Verify no group overlap between train and test."""
    train_groups = set(groups[train_idx])
    test_groups = set(groups[test_idx])
    overlap = train_groups & test_groups
    if overlap:
        raise ValueError(f"LEAKAGE DETECTED! Overlapping groups: {overlap}")
    return True

# Verify all folds
for fold, (train_idx, test_idx) in enumerate(cv.split(X, y, groups)):
    check_for_leakage(groups, train_idx, test_idx)
    print(f"Fold {fold}: No leakage ✓")
```
Standard Group K-Fold handles one layer of grouping. But real data often has more complex structure:
Hierarchical Groups
In multi-level hierarchies (students within classrooms within schools), grouping at the lowest level (students) doesn't prevent leakage from shared classroom or school effects.
Solution: Group at the highest relevant level. For school effects, use school as the group unit, even though it means fewer groups and more variance in fold sizes.
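A sketch of the idea with synthetic, hypothetical school/classroom IDs: because classrooms are nested inside schools, grouping on the school variable automatically keeps classrooms intact too.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n = 300
# Hypothetical two-level hierarchy: classrooms nested in 10 schools
school = rng.integers(0, 10, size=n)
classroom = school * 10 + rng.integers(0, 3, size=n)  # 3 classrooms/school
X = rng.normal(size=(n, 4))

# Group at the HIGHEST relevant level (school): no school -- and therefore
# no nested classroom -- ever straddles the train/test boundary.
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, groups=school):
    assert not set(school[train_idx]) & set(school[test_idx])
    assert not set(classroom[train_idx]) & set(classroom[test_idx])
print("All 5 folds keep schools (and their classrooms) intact")
```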
Multiple Grouping Variables
Sometimes samples belong to multiple groups that shouldn't be split. Medical images might have both patient ID and imaging device relationships.
Solution: Create composite group IDs: group = f"{patient_id}_{device_id}". Note, however, that a composite ID only keeps together samples that share both identifiers; if the groupings overlap (one patient scanned on several devices), that patient's samples can still be split across folds, and a stricter merge of linked groups is needed.
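When the two groupings overlap, a fully conservative option is to merge any groups linked by a shared patient or device—i.e., take connected components of the patient-device graph. A sketch using scipy (the helper `merged_group_ids` and the toy arrays are illustrative):

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def merged_group_ids(patients, devices):
    """Connected components of the bipartite patient-device graph:
    two samples share a merged group if linked through ANY chain of
    shared patients or devices."""
    p_codes, p_idx = np.unique(patients, return_inverse=True)
    d_codes, d_idx = np.unique(devices, return_inverse=True)
    n_p, n_d = len(p_codes), len(d_codes)
    # Nodes 0..n_p-1 are patients; nodes n_p..n_p+n_d-1 are devices
    rows = np.concatenate([p_idx, n_p + d_idx])
    cols = np.concatenate([n_p + d_idx, p_idx])
    data = np.ones(len(rows))
    adj = coo_matrix((data, (rows, cols)), shape=(n_p + n_d, n_p + n_d))
    _, labels = connected_components(adj, directed=False)
    return labels[p_idx]  # component of each sample's patient node

patients = np.array(["p1", "p1", "p2", "p2", "p3"])
devices  = np.array(["d1", "d2", "d2", "d3", "d4"])
# p1 and p2 are linked through d2, so they merge; p3 stands alone
print(merged_group_ids(patients, devices))
```

This guarantees neither grouping is ever split, at the cost of fewer, larger groups—in pathological cases a single giant component, at which point group CV is no longer possible and you must reconsider which grouping truly matters.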
Time-Based Groups with Samples
Grouped time series (multiple time points per entity) require special handling covered in Module 4.
Stratification + Groups
Sometimes you need both stratified class proportions AND group separation. This is covered in the "Grouped Stratification" page later in this module.
When uncertain about group structure, be conservative: group at higher levels, use fewer folds, and assume more correlation than less. Overly conservative evaluation produces higher variance estimates but valid conclusions. Insufficiently conservative evaluation produces optimistically biased estimates and deployment disasters.
We've comprehensively covered Group K-Fold cross-validation—a critical technique for valid evaluation on non-i.i.d. data. The essential takeaways:

- Grouped data violates the i.i.d. assumption; standard CV leaks information across the train/test boundary and inflates performance estimates.
- Group K-Fold partitions groups, not samples: every sample from a group lands on one side of the split.
- When group sizes vary, size-balanced (greedy) assignment keeps fold sizes far more even than round-robin.
- The effective sample size is closer to the number of groups than the number of samples; use group-level bootstrap for confidence intervals and paired tests for model comparison.
- Always pass `groups=` explicitly and verify group separation programmatically.
Group K-Fold handles grouped samples but uses multiple groups per fold. Sometimes you need to leave exactly one group out for testing—especially for small group counts or when each group represents a critical evaluation scenario. The next page covers Leave-One-Group-Out (LOGO) cross-validation, its use cases, and implementation.