You've built a medical image classifier that achieves 95% accuracy in cross-validation. Confident in your results, you deploy it—only to find it performs at 78% on new patients. What went wrong?
The problem: your training and test sets contained images from the same patients. Your model didn't learn to recognize disease; it learned to recognize specific patients. Cross-validation gave you an overly optimistic estimate because the assumption of independent, identically distributed (i.i.d.) samples was violated.
This isn't a rare edge case: grouped data is everywhere—multiple scans per patient, multiple transactions per account, multiple reviews per author, multiple readings per sensor.
Standard cross-validation, even stratified, fails catastrophically on grouped data because it randomly assigns correlated samples to different folds. The correlation leaks information between training and test sets, creating artificially optimistic performance estimates.
Group K-Fold solves this by treating groups as atomic units—all samples from a group appear together in either training or test, never both.
Group leakage is one of the most insidious bugs in machine learning. Your code runs perfectly, metrics look great, and nothing signals a problem until deployment. Yet your model has learned spurious correlations within groups rather than generalizable patterns across groups.
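The effect is easy to reproduce. In the sketch below (synthetic data; a minimal illustration, not a benchmark), each simulated patient gets a distinctive feature signature and a coin-flip label, so the only route to high accuracy on previously seen patients is memorizing signatures:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, GroupKFold, cross_val_score

rng = np.random.default_rng(0)
n_patients, scans_each = 40, 10

# Each patient has a distinctive feature "signature" and a label that is
# random per patient -- scoring well on seen patients requires memorization.
X_parts, y, groups = [], [], []
for pid in range(n_patients):
    signature = rng.normal(0.0, 2.0, size=5)
    label = int(rng.integers(0, 2))
    X_parts.append(rng.normal(0.0, 1.0, size=(scans_each, 5)) + signature)
    y.extend([label] * scans_each)
    groups.extend([pid] * scans_each)
X, y, groups = np.vstack(X_parts), np.array(y), np.array(groups)

model = RandomForestClassifier(n_estimators=100, random_state=0)
naive = cross_val_score(model, X, y, cv=KFold(5, shuffle=True, random_state=0))
grouped = cross_val_score(model, X, y, cv=GroupKFold(5), groups=groups)
print(f"Standard 5-fold accuracy: {naive.mean():.3f}")   # inflated by leakage
print(f"Group 5-fold accuracy:    {grouped.mean():.3f}")
```

On this construction, standard k-fold scores far above chance purely by recognizing patients it has already seen, while group k-fold—which withholds whole patients—hovers near 50%.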
Before diving into algorithms, let's formally define what group structure means and why it matters.
Definition: Grouped Data
A dataset $\mathcal{D} = \{(x_i, y_i, g_i)\}_{i=1}^{n}$ is grouped when:

1. Each sample $i$ carries a group identifier $g_i$ (patient, account, device, ...).
2. Samples sharing a group are statistically dependent: they are more similar to each other than to samples from other groups.
3. Group membership explains a substantial share of the variation in features and labels.
The third point is crucial. If the group identity perfectly predicts within-group similarity (e.g., patient ID explains all patient-specific variation), then group leakage allows models to exploit this rather than learning generalizable features.
The Generalization Target
Standard ML assumes we want to generalize to new samples from the same distribution. With grouped data, we typically want to generalize to new groups, not just new samples from existing groups.
This fundamentally changes what "generalization" means:
| Domain | Sample Unit | Group Unit | Why Groups Matter |
|---|---|---|---|
| Medical imaging | Individual scan | Patient | Anatomy, imaging device, disease progression are consistent within patient |
| Fraud detection | Transaction | Account | Spending patterns, legitimate behavior, fraud patterns are account-specific |
| Sentiment analysis | Review | Author | Writing style, vocabulary, sentiment expression vary by author |
| Time series | Time point | Sensor/device | Drift, calibration, noise characteristics are device-specific |
| Recommendation | Interaction | User | User preferences, browsing patterns are user-specific |
| Manufacturing | Part measurement | Batch/lot | Machine settings, material lots create batch correlations |
Quantifying Group Correlation
The intraclass correlation coefficient (ICC) measures how much variation is explained by group membership:
$$\text{ICC} = \frac{\sigma_{\text{between}}^2}{\sigma_{\text{between}}^2 + \sigma_{\text{within}}^2}$$
In practice, you can estimate ICC using ANOVA or mixed-effects models before deciding on your CV strategy.
If you're uncertain whether group effects matter, compute ICC for your target variable. If ICC > 0.1, use group-aware cross-validation. The computational cost is identical, but the validity difference can be dramatic.
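For modestly sized, roughly balanced groups, ICC can be estimated directly from a one-way ANOVA decomposition. A minimal sketch (the helper name `icc_oneway` is illustrative, and the balanced-design approximation uses the average group size):

```python
import numpy as np

def icc_oneway(values, groups):
    """ICC(1) from one-way ANOVA sums of squares (balanced-design
    approximation using the average group size)."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    uniq = np.unique(groups)
    k, n = len(uniq), len(values)
    grand_mean = values.mean()

    ss_between = sum(
        (groups == g).sum() * (values[groups == g].mean() - grand_mean) ** 2
        for g in uniq
    )
    ss_within = sum(
        ((values[groups == g] - values[groups == g].mean()) ** 2).sum()
        for g in uniq
    )
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    m_bar = n / k  # average group size
    return (ms_between - ms_within) / (ms_between + (m_bar - 1) * ms_within)

# Strong group effect: group means far apart relative to within-group noise
rng = np.random.default_rng(0)
vals = np.concatenate([rng.normal(mu, 1.0, 30) for mu in (0, 3, 6, 9)])
grps = np.repeat([0, 1, 2, 3], 30)
print(f"ICC ≈ {icc_oneway(vals, grps):.2f}")  # close to 1 by construction
```

For unbalanced designs or covariates, a mixed-effects model (e.g. via `statsmodels`) gives a more principled estimate.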
Group K-Fold ensures that all samples from the same group are in the same fold, preventing information leakage across the train/test boundary.
Algorithm: Group K-Fold Partitioning
Input: Dataset D of n samples, group labels g, number of folds k
Output: k fold assignments where each group is entirely in one fold
1. Identify unique groups:
unique_groups ← sorted(unique(g))
G ← count(unique_groups)
2. Assign groups to folds (approximately equal group counts):
groups_per_fold ← assign_groups_to_folds(unique_groups, k)
# Various strategies: round-robin, size-balanced, etc.
3. Map samples to folds based on group assignment:
fold_assignments ← empty array of length n
For each sample i:
group = g[i]
fold_assignments[i] = fold containing group
4. Generate train/test splits:
For j = 1 to k:
test_indices ← samples where fold_assignments = j
train_indices ← samples where fold_assignments ≠ j
yield (train_indices, test_indices)
The key insight is that we partition groups, not samples. Sample-level fold sizes will vary based on group sizes, but group integrity is preserved.
```python
import numpy as np
from collections import defaultdict
from typing import Any, Dict, List, Tuple


def group_kfold_split(
    X: np.ndarray,
    y: np.ndarray,
    groups: np.ndarray,
    k: int = 5,
    balance_strategy: str = "size",  # "size" or "count"
) -> List[Tuple[np.ndarray, np.ndarray]]:
    """
    Group K-Fold cross-validation splitting.

    Parameters
    ----------
    X : np.ndarray
        Feature matrix
    y : np.ndarray
        Labels
    groups : np.ndarray
        Group identifiers for each sample
    k : int
        Number of folds
    balance_strategy : str
        "count" - equal number of groups per fold
        "size"  - balanced total samples per fold (accounts for group sizes)

    Returns
    -------
    List of (train_indices, test_indices) tuples
    """
    n_samples = len(groups)
    unique_groups = np.unique(groups)
    n_groups = len(unique_groups)

    if n_groups < k:
        raise ValueError(
            f"Cannot have more folds ({k}) than groups ({n_groups})"
        )

    # Map groups to their sample indices
    group_to_indices: Dict[Any, List[int]] = defaultdict(list)
    for idx, group in enumerate(groups):
        group_to_indices[group].append(idx)

    # Compute group sizes
    group_sizes = {g: len(indices) for g, indices in group_to_indices.items()}

    # Assign groups to folds
    if balance_strategy == "count":
        # Round-robin assignment: equal group counts per fold
        group_to_fold = {}
        for i, group in enumerate(unique_groups):
            group_to_fold[group] = i % k
    elif balance_strategy == "size":
        # Greedy bin-packing: balance total samples per fold.
        # Sort groups by size (descending) for better packing.
        sorted_groups = sorted(
            unique_groups, key=lambda g: group_sizes[g], reverse=True
        )
        fold_sizes = np.zeros(k)
        group_to_fold = {}
        for group in sorted_groups:
            # Assign to fold with smallest current size
            target_fold = int(np.argmin(fold_sizes))
            group_to_fold[group] = target_fold
            fold_sizes[target_fold] += group_sizes[group]
    else:
        raise ValueError(f"Unknown balance_strategy: {balance_strategy}")

    # Create sample-level fold assignments
    sample_to_fold = np.zeros(n_samples, dtype=int)
    for group, fold in group_to_fold.items():
        for idx in group_to_indices[group]:
            sample_to_fold[idx] = fold

    # Generate splits
    splits = []
    all_indices = np.arange(n_samples)
    for fold in range(k):
        test_mask = sample_to_fold == fold
        train_indices = all_indices[~test_mask]
        test_indices = all_indices[test_mask]
        splits.append((train_indices, test_indices))

    return splits


def verify_group_separation(
    groups: np.ndarray,
    splits: List[Tuple[np.ndarray, np.ndarray]],
) -> bool:
    """Verify that no group appears in both train and test."""
    all_valid = True
    for fold_idx, (train_idx, test_idx) in enumerate(splits):
        train_groups = set(groups[train_idx])
        test_groups = set(groups[test_idx])
        overlap = train_groups.intersection(test_groups)
        if overlap:
            print(f"LEAK: Fold {fold_idx + 1} has overlapping groups: {overlap}")
            all_valid = False
        else:
            print(f"Fold {fold_idx + 1}: ✓ No overlap")
            print(f"  Train: {len(train_idx)} samples, {len(train_groups)} groups")
            print(f"  Test:  {len(test_idx)} samples, {len(test_groups)} groups")
    return all_valid


# Demonstration
if __name__ == "__main__":
    np.random.seed(42)

    # Simulate medical imaging: multiple scans per patient
    n_patients = 50
    scans_per_patient = np.random.randint(3, 15, size=n_patients)

    # Create dataset
    groups = []
    y = []
    for patient_id, n_scans in enumerate(scans_per_patient):
        groups.extend([patient_id] * n_scans)
        # Each patient has a disease label (random for demo)
        patient_label = np.random.randint(0, 2)
        y.extend([patient_label] * n_scans)

    groups = np.array(groups)
    y = np.array(y)
    X = np.random.randn(len(y), 10)  # Dummy features

    print(f"Dataset: {len(y)} samples from {n_patients} patients")
    print(f"Scans per patient: min={scans_per_patient.min()}, "
          f"max={scans_per_patient.max()}, mean={scans_per_patient.mean():.1f}")
    print()

    # Compare balancing strategies
    print("=" * 50)
    print("COUNT-BALANCED GROUP K-FOLD")
    print("=" * 50)
    splits_count = group_kfold_split(X, y, groups, k=5, balance_strategy="count")
    verify_group_separation(groups, splits_count)
    print()

    print("=" * 50)
    print("SIZE-BALANCED GROUP K-FOLD")
    print("=" * 50)
    splits_size = group_kfold_split(X, y, groups, k=5, balance_strategy="size")
    verify_group_separation(groups, splits_size)
```

Unlike standard k-fold, Group K-Fold often produces unequal fold sizes because groups have different numbers of samples. The "size" balancing strategy mitigates this but cannot eliminate it entirely. Accept this variance as the price of valid evaluation—biased test sets are worse than unequal test sizes.
How groups are assigned to folds significantly impacts fold balance and evaluation quality. Let's examine the main strategies:
Strategy 1: Count-Based Assignment (Round-Robin)
Assign groups to folds in rotation, giving each fold approximately equal numbers of groups:
Groups (sample counts): A(100), B(10), C(50), D(20), E(80)
Naive contiguous split: F1=[A,B] F2=[C] F3=[D] F4=[E] F5=[] ← empty fold, bad!
Round-robin: F1=[A] F2=[B] F3=[C] F4=[D] F5=[E] ← one group per fold, but fold sizes are 100/10/50/20/80
Strategy 2: Size-Balanced Assignment (Greedy Bin-Packing)
Assign groups greedily—largest group first, each into the fold with the fewest total samples so far—to minimize variance in fold sizes (total samples per fold).
Strategy 3: Random Assignment
Shuffle the group order randomly, then assign round-robin. Repeating with different seeds supports repeated CV and avoids systematic bias.
```python
import numpy as np
from typing import Dict


def compare_assignment_strategies(
    group_sizes: Dict[str, int],
    k: int = 5,
) -> None:
    """Compare different group assignment strategies."""
    groups = list(group_sizes.keys())
    sizes = np.array([group_sizes[g] for g in groups])
    total_samples = sizes.sum()
    ideal_fold_size = total_samples / k

    print(f"Total: {total_samples} samples, {len(groups)} groups")
    print(f"Ideal fold size: {ideal_fold_size:.1f}")
    print()

    def evaluate_assignment(name: str, assignment: Dict[str, int]) -> None:
        fold_sizes = {i: 0 for i in range(k)}
        for group, fold in assignment.items():
            fold_sizes[fold] += group_sizes[group]
        sizes_list = list(fold_sizes.values())
        cv = np.std(sizes_list) / np.mean(sizes_list)  # Coefficient of variation
        print(f"{name}:")
        print(f"  Fold sizes: {sizes_list}")
        print(f"  Range: {min(sizes_list)} - {max(sizes_list)}")
        print(f"  CV: {cv:.3f} (lower is better)")
        print()

    # Strategy 1: Count-balanced (round-robin)
    count_assignment = {g: i % k for i, g in enumerate(groups)}
    evaluate_assignment("Count-balanced (round-robin)", count_assignment)

    # Strategy 2: Size-balanced (greedy)
    sorted_groups = sorted(groups, key=lambda g: group_sizes[g], reverse=True)
    fold_totals = np.zeros(k)
    size_assignment = {}
    for group in sorted_groups:
        target = int(np.argmin(fold_totals))
        size_assignment[group] = target
        fold_totals[target] += group_sizes[group]
    evaluate_assignment("Size-balanced (greedy)", size_assignment)

    # Strategy 3: Random
    np.random.seed(42)
    shuffled = np.random.permutation(groups)
    random_assignment = {g: i % k for i, g in enumerate(shuffled)}
    evaluate_assignment("Random assignment", random_assignment)


# Example with realistic group size distribution
if __name__ == "__main__":
    # Simulate patients with varying scan counts
    np.random.seed(42)
    n_patients = 25
    # Power-law-ish distribution: some patients have many scans
    group_sizes = {}
    for i in range(n_patients):
        size = int(np.random.pareto(1.5) * 10 + 5)  # Roughly 5-100
        group_sizes[f"P{i:02d}"] = size

    print("Group sizes (sorted):")
    sorted_sizes = sorted(group_sizes.items(), key=lambda x: -x[1])
    for g, s in sorted_sizes[:5]:
        print(f"  {g}: {s}")
    print(f"  ... ({len(group_sizes) - 5} more groups)")
    print()

    compare_assignment_strategies(group_sizes, k=5)
```

| Strategy | Fold Size Variance | Complexity | Best For |
|---|---|---|---|
| Round-robin | High | O(G) | Equal group sizes, quick experiments |
| Size-balanced | Low | O(G log G) | Unequal group sizes, production evaluation |
| Random | Medium-High | O(G) | Avoiding systematic bias, repeated CV |
Understanding the statistical properties of Group K-Fold helps you interpret results and choose appropriate aggregation methods.
The Effective Sample Size Problem
With grouped data, the effective sample size for variance estimation is closer to the number of groups G than the number of samples n. This is because samples within groups are correlated and don't provide fully independent information.
The effective sample size under equicorrelated data is:
$$n_{\text{eff}} = \frac{n}{1 + (m - 1) \cdot \text{ICC}}$$
where $m$ is the average group size and ICC is the intraclass correlation coefficient.
Example: $n = 1000$ samples in groups of average size $m = 20$, with $\text{ICC} = 0.6$:
$$n_{\text{eff}} = \frac{1000}{1 + (20 - 1) \cdot 0.6} = \frac{1000}{12.4} \approx 81$$
Your effective sample size is 81, not 1000! Standard error formulas using n will dramatically underestimate uncertainty.
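The formula is trivial to encode; a quick sanity check against the worked example above (the helper name is illustrative):

```python
def effective_sample_size(n: int, mean_group_size: float, icc: float) -> float:
    """n_eff = n / (1 + (m - 1) * ICC) under an equicorrelation model."""
    return n / (1 + (mean_group_size - 1) * icc)

# The worked example: 1000 samples, groups of 20, ICC = 0.6
print(f"{effective_sample_size(1000, 20, 0.6):.0f}")  # ≈ 81
```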
Variance of Cross-Validation Estimator
The variance of the CV performance estimate under group structure has two components:
$$\text{Var}(\hat{\theta}_{\text{CV}}) = \underbrace{\text{Var}_{\text{between-group}}}_{\text{captured by group CV}} + \underbrace{\text{Var}_{\text{within-group}}}_{\text{ignored by group CV}}$$
Standard CV conflates these. Group CV isolates between-group variance, which is the relevant uncertainty for new-group generalization. However, it ignores within-group variance, potentially underestimating uncertainty if within-group samples are very different from each other.
Standard formulas like SE = std(fold_scores) / sqrt(k) assume independent folds. With group CV, test folds contain disjoint groups, but fold scores still share overlapping training data, and the effective sample size per fold varies wildly. For reliable confidence intervals, use bootstrap methods that resample at the group level.
Impact on Model Selection
When comparing models A and B, group structure affects the correlation between their performance estimates. If both models are evaluated on the same group-based folds, their performance differences are less variable than if evaluated on independent samples.
This has two implications:
Paired tests are appropriate: Use paired t-tests or Wilcoxon signed-rank tests comparing model performances fold-by-fold, not unpaired tests.
Correlation can inflate or deflate detected differences: If both models struggle on the same groups (high correlation), the variance of the difference is small, making small differences significant. If they fail on different groups (low correlation), the variance is large, requiring larger differences for significance.
```python
import numpy as np
from scipy import stats
from typing import List, Tuple


def group_bootstrap_ci(
    group_scores: List[float],
    n_bootstrap: int = 10000,
    confidence: float = 0.95,
    random_state: int = 42,
) -> Tuple[float, float, float]:
    """
    Bootstrap confidence interval resampling at group level.

    Parameters
    ----------
    group_scores : List[float]
        Performance metric for each group (or fold, if one group per fold)
    n_bootstrap : int
        Number of bootstrap samples
    confidence : float
        Confidence level
    random_state : int
        Random seed

    Returns
    -------
    (mean, ci_lower, ci_upper)
    """
    np.random.seed(random_state)
    n_groups = len(group_scores)
    scores_array = np.array(group_scores)

    # Bootstrap: resample groups with replacement
    bootstrap_means = []
    for _ in range(n_bootstrap):
        resampled_indices = np.random.randint(0, n_groups, size=n_groups)
        bootstrap_means.append(np.mean(scores_array[resampled_indices]))

    bootstrap_means = np.array(bootstrap_means)
    alpha = 1 - confidence
    ci_lower = np.percentile(bootstrap_means, 100 * alpha / 2)
    ci_upper = np.percentile(bootstrap_means, 100 * (1 - alpha / 2))

    return np.mean(scores_array), ci_lower, ci_upper


def paired_model_comparison(
    model_a_scores: List[float],
    model_b_scores: List[float],
    alpha: float = 0.05,
) -> dict:
    """
    Statistical comparison of two models evaluated on same group splits.

    Returns dictionary with test results.
    """
    differences = np.array(model_a_scores) - np.array(model_b_scores)

    # Paired t-test
    t_stat, t_pval = stats.ttest_rel(model_a_scores, model_b_scores)

    # Wilcoxon signed-rank test (nonparametric alternative)
    try:
        w_stat, w_pval = stats.wilcoxon(differences)
    except ValueError:
        # All differences are zero
        w_stat, w_pval = 0, 1.0

    # Effect size (Cohen's d for paired samples)
    cohens_d = np.mean(differences) / np.std(differences, ddof=1)

    return {
        "mean_a": np.mean(model_a_scores),
        "mean_b": np.mean(model_b_scores),
        "mean_difference": np.mean(differences),
        "std_difference": np.std(differences, ddof=1),
        "t_statistic": t_stat,
        "t_pvalue": t_pval,
        "wilcoxon_pvalue": w_pval,
        "cohens_d": cohens_d,
        "significant_t": t_pval < alpha,
        "significant_wilcoxon": w_pval < alpha,
    }


# Demonstration
if __name__ == "__main__":
    np.random.seed(42)

    # Simulate 5-fold group CV results for two models:
    # Model A: ~85% mean accuracy, Model B: ~83% mean accuracy
    true_diff = 0.02  # 2% true advantage for A
    model_a_scores = np.random.normal(0.85, 0.03, size=5)
    model_b_scores = model_a_scores - true_diff + np.random.normal(0, 0.01, size=5)

    print("Fold-wise scores:")
    for i, (a, b) in enumerate(zip(model_a_scores, model_b_scores)):
        print(f"  Fold {i+1}: A={a:.3f}, B={b:.3f}, diff={a-b:+.3f}")
    print()

    # Bootstrap CI for each model
    mean_a, ci_low, ci_high = group_bootstrap_ci(model_a_scores.tolist())
    print(f"Model A: {mean_a:.3f} (95% CI: [{ci_low:.3f}, {ci_high:.3f}])")
    mean_b, ci_low, ci_high = group_bootstrap_ci(model_b_scores.tolist())
    print(f"Model B: {mean_b:.3f} (95% CI: [{ci_low:.3f}, {ci_high:.3f}])")
    print()

    # Model comparison
    comparison = paired_model_comparison(
        model_a_scores.tolist(), model_b_scores.tolist()
    )
    print("Paired comparison:")
    print(f"  Mean difference: {comparison['mean_difference']:.3f}")
    print(f"  Cohen's d: {comparison['cohens_d']:.2f}")
    print(f"  t-test p-value: {comparison['t_pvalue']:.4f}")
    print(f"  Significant at α=0.05: {comparison['significant_t']}")
```

Scikit-learn provides `GroupKFold` for production use. Let's examine best practices for integrating it into real ML pipelines.
```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold, cross_validate
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import make_scorer, f1_score
from typing import Dict, Any, Optional


class GroupCVEvaluator:
    """
    Production-ready group cross-validation evaluator.

    Handles:
    - Multiple metrics
    - Preprocessing within folds
    - Group-level statistics
    - Proper confidence intervals
    """

    def __init__(
        self,
        n_splits: int = 5,
        metrics: Optional[Dict[str, Any]] = None,
    ):
        self.n_splits = n_splits
        self.metrics = metrics or {
            'accuracy': 'accuracy',
            'f1': make_scorer(f1_score, average='binary'),
            'roc_auc': 'roc_auc',
        }
        self.group_cv = GroupKFold(n_splits=n_splits)

    def evaluate(
        self,
        model,
        X: np.ndarray,
        y: np.ndarray,
        groups: np.ndarray,
        preprocess: bool = True,
    ) -> Dict[str, Any]:
        """
        Perform group-aware cross-validation.

        Returns detailed evaluation results.
        """
        # Create pipeline with optional preprocessing
        if preprocess:
            pipeline = Pipeline([
                ('scaler', StandardScaler()),
                ('model', model),
            ])
        else:
            pipeline = model

        # Group-level statistics
        unique_groups = np.unique(groups)
        n_groups = len(unique_groups)
        group_sizes = pd.Series(groups).value_counts()

        # Perform cross-validation
        cv_results = cross_validate(
            pipeline, X, y,
            cv=self.group_cv,
            groups=groups,
            scoring=self.metrics,
            return_train_score=True,
            return_estimator=False,
            n_jobs=-1,
        )

        # Analyze fold composition
        fold_info = []
        for fold_idx, (train_idx, test_idx) in enumerate(
            self.group_cv.split(X, y, groups)
        ):
            train_groups = np.unique(groups[train_idx])
            test_groups = np.unique(groups[test_idx])
            fold_info.append({
                'fold': fold_idx + 1,
                'n_train_samples': len(train_idx),
                'n_test_samples': len(test_idx),
                'n_train_groups': len(train_groups),
                'n_test_groups': len(test_groups),
                'test_class_dist': dict(
                    zip(*np.unique(y[test_idx], return_counts=True))
                ),
            })

        # Compile results
        results = {
            'n_samples': len(y),
            'n_groups': n_groups,
            'group_size_stats': {
                'min': int(group_sizes.min()),
                'max': int(group_sizes.max()),
                'mean': float(group_sizes.mean()),
                'std': float(group_sizes.std()),
            },
            'fold_info': fold_info,
            'metrics': {},
        }

        # Process each metric
        for metric in self.metrics.keys():
            test_scores = cv_results[f'test_{metric}']
            train_scores = cv_results[f'train_{metric}']

            # Bootstrap CI over fold scores
            n_bootstrap = 10000
            np.random.seed(42)
            bootstrap_means = [
                np.mean(test_scores[
                    np.random.randint(0, len(test_scores), len(test_scores))
                ])
                for _ in range(n_bootstrap)
            ]

            results['metrics'][metric] = {
                'test_mean': float(np.mean(test_scores)),
                'test_std': float(np.std(test_scores)),
                'test_fold_scores': test_scores.tolist(),
                'ci_95': (
                    float(np.percentile(bootstrap_means, 2.5)),
                    float(np.percentile(bootstrap_means, 97.5)),
                ),
                'train_mean': float(np.mean(train_scores)),
                'overfit_gap': float(
                    np.mean(train_scores) - np.mean(test_scores)
                ),
            }

        return results

    def print_report(self, results: Dict[str, Any]) -> None:
        """Print formatted evaluation report."""
        print("=" * 70)
        print("GROUP K-FOLD CROSS-VALIDATION REPORT")
        print("=" * 70)
        print(f"Dataset: {results['n_samples']} samples, "
              f"{results['n_groups']} groups")
        gs = results['group_size_stats']
        print(f"Group sizes: {gs['min']}-{gs['max']} "
              f"(mean: {gs['mean']:.1f} ± {gs['std']:.1f})")
        print("Fold Details:")
        for fi in results['fold_info']:
            print(f"  Fold {fi['fold']}: {fi['n_test_samples']} test samples "
                  f"({fi['n_test_groups']} groups), "
                  f"class dist: {fi['test_class_dist']}")
        print("Metrics:")
        for metric, data in results['metrics'].items():
            print(f"  {metric.upper()}:")
            print(f"    Test: {data['test_mean']:.4f} ± {data['test_std']:.4f}")
            print(f"    95% CI: [{data['ci_95'][0]:.4f}, {data['ci_95'][1]:.4f}]")
            print(f"    Fold scores: {[f'{s:.3f}' for s in data['test_fold_scores']]}")
            print(f"    Overfit gap: {data['overfit_gap']:.4f}")


# Full demonstration
if __name__ == "__main__":
    # Create grouped classification dataset
    np.random.seed(42)
    n_groups = 40
    samples_per_group = np.random.randint(10, 50, size=n_groups)

    X_list, y_list, groups_list = [], [], []
    for group_id, n_samples in enumerate(samples_per_group):
        # Create features with group-specific bias
        group_mean = np.random.randn(10) * 0.5  # Group-specific effect
        X_group = np.random.randn(n_samples, 10) + group_mean
        # Label based on group (with noise)
        group_label_prob = np.random.uniform(0.2, 0.8)
        y_group = (np.random.random(n_samples) < group_label_prob).astype(int)
        X_list.append(X_group)
        y_list.append(y_group)
        groups_list.extend([group_id] * n_samples)

    X = np.vstack(X_list)
    y = np.concatenate(y_list)
    groups = np.array(groups_list)

    print(f"Created dataset: {X.shape[0]} samples, {n_groups} groups")
    print(f"Class distribution: {np.bincount(y)}")
    print()

    # Evaluate
    evaluator = GroupCVEvaluator(n_splits=5)
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    results = evaluator.evaluate(model, X, y, groups)
    evaluator.print_report(results)
```

Always pass the `groups` parameter explicitly to `cross_validate()` and `cross_val_score()`. Current scikit-learn raises an error if `GroupKFold` is used without `groups`, but passing a plain `cv=5` instead of a `GroupKFold` instance silently falls back to ungrouped k-fold—no warning, no error. This silent failure is a common source of bugs.
Group K-Fold is straightforward in concept but tricky in practice. Here are the most common failure modes:
Passing `cv=5` instead of a `GroupKFold` instance (or forgetting `groups=`) silently falls back to ungrouped behavior. Always double-check.

Debugging Group Leakage
When you suspect group leakage, use these diagnostic checks:
```python
def check_for_leakage(groups, train_idx, test_idx):
    """Verify no group overlap between train and test."""
    train_groups = set(groups[train_idx])
    test_groups = set(groups[test_idx])
    overlap = train_groups & test_groups
    if overlap:
        raise ValueError(f"LEAKAGE DETECTED! Overlapping groups: {overlap}")
    return True

# Verify all folds
for fold, (train_idx, test_idx) in enumerate(cv.split(X, y, groups)):
    check_for_leakage(groups, train_idx, test_idx)
    print(f"Fold {fold}: No leakage ✓")
```
Standard Group K-Fold handles one layer of grouping. But real data often has more complex structure:
Hierarchical Groups
In multi-level hierarchies (students within classrooms within schools), grouping at the lowest level (students) doesn't prevent leakage from shared classroom or school effects.
Solution: Group at the highest relevant level. For school effects, use school as the group unit, even though it means fewer groups and more variance in fold sizes.
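A sketch of the idea with synthetic, hypothetical school/classroom IDs: because classrooms are nested inside schools, grouping on the school variable automatically keeps classrooms intact too.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n = 300
# Hypothetical two-level hierarchy: classrooms nested in 10 schools
school = rng.integers(0, 10, size=n)
classroom = school * 10 + rng.integers(0, 3, size=n)  # 3 classrooms/school
X = rng.normal(size=(n, 4))

# Group at the HIGHEST relevant level (school): no school -- and therefore
# no nested classroom -- ever straddles the train/test boundary.
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, groups=school):
    assert not set(school[train_idx]) & set(school[test_idx])
    assert not set(classroom[train_idx]) & set(classroom[test_idx])
print("All 5 folds keep schools (and their classrooms) intact")
```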
Multiple Grouping Variables
Sometimes samples belong to multiple groups that shouldn't be split. Medical images might have both patient ID and imaging device relationships.
Solution: Create composite group IDs: group = f"{patient_id}_{device_id}". Note, however, that a composite ID only keeps together samples that share both identifiers; if the groupings overlap (one patient scanned on several devices), that patient's samples can still be split across folds, and a stricter merge of linked groups is needed.
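When the two groupings overlap, a fully conservative option is to merge any groups linked by a shared patient or device—i.e., take connected components of the patient-device graph. A sketch using scipy (the helper `merged_group_ids` and the toy arrays are illustrative):

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def merged_group_ids(patients, devices):
    """Connected components of the bipartite patient-device graph:
    two samples share a merged group if linked through ANY chain of
    shared patients or devices."""
    p_codes, p_idx = np.unique(patients, return_inverse=True)
    d_codes, d_idx = np.unique(devices, return_inverse=True)
    n_p, n_d = len(p_codes), len(d_codes)
    # Nodes 0..n_p-1 are patients; nodes n_p..n_p+n_d-1 are devices
    rows = np.concatenate([p_idx, n_p + d_idx])
    cols = np.concatenate([n_p + d_idx, p_idx])
    data = np.ones(len(rows))
    adj = coo_matrix((data, (rows, cols)), shape=(n_p + n_d, n_p + n_d))
    _, labels = connected_components(adj, directed=False)
    return labels[p_idx]  # component of each sample's patient node

patients = np.array(["p1", "p1", "p2", "p2", "p3"])
devices  = np.array(["d1", "d2", "d2", "d3", "d4"])
# p1 and p2 are linked through d2, so they merge; p3 stands alone
print(merged_group_ids(patients, devices))
```

This guarantees neither grouping is ever split, at the cost of fewer, larger groups—in pathological cases a single giant component, at which point group CV is no longer possible and you must reconsider which grouping truly matters.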
Time-Based Groups with Samples
Grouped time series (multiple time points per entity) require special handling covered in Module 4.
Stratification + Groups
Sometimes you need both stratified class proportions AND group separation. This is covered in the "Grouped Stratification" page later in this module.
When uncertain about group structure, be conservative: group at higher levels, use fewer folds, and assume more correlation than less. Overly conservative evaluation produces higher variance estimates but valid conclusions. Insufficiently conservative evaluation produces optimistically biased estimates and deployment disasters.
We've comprehensively covered Group K-Fold cross-validation—a critical technique for valid evaluation on non-i.i.d. data. The essential takeaways:

- Grouped data violates the i.i.d. assumption; standard CV leaks information across the train/test boundary and inflates performance estimates.
- Group K-Fold partitions groups, not samples: every sample from a group lands on one side of the split.
- When group sizes vary, size-balanced (greedy) assignment keeps fold sizes far more even than round-robin.
- The effective sample size is closer to the number of groups than the number of samples; use group-level bootstrap for confidence intervals and paired tests for model comparison.
- Always pass `groups=` explicitly and verify group separation programmatically.
Group K-Fold handles grouped samples but uses multiple groups per fold. Sometimes you need to leave exactly one group out for testing—especially for small group counts or when each group represents a critical evaluation scenario. The next page covers Leave-One-Group-Out (LOGO) cross-validation, its use cases, and implementation.