Consider a clinical trial with 30 hospitals. Your model must work at hospitals never seen during training. How do you evaluate this? You could use Group 5-Fold, but each test set would contain data from 6 hospitals mixed together—you'd never know if failure stems from one problematic hospital or a general weakness.
Leave-One-Group-Out (LOGO) cross-validation takes the most conservative approach: each fold uses exactly one group for testing and all others for training. With 30 hospitals, you get 30 iterations, each testing on a single hospital.
This approach provides per-group performance estimates (you know exactly which hospital fails), maximal training data in every fold, and a fully deterministic split with no partitioning randomness.
The cost? Computational intensity (G iterations instead of k) and high variance (single-group test sets may be small). Understanding when this tradeoff is worthwhile is essential for production ML.
By the end of this page, you will master LOGO's theoretical foundations, understand its variance properties versus Group K-Fold, know precisely when to use LOGO over other strategies, implement production-ready LOGO pipelines, and analyze group-level failure modes.
Definition: Leave-One-Group-Out Cross-Validation
Given a dataset $\mathcal{D} = \{(x_i, y_i, g_i)\}_{i=1}^{n}$ with $G$ unique groups, LOGO performs $G$ iterations:
For iteration $j$, where $j \in \{1, 2, \ldots, G\}$: train on $\mathcal{D}_{\text{train}}^{(j)} = \{(x_i, y_i) : g_i \neq j\}$ and test on $\mathcal{D}_{\text{test}}^{(j)} = \{(x_i, y_i) : g_i = j\}$.
The performance estimate is the average over all held-out groups:
$$\hat{\theta}_{\text{LOGO}} = \frac{1}{G} \sum_{j=1}^{G} \theta\left(\mathcal{D}_{\text{train}}^{(j)}, \mathcal{D}_{\text{test}}^{(j)}\right)$$
where $\theta(\cdot, \cdot)$ is the performance metric of the model trained on $\mathcal{D}_{\text{train}}^{(j)}$ and evaluated on $\mathcal{D}_{\text{test}}^{(j)}$.
Key Properties:
| Property | LOGO | Group K-Fold (k=5) |
|---|---|---|
| Number of iterations | G (one per group) | k (fixed) |
| Test set size | One group each | G/k groups each |
| Training set size | G-1 groups | (k-1)/k × G groups |
| Per-group diagnostics | Yes | No (groups are mixed) |
| Computational cost | G model trainings | k model trainings |
| Variance | High (small test sets) | Lower (larger, pooled test sets) |
| Bias | Low (maximum training) | Higher (less training data) |
The Bias-Variance Tradeoff in LOGO
LOGO maximizes training data (low bias) but creates high variance due to small test sets. The components:
Bias Component: With G-1 groups for training, LOGO uses $(G-1)/G \times 100\%$ of the data—typically 95%+ for G ≥ 20. This closely approximates the full-data model.
Variance Component: Each test set contains only one group. If group sizes vary, some iterations test on 10 samples while others test on 1000. The variance of the estimator is dominated by the smallest groups.
For group j with size $n_j$, the variance contribution scales as:
$$\text{Var}(\hat{\theta}_j) \propto \frac{1}{n_j}$$
The overall variance is approximately:
$$\text{Var}(\hat{\theta}_{\text{LOGO}}) \approx \frac{1}{G^2} \sum_{j=1}^{G} \text{Var}(\hat{\theta}_j) + \text{Var}_{\text{between-groups}}$$
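A quick numeric sketch makes this concrete. The group sizes and the assumed accuracy $p = 0.8$ below are illustrative, not from any real dataset; per-group score variance is approximated with the binomial formula $p(1-p)/n_j$:

```python
import numpy as np

# Illustrative group sizes: one small group among mostly large ones
group_sizes = np.array([10, 500, 500, 500, 500])
p = 0.8  # assumed true accuracy, identical for every group

# Binomial approximation: Var(score_j) ≈ p(1-p)/n_j
per_group_var = p * (1 - p) / group_sizes

# Each group contributes Var(score_j)/G² to the variance of the LOGO mean
G = len(group_sizes)
contributions = per_group_var / G**2

share_of_smallest = contributions[0] / contributions.sum()
print(f"Smallest group's share of total variance: {share_of_smallest:.1%}")
```

Even though the 10-sample group holds under 1% of the data, it accounts for the overwhelming majority of the estimator's variance—exactly the "dominated by the smallest groups" effect described above.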
High variance in LOGO isn't always problematic. If your goal is to understand model behavior on individual groups (e.g., 'which hospitals will struggle?'), the granular per-group estimates are exactly what you need, even if the aggregate mean has high uncertainty.
Let's implement LOGO from first principles, then compare with Scikit-learn's production implementation.
```python
import numpy as np
from collections import defaultdict
from typing import Tuple, Dict, Any, Iterator


class LeaveOneGroupOutCV:
    """
    Leave-One-Group-Out cross-validator from scratch.

    Provides detailed diagnostics not available in sklearn's implementation.
    """

    def __init__(self):
        self.group_info_: Dict[Any, Dict] = {}

    def split(
        self,
        X: np.ndarray,
        y: np.ndarray = None,
        groups: np.ndarray = None
    ) -> Iterator[Tuple[np.ndarray, np.ndarray]]:
        """
        Generate train/test indices for each group.

        Yields
        ------
        train_indices, test_indices for each group as test set
        """
        if groups is None:
            raise ValueError("groups must be provided for LeaveOneGroupOut")

        n_samples = len(groups)
        unique_groups = np.unique(groups)

        # Build index mapping
        group_to_indices = defaultdict(list)
        for idx, group in enumerate(groups):
            group_to_indices[group].append(idx)

        # Store group info for diagnostics
        for group in unique_groups:
            indices = group_to_indices[group]
            self.group_info_[group] = {
                'n_samples': len(indices),
                'indices': np.array(indices),
                'class_dist': dict(zip(*np.unique(y[indices], return_counts=True)))
                              if y is not None else None
            }

        # Generate splits
        all_indices = np.arange(n_samples)
        for group in unique_groups:
            test_indices = np.array(group_to_indices[group])
            train_mask = np.ones(n_samples, dtype=bool)
            train_mask[test_indices] = False
            train_indices = all_indices[train_mask]
            yield train_indices, test_indices

    def get_n_splits(self, X=None, y=None, groups=None) -> int:
        """Return number of splits (equals number of unique groups)."""
        if groups is None:
            raise ValueError("groups must be provided")
        return len(np.unique(groups))

    def get_group_diagnostics(self) -> Dict[Any, Dict]:
        """Return detailed information about each group."""
        return self.group_info_


def logo_evaluate_with_diagnostics(
    model,
    X: np.ndarray,
    y: np.ndarray,
    groups: np.ndarray,
    metric_func,
    return_predictions: bool = False
) -> Dict[str, Any]:
    """
    Perform LOGO CV with comprehensive diagnostics.

    Returns per-group scores, failure analysis, and optional predictions.
    """
    cv = LeaveOneGroupOutCV()
    group_results = {}
    all_predictions = np.zeros_like(y, dtype=float)

    for fold_idx, (train_idx, test_idx) in enumerate(cv.split(X, y, groups)):
        # Get group ID for this fold
        test_group = groups[test_idx[0]]  # All test samples have same group

        # Train model
        model_clone = clone_model(model)
        model_clone.fit(X[train_idx], y[train_idx])

        # Predict
        y_pred = model_clone.predict(X[test_idx])
        if hasattr(model_clone, 'predict_proba'):
            y_proba = model_clone.predict_proba(X[test_idx])[:, 1]
        else:
            y_proba = y_pred

        # Compute metric
        score = metric_func(y[test_idx], y_pred)

        # Store results
        group_results[test_group] = {
            'score': score,
            'n_samples': len(test_idx),
            'class_distribution': dict(zip(*np.unique(y[test_idx], return_counts=True))),
            'train_n_samples': len(train_idx),
            'predictions': y_pred if return_predictions else None,
            'probabilities': y_proba if return_predictions else None
        }

        if return_predictions:
            all_predictions[test_idx] = y_proba

    # Aggregate statistics
    scores = [r['score'] for r in group_results.values()]
    sizes = [r['n_samples'] for r in group_results.values()]

    # Weighted mean (by test set size)
    weighted_mean = np.average(scores, weights=sizes)

    # Identify problem groups
    mean_score = np.mean(scores)
    std_score = np.std(scores)
    problem_groups = {
        group: info for group, info in group_results.items()
        if info['score'] < mean_score - 2 * std_score
    }

    return {
        'group_results': group_results,
        'aggregate': {
            'mean': np.mean(scores),
            'std': np.std(scores),
            'weighted_mean': weighted_mean,
            'min': np.min(scores),
            'max': np.max(scores),
            'min_group': min(group_results, key=lambda g: group_results[g]['score']),
            'max_group': max(group_results, key=lambda g: group_results[g]['score']),
            'n_groups': len(group_results)
        },
        'problem_groups': problem_groups,
        'all_predictions': all_predictions if return_predictions else None
    }


def clone_model(model):
    """Simple model cloning for demonstration."""
    from sklearn.base import clone
    return clone(model)


# Demonstration
if __name__ == "__main__":
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    np.random.seed(42)

    # Create dataset with group structure
    n_groups = 12
    group_sizes = np.random.randint(20, 100, size=n_groups)

    X_list, y_list, groups_list = [], [], []
    for group_id, n_samples in enumerate(group_sizes):
        # Some groups are "harder" (noisier features)
        noise_scale = 0.5 + (group_id % 3) * 0.3  # Groups 2, 5, 8, 11 are hardest
        X_group = np.random.randn(n_samples, 10) * noise_scale
        y_group = (X_group[:, 0] + X_group[:, 1] > 0).astype(int)
        X_list.append(X_group)
        y_list.append(y_group)
        groups_list.extend([f"Hospital_{group_id}"] * n_samples)

    X = np.vstack(X_list)
    y = np.concatenate(y_list)
    groups = np.array(groups_list)

    print(f"Dataset: {len(y)} samples, {n_groups} groups")
    print()

    # Evaluate with LOGO
    model = RandomForestClassifier(n_estimators=50, random_state=42)
    results = logo_evaluate_with_diagnostics(
        model, X, y, groups,
        metric_func=accuracy_score,
        return_predictions=False
    )

    print("=" * 60)
    print("LEAVE-ONE-GROUP-OUT CROSS-VALIDATION RESULTS")
    print("=" * 60)

    print("Aggregate Performance:")
    agg = results['aggregate']
    print(f"  Mean accuracy: {agg['mean']:.4f} ± {agg['std']:.4f}")
    print(f"  Weighted mean: {agg['weighted_mean']:.4f}")
    print(f"  Range: [{agg['min']:.4f}, {agg['max']:.4f}]")
    print(f"  Best group:  {agg['max_group']} ({agg['max']:.4f})")
    print(f"  Worst group: {agg['min_group']} ({agg['min']:.4f})")

    print("Per-Group Details:")
    for group, info in sorted(results['group_results'].items()):
        status = "⚠️" if group in results['problem_groups'] else "✓"
        print(f"  {group}: {info['score']:.4f} ({info['n_samples']} samples) {status}")

    if results['problem_groups']:
        print("Problem Groups (score < mean - 2σ):")
        for group, info in results['problem_groups'].items():
            print(f"  {group}: {info['score']:.4f}")
```

Production Usage with Scikit-learn
For production pipelines, scikit-learn's LeaveOneGroupOut integrates seamlessly with cross_validate:
```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_validate
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import make_scorer, f1_score


def production_logo_evaluation(model, X, y, groups):
    """
    Production-ready LOGO evaluation with sklearn.
    """
    # Create preprocessing pipeline
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('model', model)
    ])

    # Define LOGO cross-validator
    logo = LeaveOneGroupOut()

    # Define metrics
    scoring = {
        'accuracy': 'accuracy',
        'f1': make_scorer(f1_score, average='binary', zero_division=0),
        'roc_auc': 'roc_auc'
    }

    # Perform cross-validation
    # CRITICAL: Pass groups= explicitly!
    results = cross_validate(
        pipeline, X, y,
        cv=logo,
        groups=groups,  # <-- Don't forget this!
        scoring=scoring,
        return_train_score=True,
        return_estimator=True,  # Keep models for analysis
        n_jobs=-1
    )

    # Map results to groups (folds follow the sorted order of unique groups)
    unique_groups = np.unique(groups)
    group_scores = {}
    for i, group in enumerate(unique_groups):
        group_scores[group] = {
            metric: results[f'test_{metric}'][i]
            for metric in scoring.keys()
        }

    return results, group_scores
```

Choosing between LOGO and Group K-Fold depends on your goals, constraints, and data characteristics.
| Factor | Favors LOGO | Favors Group K-Fold |
|---|---|---|
| Number of groups | 10-50 | 50-10,000+ |
| Samples per group | 50-10,000 | <50 |
| Per-group reporting | Required | Not needed |
| Computational budget | Generous | Tight |
| Group heterogeneity | High (want to see each) | Low (groups similar) |
| Model training time | Fast (<1 min) | Slow (hours) |
For large G with meaningful subgroups, consider hierarchical approaches: Use Group K-Fold at the top level (e.g., regions) and LOGO within each fold to diagnose individual groups (e.g., hospitals within regions). This balances computational cost with diagnostic granularity.
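The hierarchical idea can be sketched with scikit-learn's built-in splitters. The two-level region/hospital labels below are synthetic, and the nested loop structure is one possible arrangement rather than a fixed recipe:

```python
import numpy as np
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut

rng = np.random.default_rng(0)

# Hypothetical two-level grouping: 4 regions, each with 3 hospitals
n = 400
regions = rng.integers(0, 4, size=n)
hospitals = regions * 10 + rng.integers(0, 3, size=n)  # hospital IDs nested in regions
X, y = rng.normal(size=(n, 5)), rng.integers(0, 2, size=n)

outer = GroupKFold(n_splits=2)  # top level: split by region
inner = LeaveOneGroupOut()      # within each fold: diagnose individual hospitals

held_out_hospitals = []
for outer_train, outer_test in outer.split(X, y, groups=regions):
    h = hospitals[outer_train]
    for inner_train, inner_test in inner.split(X[outer_train], y[outer_train], h):
        # Every inner test fold contains exactly one hospital
        assert np.unique(h[inner_test]).size == 1
        held_out_hospitals.append(h[inner_test[0]])
        # train/evaluate here on the inner split

print(f"Inner diagnostic folds run: {len(held_out_hospitals)}")
```

With 4 regions and 2 outer folds, each outer training fold contains 2 regions (6 hospitals), so only 12 inner trainings are needed—far fewer than LOGO over every hospital on the full dataset would require at larger scale.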
Understanding LOGO's variance properties is crucial for interpreting results and constructing valid confidence intervals.
Sources of Variance
The total variance of the LOGO estimator has two components:
$$\text{Var}(\hat{\theta}_{\text{LOGO}}) = \text{Var}_{\text{between-groups}} + \text{Var}_{\text{within-groups}}$$
For small groups, within-group variance dominates. For large groups, between-group variance dominates.
Confidence Interval Construction
Standard formulas underestimate uncertainty because they assume independent fold errors. For LOGO, use:
Method 1: Group-Bootstrap Resample groups (not samples) with replacement, compute mean, repeat.
Method 2: Jackknife-After-Bootstrap Combine jackknife influence diagnostics with bootstrap standard errors for better coverage.
Method 3: Corrected t-Interval Use a t-distribution with G-1 degrees of freedom:
$$\hat{\theta} \pm t_{G-1, 1-\alpha/2} \times \frac{s}{\sqrt{G}}$$
where $s$ is the standard deviation of group-level scores.
```python
import numpy as np
from scipy import stats
from typing import List, Tuple


def logo_confidence_interval(
    group_scores: List[float],
    group_sizes: List[int] = None,
    method: str = "bootstrap",
    confidence: float = 0.95,
    n_bootstrap: int = 10000,
    random_state: int = 42
) -> Tuple[float, float, float]:
    """
    Compute confidence interval for LOGO estimate.

    Parameters
    ----------
    group_scores : List[float]
        Performance metric for each group
    group_sizes : List[int], optional
        Number of samples in each group (for weighted estimates)
    method : str
        "bootstrap", "t", or "weighted_bootstrap"
    confidence : float
        Confidence level (e.g., 0.95 for 95% CI)

    Returns
    -------
    (mean, ci_lower, ci_upper)
    """
    np.random.seed(random_state)
    scores = np.array(group_scores)
    G = len(scores)
    alpha = 1 - confidence

    if method == "t":
        # Corrected t-interval (assumes approximately normal group scores)
        mean = np.mean(scores)
        se = np.std(scores, ddof=1) / np.sqrt(G)
        t_crit = stats.t.ppf(1 - alpha / 2, df=G - 1)
        return mean, mean - t_crit * se, mean + t_crit * se

    elif method == "bootstrap":
        # Group-level bootstrap: resample groups, not samples
        bootstrap_means = []
        for _ in range(n_bootstrap):
            resampled = scores[np.random.randint(0, G, size=G)]
            bootstrap_means.append(np.mean(resampled))
        bootstrap_means = np.array(bootstrap_means)
        return (
            np.mean(scores),
            np.percentile(bootstrap_means, 100 * alpha / 2),
            np.percentile(bootstrap_means, 100 * (1 - alpha / 2))
        )

    elif method == "weighted_bootstrap":
        if group_sizes is None:
            raise ValueError("group_sizes required for weighted bootstrap")
        sizes = np.array(group_sizes)

        # Weighted mean
        weighted_mean = np.average(scores, weights=sizes)

        # Bootstrap with group-size weights
        bootstrap_means = []
        for _ in range(n_bootstrap):
            resampled_idx = np.random.choice(G, size=G, replace=True)
            resampled_mean = np.average(
                scores[resampled_idx], weights=sizes[resampled_idx]
            )
            bootstrap_means.append(resampled_mean)
        bootstrap_means = np.array(bootstrap_means)
        return (
            weighted_mean,
            np.percentile(bootstrap_means, 100 * alpha / 2),
            np.percentile(bootstrap_means, 100 * (1 - alpha / 2))
        )

    else:
        raise ValueError(f"Unknown method: {method}")


def analyze_variance_components(
    group_scores: List[float],
    group_sizes: List[int]
) -> dict:
    """
    Decompose total variance into between-group and within-group components.

    Uses a one-way ANOVA-style decomposition.
    """
    scores = np.array(group_scores)
    sizes = np.array(group_sizes)
    G = len(scores)

    # Overall mean (unweighted)
    grand_mean = np.mean(scores)

    # Between-group variance (variance of group means)
    between_variance = np.var(scores, ddof=1)

    # Estimate within-group variance from group size
    # Assuming binomial-like sampling: Var(accuracy) ≈ p(1-p)/n
    estimated_within_var = np.mean([
        s * (1 - s) / sz for s, sz in zip(scores, sizes)
    ])

    # For the LOGO mean: Var ≈ between-group variance / G
    return {
        'grand_mean': grand_mean,
        'between_group_variance': between_variance,
        'estimated_within_group_variance': estimated_within_var,
        'variance_of_mean_estimate': between_variance / G,
        'se_of_mean': np.sqrt(between_variance / G),
        'cv_of_scores': np.std(scores) / np.mean(scores),  # Coefficient of variation
        'min_score': np.min(scores),
        'max_score': np.max(scores),
        'range': np.max(scores) - np.min(scores)
    }


# Demonstration
if __name__ == "__main__":
    np.random.seed(42)

    # Simulate LOGO results with varying group sizes
    n_groups = 15
    group_sizes = np.random.randint(30, 200, size=n_groups)

    # True performance varies by group, plus sampling noise
    true_group_performance = np.random.normal(0.85, 0.05, size=n_groups)

    # Add sampling noise proportional to 1/sqrt(n)
    observed_scores = (
        true_group_performance
        + np.random.normal(0, 0.1, size=n_groups) / np.sqrt(group_sizes)
    )
    observed_scores = np.clip(observed_scores, 0, 1)  # Keep in valid range

    print("Group-level LOGO Results:")
    for i, (score, size) in enumerate(zip(observed_scores, group_sizes)):
        print(f"  Group {i+1}: {score:.4f} (n={size})")
    print()

    # Compare confidence intervals from the different methods
    print("Confidence Interval Comparison:")
    for method in ["t", "bootstrap", "weighted_bootstrap"]:
        if method == "weighted_bootstrap":
            mean, ci_low, ci_high = logo_confidence_interval(
                observed_scores.tolist(), group_sizes.tolist(), method=method
            )
        else:
            mean, ci_low, ci_high = logo_confidence_interval(
                observed_scores.tolist(), method=method
            )
        width = ci_high - ci_low
        print(f"  {method:20s}: {mean:.4f} [{ci_low:.4f}, {ci_high:.4f}] (width={width:.4f})")
    print()

    # Variance analysis
    variance_analysis = analyze_variance_components(
        observed_scores.tolist(), group_sizes.tolist()
    )
    print("Variance Analysis:")
    for key, value in variance_analysis.items():
        if isinstance(value, float):
            print(f"  {key}: {value:.6f}")
```

When some groups have <20 samples, their individual score estimates have high variance. This doesn't mean the group is inherently problematic—it means you don't have enough data to tell. Consider flagging these for follow-up data collection rather than trusting the score.
One of LOGO's primary advantages is identifying which groups cause problems. Let's develop a systematic approach to failure mode analysis.
Step 1: Identify Outlier Groups
Define problem groups as those with scores significantly below the mean—for example, any group whose score falls below $\bar{\theta} - 2s$, where $\bar{\theta}$ and $s$ are the mean and standard deviation of the group-level scores.
Step 2: Characterize Problem Groups
For each problem group, analyze its sample size, class distribution, and feature statistics relative to the remaining groups; a shift in any of these often explains the performance gap.
Step 3: Distinguish Technical from Data Issues
A shifted feature distribution usually points to a pipeline or instrumentation difference, while extreme class imbalance or implausible labels point to data-quality problems within that group.
Step 4: Prioritize Remediation
Rank problem groups by severity—the gap between the group's score and the mean—weighted by how reliable each estimate is (larger groups give more trustworthy scores).
```python
import numpy as np
import pandas as pd
from typing import Dict


class LOGOFailureModeAnalyzer:
    """
    Systematic failure mode analysis for LOGO results.
    """

    def __init__(
        self,
        threshold_method: str = "zscore",
        zscore_threshold: float = 2.0,
        quantile_threshold: float = 0.1,
        minimum_score: float = None
    ):
        self.threshold_method = threshold_method
        self.zscore_threshold = zscore_threshold
        self.quantile_threshold = quantile_threshold
        self.minimum_score = minimum_score

    def identify_problem_groups(
        self, group_scores: Dict[str, float]
    ) -> Dict[str, Dict]:
        """Identify groups performing significantly below average."""
        scores = np.array(list(group_scores.values()))
        mean_score = np.mean(scores)
        std_score = np.std(scores)

        problems = {}
        for group, score in group_scores.items():
            issues = []

            # Z-score check
            zscore = (score - mean_score) / std_score if std_score > 0 else 0
            if zscore < -self.zscore_threshold:
                issues.append(f"zscore = {zscore:.2f}")

            # Quantile check
            percentile = (scores < score).mean()
            if percentile < self.quantile_threshold:
                issues.append(f"bottom {percentile*100:.1f}th percentile")

            # Minimum threshold check
            if self.minimum_score and score < self.minimum_score:
                issues.append(f"below minimum ({self.minimum_score})")

            if issues:
                problems[group] = {
                    'score': score,
                    'zscore': zscore,
                    'percentile': percentile,
                    'issues': issues,
                    'gap_from_mean': mean_score - score
                }

        return problems

    def analyze_group_characteristics(
        self,
        X: np.ndarray,
        y: np.ndarray,
        groups: np.ndarray,
        problem_groups: Dict[str, Dict],
        group_metadata: Dict[str, Dict] = None
    ) -> pd.DataFrame:
        """
        Compare feature distributions between problem and non-problem groups.
        """
        problem_set = set(problem_groups.keys())
        analysis_rows = []

        for group in np.unique(groups):
            mask = groups == group
            X_group = X[mask]
            y_group = y[mask]

            row = {
                'group': group,
                'is_problem': group in problem_set,
                'n_samples': len(y_group),
                'class_1_ratio': np.mean(y_group),
                'feature_mean': np.mean(X_group),
                'feature_std': np.std(X_group),
                'feature_min': np.min(X_group),
                'feature_max': np.max(X_group),
            }

            # Add per-feature statistics (first 5 features)
            for feat_idx in range(min(X.shape[1], 5)):
                row[f'feat_{feat_idx}_mean'] = np.mean(X_group[:, feat_idx])
                row[f'feat_{feat_idx}_std'] = np.std(X_group[:, feat_idx])

            # Add metadata if available
            if group_metadata and group in group_metadata:
                row.update({f'meta_{k}': v for k, v in group_metadata[group].items()})

            analysis_rows.append(row)

        return pd.DataFrame(analysis_rows)

    def generate_remediation_report(
        self,
        problem_groups: Dict[str, Dict],
        characteristic_analysis: pd.DataFrame
    ) -> str:
        """Generate actionable remediation recommendations."""
        report_lines = [
            "=" * 60,
            "LOGO FAILURE MODE ANALYSIS REPORT",
            "=" * 60,
            f"Total problem groups identified: {len(problem_groups)}",
            ""
        ]

        # Sort by severity (gap from mean)
        sorted_problems = sorted(
            problem_groups.items(),
            key=lambda x: x[1]['gap_from_mean'],
            reverse=True
        )

        for group, info in sorted_problems:
            group_row = characteristic_analysis[
                characteristic_analysis['group'] == group
            ].iloc[0]

            report_lines.append("─" * 40)
            report_lines.append(f"Group: {group}")
            report_lines.append(f"  Score: {info['score']:.4f} (z={info['zscore']:.2f})")
            report_lines.append(f"  Gap from mean: {info['gap_from_mean']:.4f}")
            report_lines.append(f"  Issues: {', '.join(info['issues'])}")
            report_lines.append("  Characteristics:")
            report_lines.append(f"    N samples: {group_row['n_samples']}")
            report_lines.append(f"    Class 1 ratio: {group_row['class_1_ratio']:.2%}")
            report_lines.append(f"    Feature mean: {group_row['feature_mean']:.4f}")

            # Recommendations
            report_lines.append("  Recommendations:")
            if group_row['n_samples'] < 30:
                report_lines.append("    ⚠️ Small sample size - consider getting more data")
            if group_row['class_1_ratio'] < 0.1 or group_row['class_1_ratio'] > 0.9:
                report_lines.append("    ⚠️ Extreme class imbalance - check label quality")
            if abs(group_row['feature_mean']) > 2:
                report_lines.append("    ⚠️ Feature distribution differs - check data pipeline")

        # Add comparative statistics
        df = characteristic_analysis
        problem_df = df[df['is_problem']]
        normal_df = df[~df['is_problem']]

        if len(problem_df) > 0 and len(normal_df) > 0:
            report_lines.extend([
                "",
                "=" * 60,
                "COMPARATIVE ANALYSIS: Problem vs. Normal Groups",
                "=" * 60
            ])
            for col in ['n_samples', 'class_1_ratio', 'feature_mean', 'feature_std']:
                p_mean = problem_df[col].mean()
                n_mean = normal_df[col].mean()
                diff = p_mean - n_mean
                report_lines.append(
                    f"  {col:20s}: Problem={p_mean:.3f}, Normal={n_mean:.3f}, Δ={diff:+.3f}"
                )

        return "\n".join(report_lines)


# Demonstration
if __name__ == "__main__":
    np.random.seed(42)

    # Create synthetic LOGO results with some problem groups
    group_scores = {
        'Hospital_A': 0.92,
        'Hospital_B': 0.88,
        'Hospital_C': 0.65,  # Problem group
        'Hospital_D': 0.85,
        'Hospital_E': 0.87,
        'Hospital_F': 0.58,  # Problem group
        'Hospital_G': 0.90,
        'Hospital_H': 0.91,
        'Hospital_I': 0.84,
        'Hospital_J': 0.86,
    }

    # Simulate feature data
    n_per_group = {g: np.random.randint(50, 150) for g in group_scores}
    X_list, y_list, groups_list = [], [], []
    for group, score in group_scores.items():
        n = n_per_group[group]
        # Problem groups have different feature distributions
        if score < 0.7:
            X_g = np.random.randn(n, 10) + 1.5  # Shifted
        else:
            X_g = np.random.randn(n, 10)
        y_g = (np.random.random(n) < score).astype(int)
        X_list.append(X_g)
        y_list.append(y_g)
        groups_list.extend([group] * n)

    X = np.vstack(X_list)
    y = np.concatenate(y_list)
    groups = np.array(groups_list)

    # Analyze failures
    analyzer = LOGOFailureModeAnalyzer(
        threshold_method="zscore",
        zscore_threshold=1.5
    )
    problems = analyzer.identify_problem_groups(group_scores)
    characteristics = analyzer.analyze_group_characteristics(X, y, groups, problems)
    report = analyzer.generate_remediation_report(problems, characteristics)
    print(report)
```

LOGO requires G model trainings, which can be prohibitive for large G or expensive models. Here are strategies to manage computational cost:
Strategy 1: Parallelization
LOGO iterations are embarrassingly parallel—each can run independently. Use joblib, multiprocessing, or distributed computing:
```python
from joblib import Parallel, delayed

results = Parallel(n_jobs=-1)(
    delayed(train_and_evaluate)(model, X[train], y[train], X[test], y[test])
    for train, test in logo.split(X, y, groups)
)
```
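The snippet above assumes a `train_and_evaluate` helper that isn't shown; a minimal version might look like the following, where the logistic-regression model and accuracy metric are illustrative choices:

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut


def train_and_evaluate(model, X_train, y_train, X_test, y_test):
    """Fit a fresh clone on one fold and return its test accuracy."""
    fitted = clone(model).fit(X_train, y_train)
    return fitted.score(X_test, y_test)


# Tiny synthetic example: 5 groups of 40 samples, run LOGO in parallel
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)
groups = np.repeat(np.arange(5), 40)

logo = LeaveOneGroupOut()
model = LogisticRegression()
scores = Parallel(n_jobs=-1)(
    delayed(train_and_evaluate)(model, X[tr], y[tr], X[te], y[te])
    for tr, te in logo.split(X, y, groups)
)
print(f"Per-group accuracies: {[f'{s:.2f}' for s in scores]}")
```

Because each fold is independent, the only shared state is the read-only data, which joblib memory-maps to worker processes rather than copying.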
Strategy 2: Progressive Evaluation
Start with a subset of groups, recompute a confidence interval as each new group's score arrives, and stop early once the interval is narrow enough for your decision.
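A sketch of this idea, assuming a `group_score_fn` callable that returns one group's held-out score (simulated here rather than trained), with an arbitrary CI-width tolerance:

```python
import numpy as np
from scipy import stats


def progressive_logo(group_score_fn, group_ids, ci_width_tol=0.05,
                     min_groups=5, confidence=0.95, seed=0):
    """Evaluate groups one at a time; stop once the t-interval is narrow enough.

    group_score_fn(g) is assumed to train on all groups except g
    and return g's held-out score.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(group_ids)  # random order avoids systematic bias
    scores = []
    for g in order:
        scores.append(group_score_fn(g))
        k = len(scores)
        if k >= min_groups:
            se = np.std(scores, ddof=1) / np.sqrt(k)
            width = 2 * stats.t.ppf(1 - (1 - confidence) / 2, df=k - 1) * se
            if width <= ci_width_tol:
                break  # estimate is precise enough; skip remaining groups
    return np.mean(scores), len(scores)


# Simulated per-group scores stand in for real model training
rng = np.random.default_rng(1)
true_scores = dict(enumerate(rng.normal(0.85, 0.01, size=100)))
mean, n_used = progressive_logo(true_scores.get, list(true_scores))
print(f"Mean {mean:.3f} after {n_used}/100 groups")
```

The random evaluation order matters: processing groups in a fixed (e.g., alphabetical) order can bias the early estimate if group labels correlate with difficulty.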
Strategy 3: Model Caching
Training sets overlap heavily across folds—any two LOGO training sets share G−2 of their G−1 groups—so consider caching computation that doesn't depend on the fold, such as per-sample feature extraction or pairwise distance matrices, so it runs once rather than G times.
Strategy 4: Approximate LOGO
For very large G, use pseudo-LOGO: merge the G groups into k super-groups and leave one super-group out per iteration, cutting the number of trainings from G to k at the cost of per-group detail.
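One possible sketch of pseudo-LOGO, using a round-robin assignment of groups to super-groups (the assignment scheme is an arbitrary choice; hashing or clustering similar groups would also work):

```python
import numpy as np


def pseudo_logo_splits(groups, k=10):
    """Yield LOGO-style splits over k super-groups instead of G groups.

    Reduces iterations from G to k, at the cost of per-group diagnostics.
    """
    groups = np.asarray(groups)
    unique = np.unique(groups)
    # Assign each real group to a super-group (round-robin keeps sizes even)
    super_of = {g: i % k for i, g in enumerate(unique)}
    sample_super = np.array([super_of[g] for g in groups])
    indices = np.arange(len(groups))
    for s in range(k):
        test = indices[sample_super == s]
        train = indices[sample_super != s]
        yield train, test


# 1000 groups of 5 samples each -> only 10 folds instead of 1000
groups = np.repeat(np.arange(1000), 5)
folds = list(pseudo_logo_splits(groups, k=10))
print(f"{len(folds)} folds; test sizes: {sorted({len(te) for _, te in folds})}")
```

Every group still appears in exactly one test fold, so no group leaks between train and test; what is lost is the one-score-per-group diagnostic granularity.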
| Strategy | Speedup | Trade-off | Best For |
|---|---|---|---|
| Parallelization | Up to n_cpus× | Memory usage | Any computational budget |
| Progressive evaluation | Variable | May stop too early | Wide CIs acceptable |
| Model caching | 2-5× | Implementation complexity | Expensive model training |
| Approximate LOGO | G/k× | Loss of per-group detail | Very large G (1000+) |
Cloud platforms allow spinning up G instances in parallel, running one fold each, and aggregating results. For 100 groups with 10-minute model training, wall-clock time drops from 1000 minutes to ~15 minutes (including overhead). The marginal cost of parallelization often justifies rigorous evaluation.
We've thoroughly explored Leave-One-Group-Out cross-validation—the most granular form of group-aware evaluation. The essential takeaways:

- LOGO runs G iterations, each testing on exactly one held-out group: low bias (maximum training data) but high variance (small, single-group test sets).
- Per-group scores are the key payoff: they identify exactly which groups the model fails on, enabling targeted remediation.
- Prefer LOGO when G is moderate (roughly 10-50), per-group reporting is required, and training is cheap; prefer Group K-Fold when G is large or training is expensive.
- Construct confidence intervals with group-aware methods (group bootstrap or a t-interval with G−1 degrees of freedom), not formulas that assume independent fold errors.
Both stratification (preserving class proportions) and group handling (preventing leakage) are critical—but what if you need both? The next page covers Grouped Stratification, techniques for maintaining class balance within groups while ensuring group separation across folds.