Consider a clinical trial with 30 hospitals. Your model must work at hospitals never seen during training. How do you evaluate this? You could use Group 5-Fold, but each test set would contain data from 6 hospitals mixed together—you'd never know if failure stems from one problematic hospital or a general weakness.
Leave-One-Group-Out (LOGO) cross-validation takes the most conservative approach: each fold uses exactly one group for testing and all others for training. With 30 hospitals, you get 30 iterations, each testing on a single hospital.
This approach provides per-group performance estimates (you know exactly which hospital fails), maximal training data in every fold, and a fully deterministic split with no partitioning randomness.
The cost? Computational intensity (G iterations instead of k) and high variance (single-group test sets may be small). Understanding when this tradeoff is worthwhile is essential for production ML.
By the end of this page, you will master LOGO's theoretical foundations, understand its variance properties versus Group K-Fold, know precisely when to use LOGO over other strategies, implement production-ready LOGO pipelines, and analyze group-level failure modes.
Definition: Leave-One-Group-Out Cross-Validation
Given a dataset $\mathcal{D} = \{(x_i, y_i, g_i)\}_{i=1}^{n}$ with $G$ unique groups, LOGO performs $G$ iterations:
For iteration $j$, where $j \in \{1, 2, \ldots, G\}$: train on $\mathcal{D}_{\text{train}}^{(j)} = \{(x_i, y_i) : g_i \neq j\}$ and test on $\mathcal{D}_{\text{test}}^{(j)} = \{(x_i, y_i) : g_i = j\}$.
The performance estimate is the average over all held-out groups:
$$\hat{\theta}_{\text{LOGO}} = \frac{1}{G} \sum_{j=1}^{G} \theta\left(\mathcal{D}_{\text{train}}^{(j)}, \mathcal{D}_{\text{test}}^{(j)}\right)$$
where $\theta(\cdot, \cdot)$ is the performance metric of the model trained on $\mathcal{D}_{\text{train}}^{(j)}$ and evaluated on $\mathcal{D}_{\text{test}}^{(j)}$.
Key Properties:
| Property | LOGO | Group K-Fold (k=5) |
|---|---|---|
| Number of iterations | G (one per group) | k (fixed) |
| Test set size | One group each | G/k groups each |
| Training set size | G-1 groups | (k-1)/k × G groups |
| Per-group diagnostics | Yes | No (groups are mixed) |
| Computational cost | G model trainings | k model trainings |
| Variance | High (small test sets) | Lower (larger, pooled test sets) |
| Bias | Low (maximum training) | Higher (less training data) |
The Bias-Variance Tradeoff in LOGO
LOGO maximizes training data (low bias) but creates high variance due to small test sets. The components:
Bias Component: With G-1 groups for training, LOGO uses $(G-1)/G \times 100\%$ of the data—typically 95%+ for G ≥ 20. This closely approximates the full-data model.
Variance Component: Each test set contains only one group. If group sizes vary, some iterations test on 10 samples while others test on 1000. The variance of the estimator is dominated by the smallest groups.
For group j with size $n_j$, the variance contribution scales as:
$$\text{Var}(\hat{\theta}_j) \propto \frac{1}{n_j}$$
The overall variance is approximately:
$$\text{Var}(\hat{\theta}_{\text{LOGO}}) \approx \frac{1}{G^2} \sum_{j=1}^{G} \text{Var}(\hat{\theta}_j) + \text{Var}_{\text{between-groups}}$$
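A quick numeric sketch makes this concrete. The group sizes and the assumed accuracy $p = 0.8$ below are illustrative, not from any real dataset; per-group score variance is approximated with the binomial formula $p(1-p)/n_j$:

```python
import numpy as np

# Illustrative group sizes: one small group among mostly large ones
group_sizes = np.array([10, 500, 500, 500, 500])
p = 0.8  # assumed true accuracy, identical for every group

# Binomial approximation: Var(score_j) ≈ p(1-p)/n_j
per_group_var = p * (1 - p) / group_sizes

# Each group contributes Var(score_j)/G² to the variance of the LOGO mean
G = len(group_sizes)
contributions = per_group_var / G**2

share_of_smallest = contributions[0] / contributions.sum()
print(f"Smallest group's share of total variance: {share_of_smallest:.1%}")
```

Even though the 10-sample group holds under 1% of the data, it accounts for the overwhelming majority of the estimator's variance—exactly the "dominated by the smallest groups" effect described above.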
High variance in LOGO isn't always problematic. If your goal is to understand model behavior on individual groups (e.g., 'which hospitals will struggle?'), the granular per-group estimates are exactly what you need, even if the aggregate mean has high uncertainty.
Let's implement LOGO from first principles, then compare with Scikit-learn's production implementation.
```python
import numpy as np
from collections import defaultdict
from typing import Tuple, Dict, Any, Iterator


class LeaveOneGroupOutCV:
    """
    Leave-One-Group-Out cross-validator from scratch.

    Provides detailed diagnostics not available in sklearn's implementation.
    """

    def __init__(self):
        self.group_info_: Dict[Any, Dict] = {}

    def split(
        self,
        X: np.ndarray,
        y: np.ndarray = None,
        groups: np.ndarray = None
    ) -> Iterator[Tuple[np.ndarray, np.ndarray]]:
        """
        Generate train/test indices for each group.

        Yields
        ------
        train_indices, test_indices for each group as test set
        """
        if groups is None:
            raise ValueError("groups must be provided for LeaveOneGroupOut")

        n_samples = len(groups)
        unique_groups = np.unique(groups)

        # Build index mapping
        group_to_indices = defaultdict(list)
        for idx, group in enumerate(groups):
            group_to_indices[group].append(idx)

        # Store group info for diagnostics
        for group in unique_groups:
            indices = group_to_indices[group]
            self.group_info_[group] = {
                'n_samples': len(indices),
                'indices': np.array(indices),
                'class_dist': dict(zip(*np.unique(y[indices], return_counts=True)))
                              if y is not None else None
            }

        # Generate splits
        all_indices = np.arange(n_samples)
        for group in unique_groups:
            test_indices = np.array(group_to_indices[group])
            train_mask = np.ones(n_samples, dtype=bool)
            train_mask[test_indices] = False
            train_indices = all_indices[train_mask]
            yield train_indices, test_indices

    def get_n_splits(self, X=None, y=None, groups=None) -> int:
        """Return number of splits (equals number of unique groups)."""
        if groups is None:
            raise ValueError("groups must be provided")
        return len(np.unique(groups))

    def get_group_diagnostics(self) -> Dict[Any, Dict]:
        """Return detailed information about each group."""
        return self.group_info_


def logo_evaluate_with_diagnostics(
    model,
    X: np.ndarray,
    y: np.ndarray,
    groups: np.ndarray,
    metric_func,
    return_predictions: bool = False
) -> Dict[str, Any]:
    """
    Perform LOGO CV with comprehensive diagnostics.

    Returns per-group scores, failure analysis, and optional predictions.
    """
    cv = LeaveOneGroupOutCV()
    group_results = {}
    all_predictions = np.zeros_like(y, dtype=float)

    for fold_idx, (train_idx, test_idx) in enumerate(cv.split(X, y, groups)):
        # Get group ID for this fold
        test_group = groups[test_idx[0]]  # All test samples have same group

        # Train model
        model_clone = clone_model(model)
        model_clone.fit(X[train_idx], y[train_idx])

        # Predict
        y_pred = model_clone.predict(X[test_idx])
        if hasattr(model_clone, 'predict_proba'):
            y_proba = model_clone.predict_proba(X[test_idx])[:, 1]
        else:
            y_proba = y_pred

        # Compute metric
        score = metric_func(y[test_idx], y_pred)

        # Store results
        group_results[test_group] = {
            'score': score,
            'n_samples': len(test_idx),
            'class_distribution': dict(zip(*np.unique(y[test_idx], return_counts=True))),
            'train_n_samples': len(train_idx),
            'predictions': y_pred if return_predictions else None,
            'probabilities': y_proba if return_predictions else None
        }

        if return_predictions:
            all_predictions[test_idx] = y_proba

    # Aggregate statistics
    scores = [r['score'] for r in group_results.values()]
    sizes = [r['n_samples'] for r in group_results.values()]

    # Weighted mean (by test set size)
    weighted_mean = np.average(scores, weights=sizes)

    # Identify problem groups
    mean_score = np.mean(scores)
    std_score = np.std(scores)
    problem_groups = {
        group: info for group, info in group_results.items()
        if info['score'] < mean_score - 2 * std_score
    }

    return {
        'group_results': group_results,
        'aggregate': {
            'mean': np.mean(scores),
            'std': np.std(scores),
            'weighted_mean': weighted_mean,
            'min': np.min(scores),
            'max': np.max(scores),
            'min_group': min(group_results, key=lambda g: group_results[g]['score']),
            'max_group': max(group_results, key=lambda g: group_results[g]['score']),
            'n_groups': len(group_results)
        },
        'problem_groups': problem_groups,
        'all_predictions': all_predictions if return_predictions else None
    }


def clone_model(model):
    """Simple model cloning for demonstration."""
    from sklearn.base import clone
    return clone(model)


# Demonstration
if __name__ == "__main__":
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    np.random.seed(42)

    # Create dataset with group structure
    n_groups = 12
    group_sizes = np.random.randint(20, 100, size=n_groups)

    X_list, y_list, groups_list = [], [], []
    for group_id, n_samples in enumerate(group_sizes):
        # Some groups are "harder" (noisier features)
        noise_scale = 0.5 + (group_id % 3) * 0.3  # Groups 2, 5, 8, 11 are hardest
        X_group = np.random.randn(n_samples, 10) * noise_scale
        y_group = (X_group[:, 0] + X_group[:, 1] > 0).astype(int)
        X_list.append(X_group)
        y_list.append(y_group)
        groups_list.extend([f"Hospital_{group_id}"] * n_samples)

    X = np.vstack(X_list)
    y = np.concatenate(y_list)
    groups = np.array(groups_list)

    print(f"Dataset: {len(y)} samples, {n_groups} groups")
    print()

    # Evaluate with LOGO
    model = RandomForestClassifier(n_estimators=50, random_state=42)
    results = logo_evaluate_with_diagnostics(
        model, X, y, groups,
        metric_func=accuracy_score,
        return_predictions=False
    )

    print("=" * 60)
    print("LEAVE-ONE-GROUP-OUT CROSS-VALIDATION RESULTS")
    print("=" * 60)

    print("Aggregate Performance:")
    agg = results['aggregate']
    print(f"  Mean accuracy: {agg['mean']:.4f} ± {agg['std']:.4f}")
    print(f"  Weighted mean: {agg['weighted_mean']:.4f}")
    print(f"  Range: [{agg['min']:.4f}, {agg['max']:.4f}]")
    print(f"  Best group:  {agg['max_group']} ({agg['max']:.4f})")
    print(f"  Worst group: {agg['min_group']} ({agg['min']:.4f})")

    print("Per-Group Details:")
    for group, info in sorted(results['group_results'].items()):
        status = "⚠️" if group in results['problem_groups'] else "✓"
        print(f"  {group}: {info['score']:.4f} ({info['n_samples']} samples) {status}")

    if results['problem_groups']:
        print("Problem Groups (score < mean - 2σ):")
        for group, info in results['problem_groups'].items():
            print(f"  {group}: {info['score']:.4f}")
```

Production Usage with Scikit-learn
For production pipelines, scikit-learn's LeaveOneGroupOut integrates seamlessly with cross_validate:
```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_validate
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import make_scorer, f1_score


def production_logo_evaluation(model, X, y, groups):
    """
    Production-ready LOGO evaluation with sklearn.
    """
    # Create preprocessing pipeline
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('model', model)
    ])

    # Define LOGO cross-validator
    logo = LeaveOneGroupOut()

    # Define metrics
    scoring = {
        'accuracy': 'accuracy',
        'f1': make_scorer(f1_score, average='binary', zero_division=0),
        'roc_auc': 'roc_auc'
    }

    # Perform cross-validation
    # CRITICAL: Pass groups= explicitly!
    results = cross_validate(
        pipeline, X, y,
        cv=logo,
        groups=groups,  # <-- Don't forget this!
        scoring=scoring,
        return_train_score=True,
        return_estimator=True,  # Keep models for analysis
        n_jobs=-1
    )

    # Map results to groups (folds follow the sorted order of unique groups)
    unique_groups = np.unique(groups)
    group_scores = {}
    for i, group in enumerate(unique_groups):
        group_scores[group] = {
            metric: results[f'test_{metric}'][i]
            for metric in scoring.keys()
        }

    return results, group_scores
```

Choosing between LOGO and Group K-Fold depends on your goals, constraints, and data characteristics.
| Factor | Favors LOGO | Favors Group K-Fold |
|---|---|---|
| Number of groups | 10-50 | 50-10,000+ |
| Samples per group | 50-10,000 | <50 |
| Per-group reporting | Required | Not needed |
| Computational budget | Generous | Tight |
| Group heterogeneity | High (want to see each) | Low (groups similar) |
| Model training time | Fast (<1 min) | Slow (hours) |
For large G with meaningful subgroups, consider hierarchical approaches: Use Group K-Fold at the top level (e.g., regions) and LOGO within each fold to diagnose individual groups (e.g., hospitals within regions). This balances computational cost with diagnostic granularity.
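The hierarchical idea can be sketched with scikit-learn's built-in splitters. The two-level region/hospital labels below are synthetic, and the nested loop structure is one possible arrangement rather than a fixed recipe:

```python
import numpy as np
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut

rng = np.random.default_rng(0)

# Hypothetical two-level grouping: 4 regions, each with 3 hospitals
n = 400
regions = rng.integers(0, 4, size=n)
hospitals = regions * 10 + rng.integers(0, 3, size=n)  # hospital IDs nested in regions
X, y = rng.normal(size=(n, 5)), rng.integers(0, 2, size=n)

outer = GroupKFold(n_splits=2)  # top level: split by region
inner = LeaveOneGroupOut()      # within each fold: diagnose individual hospitals

held_out_hospitals = []
for outer_train, outer_test in outer.split(X, y, groups=regions):
    h = hospitals[outer_train]
    for inner_train, inner_test in inner.split(X[outer_train], y[outer_train], h):
        # Every inner test fold contains exactly one hospital
        assert np.unique(h[inner_test]).size == 1
        held_out_hospitals.append(h[inner_test[0]])
        # train/evaluate here on the inner split

print(f"Inner diagnostic folds run: {len(held_out_hospitals)}")
```

With 4 regions and 2 outer folds, each outer training fold contains 2 regions (6 hospitals), so only 12 inner trainings are needed—far fewer than LOGO over every hospital on the full dataset would require at larger scale.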
Understanding LOGO's variance properties is crucial for interpreting results and constructing valid confidence intervals.
Sources of Variance
The total variance of the LOGO estimator has two components:
$$\text{Var}(\hat{\theta}_{\text{LOGO}}) = \text{Var}_{\text{between-groups}} + \text{Var}_{\text{within-groups}}$$
For small groups, within-group variance dominates. For large groups, between-group variance dominates.
Confidence Interval Construction
Standard formulas underestimate uncertainty because they assume independent fold errors. For LOGO, use:
Method 1: Group-Bootstrap Resample groups (not samples) with replacement, compute mean, repeat.
Method 2: Jackknife-After-Bootstrap Combine jackknife influence diagnostics with bootstrap standard errors for better coverage.
Method 3: Corrected t-Interval Use a t-distribution with G-1 degrees of freedom:
$$\hat{\theta} \pm t_{G-1, 1-\alpha/2} \times \frac{s}{\sqrt{G}}$$
where $s$ is the standard deviation of group-level scores.
```python
import numpy as np
from scipy import stats
from typing import List, Tuple


def logo_confidence_interval(
    group_scores: List[float],
    group_sizes: List[int] = None,
    method: str = "bootstrap",
    confidence: float = 0.95,
    n_bootstrap: int = 10000,
    random_state: int = 42
) -> Tuple[float, float, float]:
    """
    Compute confidence interval for LOGO estimate.

    Parameters
    ----------
    group_scores : List[float]
        Performance metric for each group
    group_sizes : List[int], optional
        Number of samples in each group (for weighted estimates)
    method : str
        "bootstrap", "t", or "weighted_bootstrap"
    confidence : float
        Confidence level (e.g., 0.95 for 95% CI)

    Returns
    -------
    (mean, ci_lower, ci_upper)
    """
    np.random.seed(random_state)
    scores = np.array(group_scores)
    G = len(scores)
    alpha = 1 - confidence

    if method == "t":
        # Corrected t-interval (assumes approximately normal group scores)
        mean = np.mean(scores)
        se = np.std(scores, ddof=1) / np.sqrt(G)
        t_crit = stats.t.ppf(1 - alpha / 2, df=G - 1)
        return mean, mean - t_crit * se, mean + t_crit * se

    elif method == "bootstrap":
        # Group-level bootstrap: resample groups, not samples
        bootstrap_means = []
        for _ in range(n_bootstrap):
            resampled = scores[np.random.randint(0, G, size=G)]
            bootstrap_means.append(np.mean(resampled))
        bootstrap_means = np.array(bootstrap_means)
        return (
            np.mean(scores),
            np.percentile(bootstrap_means, 100 * alpha / 2),
            np.percentile(bootstrap_means, 100 * (1 - alpha / 2))
        )

    elif method == "weighted_bootstrap":
        if group_sizes is None:
            raise ValueError("group_sizes required for weighted bootstrap")
        sizes = np.array(group_sizes)

        # Weighted mean
        weighted_mean = np.average(scores, weights=sizes)

        # Bootstrap with group-size weights
        bootstrap_means = []
        for _ in range(n_bootstrap):
            resampled_idx = np.random.choice(G, size=G, replace=True)
            resampled_mean = np.average(
                scores[resampled_idx], weights=sizes[resampled_idx]
            )
            bootstrap_means.append(resampled_mean)
        bootstrap_means = np.array(bootstrap_means)
        return (
            weighted_mean,
            np.percentile(bootstrap_means, 100 * alpha / 2),
            np.percentile(bootstrap_means, 100 * (1 - alpha / 2))
        )

    else:
        raise ValueError(f"Unknown method: {method}")


def analyze_variance_components(
    group_scores: List[float],
    group_sizes: List[int]
) -> dict:
    """
    Decompose total variance into between-group and within-group components.

    Uses a one-way ANOVA-style decomposition.
    """
    scores = np.array(group_scores)
    sizes = np.array(group_sizes)
    G = len(scores)

    # Overall mean (unweighted)
    grand_mean = np.mean(scores)

    # Between-group variance (variance of group means)
    between_variance = np.var(scores, ddof=1)

    # Estimate within-group variance from group size
    # Assuming binomial-like sampling: Var(accuracy) ≈ p(1-p)/n
    estimated_within_var = np.mean([
        s * (1 - s) / sz for s, sz in zip(scores, sizes)
    ])

    # For the LOGO mean: Var ≈ between-group variance / G
    return {
        'grand_mean': grand_mean,
        'between_group_variance': between_variance,
        'estimated_within_group_variance': estimated_within_var,
        'variance_of_mean_estimate': between_variance / G,
        'se_of_mean': np.sqrt(between_variance / G),
        'cv_of_scores': np.std(scores) / np.mean(scores),  # Coefficient of variation
        'min_score': np.min(scores),
        'max_score': np.max(scores),
        'range': np.max(scores) - np.min(scores)
    }


# Demonstration
if __name__ == "__main__":
    np.random.seed(42)

    # Simulate LOGO results with varying group sizes
    n_groups = 15
    group_sizes = np.random.randint(30, 200, size=n_groups)

    # True performance varies by group, plus sampling noise
    true_group_performance = np.random.normal(0.85, 0.05, size=n_groups)

    # Add sampling noise proportional to 1/sqrt(n)
    observed_scores = (
        true_group_performance
        + np.random.normal(0, 0.1, size=n_groups) / np.sqrt(group_sizes)
    )
    observed_scores = np.clip(observed_scores, 0, 1)  # Keep in valid range

    print("Group-level LOGO Results:")
    for i, (score, size) in enumerate(zip(observed_scores, group_sizes)):
        print(f"  Group {i+1}: {score:.4f} (n={size})")
    print()

    # Compare confidence intervals from the different methods
    print("Confidence Interval Comparison:")
    for method in ["t", "bootstrap", "weighted_bootstrap"]:
        if method == "weighted_bootstrap":
            mean, ci_low, ci_high = logo_confidence_interval(
                observed_scores.tolist(), group_sizes.tolist(), method=method
            )
        else:
            mean, ci_low, ci_high = logo_confidence_interval(
                observed_scores.tolist(), method=method
            )
        width = ci_high - ci_low
        print(f"  {method:20s}: {mean:.4f} [{ci_low:.4f}, {ci_high:.4f}] (width={width:.4f})")
    print()

    # Variance analysis
    variance_analysis = analyze_variance_components(
        observed_scores.tolist(), group_sizes.tolist()
    )
    print("Variance Analysis:")
    for key, value in variance_analysis.items():
        if isinstance(value, float):
            print(f"  {key}: {value:.6f}")
```

When some groups have <20 samples, their individual score estimates have high variance. This doesn't mean the group is inherently problematic—it means you don't have enough data to tell. Consider flagging these for follow-up data collection rather than trusting the score.
One of LOGO's primary advantages is identifying which groups cause problems. Let's develop a systematic approach to failure mode analysis.
Step 1: Identify Outlier Groups
Define problem groups as those with scores significantly below the mean—for example, any group whose score falls below $\bar{\theta} - 2s$, where $\bar{\theta}$ and $s$ are the mean and standard deviation of the group-level scores.
Step 2: Characterize Problem Groups
For each problem group, analyze its sample size, class distribution, and feature statistics relative to the remaining groups; a shift in any of these often explains the performance gap.
Step 3: Distinguish Technical from Data Issues
A shifted feature distribution usually points to a pipeline or instrumentation difference, while extreme class imbalance or implausible labels point to data-quality problems within that group.
Step 4: Prioritize Remediation
Rank problem groups by severity—the gap between the group's score and the mean—weighted by how reliable each estimate is (larger groups give more trustworthy scores).
```python
import numpy as np
import pandas as pd
from typing import Dict


class LOGOFailureModeAnalyzer:
    """
    Systematic failure mode analysis for LOGO results.
    """

    def __init__(
        self,
        threshold_method: str = "zscore",
        zscore_threshold: float = 2.0,
        quantile_threshold: float = 0.1,
        minimum_score: float = None
    ):
        self.threshold_method = threshold_method
        self.zscore_threshold = zscore_threshold
        self.quantile_threshold = quantile_threshold
        self.minimum_score = minimum_score

    def identify_problem_groups(
        self, group_scores: Dict[str, float]
    ) -> Dict[str, Dict]:
        """Identify groups performing significantly below average."""
        scores = np.array(list(group_scores.values()))
        mean_score = np.mean(scores)
        std_score = np.std(scores)

        problems = {}
        for group, score in group_scores.items():
            issues = []

            # Z-score check
            zscore = (score - mean_score) / std_score if std_score > 0 else 0
            if zscore < -self.zscore_threshold:
                issues.append(f"zscore = {zscore:.2f}")

            # Quantile check
            percentile = (scores < score).mean()
            if percentile < self.quantile_threshold:
                issues.append(f"bottom {percentile*100:.1f}th percentile")

            # Minimum threshold check
            if self.minimum_score and score < self.minimum_score:
                issues.append(f"below minimum ({self.minimum_score})")

            if issues:
                problems[group] = {
                    'score': score,
                    'zscore': zscore,
                    'percentile': percentile,
                    'issues': issues,
                    'gap_from_mean': mean_score - score
                }

        return problems

    def analyze_group_characteristics(
        self,
        X: np.ndarray,
        y: np.ndarray,
        groups: np.ndarray,
        problem_groups: Dict[str, Dict],
        group_metadata: Dict[str, Dict] = None
    ) -> pd.DataFrame:
        """
        Compare feature distributions between problem and non-problem groups.
        """
        problem_set = set(problem_groups.keys())
        analysis_rows = []

        for group in np.unique(groups):
            mask = groups == group
            X_group = X[mask]
            y_group = y[mask]

            row = {
                'group': group,
                'is_problem': group in problem_set,
                'n_samples': len(y_group),
                'class_1_ratio': np.mean(y_group),
                'feature_mean': np.mean(X_group),
                'feature_std': np.std(X_group),
                'feature_min': np.min(X_group),
                'feature_max': np.max(X_group),
            }

            # Add per-feature statistics (first 5 features)
            for feat_idx in range(min(X.shape[1], 5)):
                row[f'feat_{feat_idx}_mean'] = np.mean(X_group[:, feat_idx])
                row[f'feat_{feat_idx}_std'] = np.std(X_group[:, feat_idx])

            # Add metadata if available
            if group_metadata and group in group_metadata:
                row.update({f'meta_{k}': v for k, v in group_metadata[group].items()})

            analysis_rows.append(row)

        return pd.DataFrame(analysis_rows)

    def generate_remediation_report(
        self,
        problem_groups: Dict[str, Dict],
        characteristic_analysis: pd.DataFrame
    ) -> str:
        """Generate actionable remediation recommendations."""
        report_lines = [
            "=" * 60,
            "LOGO FAILURE MODE ANALYSIS REPORT",
            "=" * 60,
            f"Total problem groups identified: {len(problem_groups)}",
            ""
        ]

        # Sort by severity (gap from mean)
        sorted_problems = sorted(
            problem_groups.items(),
            key=lambda x: x[1]['gap_from_mean'],
            reverse=True
        )

        for group, info in sorted_problems:
            group_row = characteristic_analysis[
                characteristic_analysis['group'] == group
            ].iloc[0]

            report_lines.append("─" * 40)
            report_lines.append(f"Group: {group}")
            report_lines.append(f"  Score: {info['score']:.4f} (z={info['zscore']:.2f})")
            report_lines.append(f"  Gap from mean: {info['gap_from_mean']:.4f}")
            report_lines.append(f"  Issues: {', '.join(info['issues'])}")
            report_lines.append("  Characteristics:")
            report_lines.append(f"    N samples: {group_row['n_samples']}")
            report_lines.append(f"    Class 1 ratio: {group_row['class_1_ratio']:.2%}")
            report_lines.append(f"    Feature mean: {group_row['feature_mean']:.4f}")

            # Recommendations
            report_lines.append("  Recommendations:")
            if group_row['n_samples'] < 30:
                report_lines.append("    ⚠️ Small sample size - consider getting more data")
            if group_row['class_1_ratio'] < 0.1 or group_row['class_1_ratio'] > 0.9:
                report_lines.append("    ⚠️ Extreme class imbalance - check label quality")
            if abs(group_row['feature_mean']) > 2:
                report_lines.append("    ⚠️ Feature distribution differs - check data pipeline")

        # Add comparative statistics
        df = characteristic_analysis
        problem_df = df[df['is_problem']]
        normal_df = df[~df['is_problem']]

        if len(problem_df) > 0 and len(normal_df) > 0:
            report_lines.extend([
                "",
                "=" * 60,
                "COMPARATIVE ANALYSIS: Problem vs. Normal Groups",
                "=" * 60
            ])
            for col in ['n_samples', 'class_1_ratio', 'feature_mean', 'feature_std']:
                p_mean = problem_df[col].mean()
                n_mean = normal_df[col].mean()
                diff = p_mean - n_mean
                report_lines.append(
                    f"  {col:20s}: Problem={p_mean:.3f}, Normal={n_mean:.3f}, Δ={diff:+.3f}"
                )

        return "\n".join(report_lines)


# Demonstration
if __name__ == "__main__":
    np.random.seed(42)

    # Create synthetic LOGO results with some problem groups
    group_scores = {
        'Hospital_A': 0.92,
        'Hospital_B': 0.88,
        'Hospital_C': 0.65,  # Problem group
        'Hospital_D': 0.85,
        'Hospital_E': 0.87,
        'Hospital_F': 0.58,  # Problem group
        'Hospital_G': 0.90,
        'Hospital_H': 0.91,
        'Hospital_I': 0.84,
        'Hospital_J': 0.86,
    }

    # Simulate feature data
    n_per_group = {g: np.random.randint(50, 150) for g in group_scores}
    X_list, y_list, groups_list = [], [], []
    for group, score in group_scores.items():
        n = n_per_group[group]
        # Problem groups have different feature distributions
        if score < 0.7:
            X_g = np.random.randn(n, 10) + 1.5  # Shifted
        else:
            X_g = np.random.randn(n, 10)
        y_g = (np.random.random(n) < score).astype(int)
        X_list.append(X_g)
        y_list.append(y_g)
        groups_list.extend([group] * n)

    X = np.vstack(X_list)
    y = np.concatenate(y_list)
    groups = np.array(groups_list)

    # Analyze failures
    analyzer = LOGOFailureModeAnalyzer(
        threshold_method="zscore",
        zscore_threshold=1.5
    )
    problems = analyzer.identify_problem_groups(group_scores)
    characteristics = analyzer.analyze_group_characteristics(X, y, groups, problems)
    report = analyzer.generate_remediation_report(problems, characteristics)
    print(report)
```

LOGO requires G model trainings, which can be prohibitive for large G or expensive models. Here are strategies to manage computational cost:
Strategy 1: Parallelization
LOGO iterations are embarrassingly parallel—each can run independently. Use joblib, multiprocessing, or distributed computing:
```python
from joblib import Parallel, delayed

results = Parallel(n_jobs=-1)(
    delayed(train_and_evaluate)(model, X[train], y[train], X[test], y[test])
    for train, test in logo.split(X, y, groups)
)
```
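The snippet above assumes a `train_and_evaluate` helper that isn't shown; a minimal version might look like the following, where the logistic-regression model and accuracy metric are illustrative choices:

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut


def train_and_evaluate(model, X_train, y_train, X_test, y_test):
    """Fit a fresh clone on one fold and return its test accuracy."""
    fitted = clone(model).fit(X_train, y_train)
    return fitted.score(X_test, y_test)


# Tiny synthetic example: 5 groups of 40 samples, run LOGO in parallel
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)
groups = np.repeat(np.arange(5), 40)

logo = LeaveOneGroupOut()
model = LogisticRegression()
scores = Parallel(n_jobs=-1)(
    delayed(train_and_evaluate)(model, X[tr], y[tr], X[te], y[te])
    for tr, te in logo.split(X, y, groups)
)
print(f"Per-group accuracies: {[f'{s:.2f}' for s in scores]}")
```

Because each fold is independent, the only shared state is the read-only data, which joblib memory-maps to worker processes rather than copying.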
Strategy 2: Progressive Evaluation
Start with a subset of groups, recompute a confidence interval as each new group's score arrives, and stop early once the interval is narrow enough for your decision.
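A sketch of this idea, assuming a `group_score_fn` callable that returns one group's held-out score (simulated here rather than trained), with an arbitrary CI-width tolerance:

```python
import numpy as np
from scipy import stats


def progressive_logo(group_score_fn, group_ids, ci_width_tol=0.05,
                     min_groups=5, confidence=0.95, seed=0):
    """Evaluate groups one at a time; stop once the t-interval is narrow enough.

    group_score_fn(g) is assumed to train on all groups except g
    and return g's held-out score.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(group_ids)  # random order avoids systematic bias
    scores = []
    for g in order:
        scores.append(group_score_fn(g))
        k = len(scores)
        if k >= min_groups:
            se = np.std(scores, ddof=1) / np.sqrt(k)
            width = 2 * stats.t.ppf(1 - (1 - confidence) / 2, df=k - 1) * se
            if width <= ci_width_tol:
                break  # estimate is precise enough; skip remaining groups
    return np.mean(scores), len(scores)


# Simulated per-group scores stand in for real model training
rng = np.random.default_rng(1)
true_scores = dict(enumerate(rng.normal(0.85, 0.01, size=100)))
mean, n_used = progressive_logo(true_scores.get, list(true_scores))
print(f"Mean {mean:.3f} after {n_used}/100 groups")
```

The random evaluation order matters: processing groups in a fixed (e.g., alphabetical) order can bias the early estimate if group labels correlate with difficulty.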
Strategy 3: Model Caching
Training sets overlap heavily across folds—any two LOGO training sets share G−2 of their G−1 groups—so consider caching computation that doesn't depend on the fold, such as per-sample feature extraction or pairwise distance matrices, so it runs once rather than G times.
Strategy 4: Approximate LOGO
For very large G, use pseudo-LOGO: merge the G groups into k super-groups and leave one super-group out per iteration, cutting the number of trainings from G to k at the cost of per-group detail.
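One possible sketch of pseudo-LOGO, using a round-robin assignment of groups to super-groups (the assignment scheme is an arbitrary choice; hashing or clustering similar groups would also work):

```python
import numpy as np


def pseudo_logo_splits(groups, k=10):
    """Yield LOGO-style splits over k super-groups instead of G groups.

    Reduces iterations from G to k, at the cost of per-group diagnostics.
    """
    groups = np.asarray(groups)
    unique = np.unique(groups)
    # Assign each real group to a super-group (round-robin keeps sizes even)
    super_of = {g: i % k for i, g in enumerate(unique)}
    sample_super = np.array([super_of[g] for g in groups])
    indices = np.arange(len(groups))
    for s in range(k):
        test = indices[sample_super == s]
        train = indices[sample_super != s]
        yield train, test


# 1000 groups of 5 samples each -> only 10 folds instead of 1000
groups = np.repeat(np.arange(1000), 5)
folds = list(pseudo_logo_splits(groups, k=10))
print(f"{len(folds)} folds; test sizes: {sorted({len(te) for _, te in folds})}")
```

Every group still appears in exactly one test fold, so no group leaks between train and test; what is lost is the one-score-per-group diagnostic granularity.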
| Strategy | Speedup | Trade-off | Best For |
|---|---|---|---|
| Parallelization | Up to n_cpus× | Memory usage | Any computational budget |
| Progressive evaluation | Variable | May stop too early | Wide CIs acceptable |
| Model caching | 2-5× | Implementation complexity | Expensive model training |
| Approximate LOGO | G/k× | Loss of per-group detail | Very large G (1000+) |
Cloud platforms allow spinning up G instances in parallel, running one fold each, and aggregating results. For 100 groups with 10-minute model training, wall-clock time drops from 1000 minutes to ~15 minutes (including overhead). The marginal cost of parallelization often justifies rigorous evaluation.
We've thoroughly explored Leave-One-Group-Out cross-validation—the most granular form of group-aware evaluation. The essential takeaways:

- LOGO runs G iterations, each testing on exactly one held-out group: low bias (maximum training data) but high variance (small, single-group test sets).
- Per-group scores are the key payoff: they identify exactly which groups the model fails on, enabling targeted remediation.
- Prefer LOGO when G is moderate (roughly 10-50), per-group reporting is required, and training is cheap; prefer Group K-Fold when G is large or training is expensive.
- Construct confidence intervals with group-aware methods (group bootstrap or a t-interval with G−1 degrees of freedom), not formulas that assume independent fold errors.
Both stratification (preserving class proportions) and group handling (preventing leakage) are critical—but what if you need both? The next page covers Grouped Stratification, techniques for maintaining class balance within groups while ensuring group separation across folds.