You've trained several models, computed their ROC and PR curves, and calculated AUC and Average Precision. Now comes the critical question: Which model should you deploy?
This question seems simple when one model dominates everywhere, but reality is rarely so clean: curves cross, differences can be small enough to be statistical noise, and ROC and PR metrics may disagree about which model is best.
Curve comparison is the art and science of extracting actionable insights from these complex situations, turning visualization and statistics into defensible model selection decisions.
By the end of this page, you will master visual comparison techniques for curves, statistical tests for significant differences, strategies for handling crossing curves, multi-model comparison frameworks, and decision guides based on operational requirements.
The ideal scenario for model comparison is when one classifier unambiguously dominates another across all operating points.
Classifier A ROC-dominates classifier B if TPR_A(f) ≥ TPR_B(f) at every false positive rate f in [0, 1], with strict inequality for at least one f.
Interpretation: At every possible false positive rate, A achieves at least as high a true positive rate as B (and strictly better somewhere).
Classifier A PR-dominates classifier B if Precision_A(r) ≥ Precision_B(r) at every achievable recall level r, with strict inequality for at least one r.
Interpretation: At every desired recall level, A achieves at least as high precision as B.
A fundamental result connects ROC and PR dominance:
If classifier A dominates B in ROC space, then A also dominates B in PR space.
For a fixed dataset (fixed numbers of positives and negatives), the converse also holds: dominance in PR space implies dominance in ROC space (Davis & Goadrich, 2006).
Implication: on a given test set, checking dominance in either space is enough. The interesting differences appear when neither curve dominates; class imbalance can then make the two spaces tell very different stories, and AUC and Average Precision may even disagree on which model is better.
When neither classifier dominates, we have a partial ordering:
This is the common case in practice. Curves cross, and the 'better' model depends on the operating region that matters for your application.
```python
import numpy as np
from sklearn.metrics import roc_curve, precision_recall_curve


def check_roc_dominance(fpr_a, tpr_a, fpr_b, tpr_b, n_points=100):
    """
    Check if curve A dominates curve B in ROC space.

    Returns:
    --------
    'A_dominates' : A strictly dominates B
    'B_dominates' : B strictly dominates A
    'neither'     : Curves cross (partial ordering)
    'equal'       : Curves are identical
    """
    # Interpolate both curves at common FPR points
    fpr_common = np.linspace(0, 1, n_points)
    tpr_a_interp = np.interp(fpr_common, fpr_a, tpr_a)
    tpr_b_interp = np.interp(fpr_common, fpr_b, tpr_b)

    diff = tpr_a_interp - tpr_b_interp
    a_better = np.sum(diff > 1e-6)   # A strictly better
    b_better = np.sum(diff < -1e-6)  # B strictly better

    if a_better > 0 and b_better == 0:
        return 'A_dominates'
    elif b_better > 0 and a_better == 0:
        return 'B_dominates'
    elif a_better == 0 and b_better == 0:
        return 'equal'
    else:
        return 'neither'


def check_pr_dominance(prec_a, rec_a, prec_b, rec_b, n_points=100):
    """
    Check if curve A dominates curve B in PR space.
    """
    # Interpolate at common recall points
    rec_common = np.linspace(0, 1, n_points)

    # For PR curves, interpolate carefully (use max precision at recall >= r)
    def interpolate_pr(prec, rec, rec_query):
        result = np.zeros_like(rec_query)
        for i, r in enumerate(rec_query):
            valid = prec[rec >= r]
            result[i] = np.max(valid) if len(valid) > 0 else 0
        return result

    prec_a_interp = interpolate_pr(prec_a, rec_a, rec_common)
    prec_b_interp = interpolate_pr(prec_b, rec_b, rec_common)

    diff = prec_a_interp - prec_b_interp
    a_better = np.sum(diff > 1e-6)
    b_better = np.sum(diff < -1e-6)

    if a_better > 0 and b_better == 0:
        return 'A_dominates'
    elif b_better > 0 and a_better == 0:
        return 'B_dominates'
    elif a_better == 0 and b_better == 0:
        return 'equal'
    else:
        return 'neither'


# Example usage
np.random.seed(42)
n = 500

# Generate data where neither model dominates
signal = np.random.randn(n)
labels = (signal > 0).astype(int)

scores_a = signal + np.random.randn(n) * 0.8                                   # Model A
scores_b = 0.3 * signal + 0.7 * (signal ** 3 / 3) + np.random.randn(n) * 0.6   # Model B (non-linear)

fpr_a, tpr_a, _ = roc_curve(labels, scores_a)
fpr_b, tpr_b, _ = roc_curve(labels, scores_b)

print(f"ROC Dominance: {check_roc_dominance(fpr_a, tpr_a, fpr_b, tpr_b)}")
```

Effective visualization is often the first and most informative step in curve comparison. Let's explore best practices.
The standard approach: plot multiple curves on the same axes.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import (roc_curve, precision_recall_curve,
                             roc_auc_score, average_precision_score)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier


def create_comparison_visualization(X, y, models_dict, figsize=(14, 5)):
    """
    Create publication-quality ROC and PR curve comparisons.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42
    )

    fig, axes = plt.subplots(1, 2, figsize=figsize)

    # Color palette for accessibility
    colors = plt.cm.tab10(np.linspace(0, 1, len(models_dict)))

    results = {}
    for (name, model), color in zip(models_dict.items(), colors):
        # Train and predict
        model.fit(X_train, y_train)

        # Get probability scores
        if hasattr(model, 'predict_proba'):
            probs = model.predict_proba(X_test)[:, 1]
        else:
            probs = model.decision_function(X_test)

        # Compute curves
        fpr, tpr, _ = roc_curve(y_test, probs)
        prec, rec, _ = precision_recall_curve(y_test, probs)

        # Compute metrics
        auc = roc_auc_score(y_test, probs)
        ap = average_precision_score(y_test, probs)
        results[name] = {'auc': auc, 'ap': ap}

        # Plot ROC
        axes[0].plot(fpr, tpr, color=color, linewidth=2,
                     label=f'{name} (AUC={auc:.3f})')

        # Plot PR
        axes[1].plot(rec, prec, color=color, linewidth=2,
                     label=f'{name} (AP={ap:.3f})')

    # ROC formatting
    axes[0].plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random')
    axes[0].set_xlabel('False Positive Rate', fontsize=12)
    axes[0].set_ylabel('True Positive Rate', fontsize=12)
    axes[0].set_title('ROC Curve Comparison', fontsize=14, fontweight='bold')
    axes[0].legend(loc='lower right', fontsize=9)
    axes[0].set_xlim([-0.02, 1.02])
    axes[0].set_ylim([-0.02, 1.02])
    axes[0].grid(alpha=0.3)
    axes[0].set_aspect('equal')

    # PR formatting
    base_rate = np.mean(y_test)
    axes[1].axhline(y=base_rate, color='gray', linestyle='--', linewidth=1,
                    label=f'Random (base={base_rate:.2f})')
    axes[1].set_xlabel('Recall', fontsize=12)
    axes[1].set_ylabel('Precision', fontsize=12)
    axes[1].set_title('Precision-Recall Curve Comparison', fontsize=14, fontweight='bold')
    axes[1].legend(loc='lower left', fontsize=9)
    axes[1].set_xlim([-0.02, 1.02])
    axes[1].set_ylim([-0.02, 1.02])
    axes[1].grid(alpha=0.3)
    axes[1].set_aspect('equal')

    plt.tight_layout()
    return fig, results


# Example usage
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
}

fig, results = create_comparison_visualization(X, y, models)
plt.savefig('curve_comparison.png', dpi=150, bbox_inches='tight')
plt.show()
```

When curves are close, overlaid plots can be hard to interpret. Difference plots show the gap between curves explicitly:
```python
def plot_curve_difference(fpr_a, tpr_a, fpr_b, tpr_b,
                          name_a='Model A', name_b='Model B'):
    """
    Plot the TPR difference between two ROC curves.
    Positive values = A is better; negative = B is better.
    """
    # Common FPR points
    fpr_common = np.linspace(0, 1, 200)
    tpr_diff = np.interp(fpr_common, fpr_a, tpr_a) - np.interp(fpr_common, fpr_b, tpr_b)

    fig, axes = plt.subplots(1, 2, figsize=(12, 4))

    # Overlaid curves
    axes[0].plot(fpr_a, tpr_a, 'b-', linewidth=2, label=name_a)
    axes[0].plot(fpr_b, tpr_b, 'r-', linewidth=2, label=name_b)
    axes[0].plot([0, 1], [0, 1], 'k--')
    axes[0].legend()
    axes[0].set_xlabel('FPR')
    axes[0].set_ylabel('TPR')
    axes[0].set_title('ROC Curves')
    axes[0].grid(alpha=0.3)

    # Difference plot
    axes[1].fill_between(fpr_common, 0, tpr_diff, where=(tpr_diff >= 0),
                         color='blue', alpha=0.3, label=f'{name_a} better')
    axes[1].fill_between(fpr_common, 0, tpr_diff, where=(tpr_diff < 0),
                         color='red', alpha=0.3, label=f'{name_b} better')
    axes[1].axhline(y=0, color='black', linewidth=1)
    axes[1].set_xlabel('FPR')
    axes[1].set_ylabel('TPR Difference')
    axes[1].set_title(f'TPR({name_a}) - TPR({name_b})')
    axes[1].legend()
    axes[1].grid(alpha=0.3)

    plt.tight_layout()
    return fig
```

Visual differences might be due to chance. Statistical tests help determine if observed differences are significant.
The DeLong test (DeLong et al., 1988) is the gold standard for comparing correlated AUCs. We covered the theory in the AUC page; here's the practical application.
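As a reference, here is a minimal sketch of the DeLong test itself; the function `delong_test` and its return format are illustrative rather than taken from any library:

```python
import numpy as np
from scipy import stats


def delong_test(scores_a, scores_b, labels):
    """
    Two-sided DeLong test for the difference between two correlated AUCs.
    Illustrative sketch; memory use is O(n_pos * n_neg).
    """
    labels = np.asarray(labels).astype(bool)
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    pos_a, neg_a = scores_a[labels], scores_a[~labels]
    pos_b, neg_b = scores_b[labels], scores_b[~labels]
    m, n = len(pos_a), len(neg_a)

    def placements(pos, neg):
        # ind[i, j] = 1 if positive i is ranked above negative j (0.5 for ties)
        diff = pos[:, None] - neg[None, :]
        ind = (diff > 0).astype(float) + 0.5 * (diff == 0)
        return ind.mean(axis=1), ind.mean(axis=0)  # V10 (per positive), V01 (per negative)

    v10_a, v01_a = placements(pos_a, neg_a)
    v10_b, v01_b = placements(pos_b, neg_b)
    auc_a, auc_b = v10_a.mean(), v10_b.mean()

    # Covariance of placement values across the two models (DeLong et al., 1988)
    s10 = np.cov(np.vstack([v10_a, v10_b]))
    s01 = np.cov(np.vstack([v01_a, v01_b]))
    var_diff = (s10[0, 0] + s10[1, 1] - 2 * s10[0, 1]) / m \
             + (s01[0, 0] + s01[1, 1] - 2 * s01[0, 1]) / n

    z = (auc_a - auc_b) / np.sqrt(var_diff)
    p_value = 2 * stats.norm.sf(abs(z))
    return {'auc_a': auc_a, 'auc_b': auc_b, 'z': z, 'p_value': p_value}
```

The bootstrap comparison below is a useful complement: the same resampling yields confidence intervals and p-values for Average Precision as well as AUC.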
```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.utils import resample


def comprehensive_comparison(scores_a, scores_b, labels, n_bootstrap=2000, alpha=0.05):
    """
    Comprehensive statistical comparison between two classifiers.
    Returns significance tests for both AUC and AP.
    """
    # Point estimates
    auc_a = roc_auc_score(labels, scores_a)
    auc_b = roc_auc_score(labels, scores_b)
    ap_a = average_precision_score(labels, scores_a)
    ap_b = average_precision_score(labels, scores_b)

    # Bootstrap for confidence intervals and p-values
    auc_diffs = []
    ap_diffs = []

    for _ in range(n_bootstrap):
        idx = resample(np.arange(len(labels)), stratify=labels)

        auc_a_boot = roc_auc_score(labels[idx], scores_a[idx])
        auc_b_boot = roc_auc_score(labels[idx], scores_b[idx])
        ap_a_boot = average_precision_score(labels[idx], scores_a[idx])
        ap_b_boot = average_precision_score(labels[idx], scores_b[idx])

        auc_diffs.append(auc_a_boot - auc_b_boot)
        ap_diffs.append(ap_a_boot - ap_b_boot)

    auc_diffs = np.array(auc_diffs)
    ap_diffs = np.array(ap_diffs)

    # Compute confidence intervals
    def ci(diffs):
        return np.percentile(diffs, [2.5, 97.5])

    # Compute p-values (proportion of bootstrap samples on opposite side of 0)
    def pvalue(diffs, observed):
        if observed >= 0:
            return 2 * np.mean(diffs <= 0)
        else:
            return 2 * np.mean(diffs >= 0)

    auc_ci = ci(auc_diffs)
    ap_ci = ci(ap_diffs)
    auc_pvalue = min(1.0, pvalue(auc_diffs, auc_a - auc_b))
    ap_pvalue = min(1.0, pvalue(ap_diffs, ap_a - ap_b))

    results = {
        'auc_a': auc_a, 'auc_b': auc_b,
        'auc_diff': auc_a - auc_b,
        'auc_ci': auc_ci,
        'auc_pvalue': auc_pvalue,
        'auc_significant': auc_pvalue < alpha,
        'ap_a': ap_a, 'ap_b': ap_b,
        'ap_diff': ap_a - ap_b,
        'ap_ci': ap_ci,
        'ap_pvalue': ap_pvalue,
        'ap_significant': ap_pvalue < alpha,
    }
    return results


def print_comparison_report(results, name_a='Model A', name_b='Model B'):
    """Pretty-print comparison results."""
    print("=" * 60)
    print(f"CLASSIFIER COMPARISON: {name_a} vs {name_b}")
    print("=" * 60)
    print(f"\n{'Metric':<15} | {name_a:<10} | {name_b:<10} | {'Diff':>8} | {'p-value':>8}")
    print("-" * 60)
    print(f"{'AUC-ROC':<15} | {results['auc_a']:<10.4f} | {results['auc_b']:<10.4f} | "
          f"{results['auc_diff']:>+8.4f} | {results['auc_pvalue']:>8.4f}"
          f"{'*' if results['auc_significant'] else ''}")
    print(f"{'Avg Precision':<15} | {results['ap_a']:<10.4f} | {results['ap_b']:<10.4f} | "
          f"{results['ap_diff']:>+8.4f} | {results['ap_pvalue']:>8.4f}"
          f"{'*' if results['ap_significant'] else ''}")
    print("\n95% Confidence Intervals for differences:")
    print(f"  AUC diff: [{results['auc_ci'][0]:+.4f}, {results['auc_ci'][1]:+.4f}]")
    print(f"  AP diff:  [{results['ap_ci'][0]:+.4f}, {results['ap_ci'][1]:+.4f}]")
    print("\n* indicates p < 0.05")


# Usage example
np.random.seed(42)
n = 800
labels = np.random.binomial(1, 0.15, n)
scores_a = np.where(labels, np.random.beta(4, 2, n), np.random.beta(2, 4, n))
scores_b = np.where(labels, np.random.beta(3.5, 2.5, n), np.random.beta(2.5, 3.5, n))

results = comprehensive_comparison(scores_a, scores_b, labels)
print_comparison_report(results, 'GradBoost', 'LogisticReg')
```

When comparing many models pairwise, apply a multiple comparisons correction (Bonferroni, Holm, or FDR). With 10 models there are 45 pairwise comparisons; at α = 0.05 you'd expect about 2 false positives by chance alone.
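As a sketch of applying such a correction, here is Holm adjustment of a set of pairwise p-values; the p-values below are made-up placeholders, and the example assumes statsmodels is installed:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values from several pairwise AUC comparisons
pairwise_pvalues = [0.012, 0.048, 0.21, 0.003, 0.07]

reject, p_adjusted, _, _ = multipletests(pairwise_pvalues, alpha=0.05, method='holm')
for p_raw, p_adj, sig in zip(pairwise_pvalues, p_adjusted, reject):
    print(f"raw p={p_raw:.3f} -> Holm-adjusted p={p_adj:.3f}"
          f"{' (significant)' if sig else ''}")
```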
When curves cross, neither model dominates. The 'better' model depends on where you operate.
```python
def find_roc_crossovers(fpr_a, tpr_a, fpr_b, tpr_b):
    """
    Find FPR values where ROC curves A and B cross.
    """
    # Interpolate to common grid
    fpr_common = np.linspace(0, 1, 1000)
    tpr_a_interp = np.interp(fpr_common, fpr_a, tpr_a)
    tpr_b_interp = np.interp(fpr_common, fpr_b, tpr_b)

    # Find sign changes
    diff = tpr_a_interp - tpr_b_interp
    sign_changes = np.where(np.diff(np.sign(diff)))[0]

    crossovers = []
    for idx in sign_changes:
        crossover_fpr = (fpr_common[idx] + fpr_common[idx + 1]) / 2
        crossover_tpr = (tpr_a_interp[idx] + tpr_a_interp[idx + 1]) / 2
        crossovers.append({
            'fpr': crossover_fpr,
            'tpr': crossover_tpr,
            'a_better_before': diff[idx] > 0
        })

    return crossovers


def analyze_regions(crossovers, fpr_a, tpr_a, fpr_b, tpr_b, auc_a, auc_b):
    """
    Analyze which model is better in each FPR region.
    """
    boundaries = [0] + [c['fpr'] for c in crossovers] + [1]

    print("Regional Analysis:")
    print("-" * 50)
    for i in range(len(boundaries) - 1):
        fpr_low, fpr_high = boundaries[i], boundaries[i + 1]

        # Sample the midpoint of the region
        mid_fpr = (fpr_low + fpr_high) / 2
        tpr_a_mid = np.interp(mid_fpr, fpr_a, tpr_a)
        tpr_b_mid = np.interp(mid_fpr, fpr_b, tpr_b)

        winner = 'A' if tpr_a_mid > tpr_b_mid else 'B'
        print(f"FPR [{fpr_low:.3f} - {fpr_high:.3f}]: Model {winner} is better")

    print(f"\nOverall AUC: A={auc_a:.4f}, B={auc_b:.4f}")
    print(f"AUC Winner: {'A' if auc_a > auc_b else 'B'}")
```

When curves cross, use this decision framework:
| Your Priority | Choose Model That... | Example Scenario |
|---|---|---|
| Minimize false alarms | Wins at low FPR region | Fraud detection, medical diagnosis |
| Maximize detection | Wins at high TPR/Recall region | Safety-critical systems, spam filtering |
| Balance | Wins around Youden's J point | General classification tasks |
| Operating point unknown | Has higher overall AUC/AP | Building a general-purpose API |
| Different users, different needs | Is most robust across regions | Platform serving varied use cases |
When curves cross, the model with higher AUC might be inferior in YOUR operating region. AUC averages over all thresholds, but you'll use only one threshold. Always verify performance at your specific operating point.
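One practical check, as a sketch: compare partial AUC restricted to your operating region (here FPR ≤ 0.1), reusing `labels`, `scores_a`, and `scores_b` from the earlier synthetic examples. scikit-learn's `roc_auc_score` supports this via the `max_fpr` argument (standardized partial AUC):

```python
from sklearn.metrics import roc_auc_score

# Standardized partial AUC over FPR in [0, 0.1] vs. the full AUC
pauc_a = roc_auc_score(labels, scores_a, max_fpr=0.1)
pauc_b = roc_auc_score(labels, scores_b, max_fpr=0.1)
print(f"Partial AUC (FPR <= 0.1): A={pauc_a:.3f}, B={pauc_b:.3f}")
print(f"Full AUC:                 A={roc_auc_score(labels, scores_a):.3f}, "
      f"B={roc_auc_score(labels, scores_b):.3f}")
```

If the partial-AUC ranking disagrees with the full-AUC ranking, trust the partial one for your deployment region.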
When comparing more than two models, visualization and statistical testing require adapted approaches.
```python
import matplotlib.pyplot as plt
import numpy as np


def plot_multi_model_roc(models_results, highlight_best=True):
    """
    Plot ROC curves for multiple models with confidence bands.

    models_results: dict of {'model_name': {'fpr': [], 'tpr': [], 'auc': float, ...}}
    """
    fig, ax = plt.subplots(figsize=(10, 8))

    # Sort by AUC for legend ordering
    sorted_models = sorted(models_results.items(),
                           key=lambda x: x[1]['auc'], reverse=True)

    colors = plt.cm.tab10(np.linspace(0, 1, len(sorted_models)))

    for i, (name, data) in enumerate(sorted_models):
        color = colors[i]
        linewidth = 3 if highlight_best and i == 0 else 1.5
        alpha = 1.0 if highlight_best and i == 0 else 0.7

        ax.plot(data['fpr'], data['tpr'], color=color,
                linewidth=linewidth, alpha=alpha,
                label=f"{name} (AUC={data['auc']:.3f})")

        # Add confidence band if available
        if 'tpr_lower' in data and 'tpr_upper' in data:
            ax.fill_between(data['fpr'], data['tpr_lower'], data['tpr_upper'],
                            color=color, alpha=0.15)

    ax.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random')
    ax.set_xlabel('False Positive Rate', fontsize=12)
    ax.set_ylabel('True Positive Rate', fontsize=12)
    ax.set_title('ROC Curves: Multi-Model Comparison', fontsize=14)
    ax.legend(loc='lower right', fontsize=9)
    ax.grid(alpha=0.3)
    ax.set_xlim([-0.02, 1.02])
    ax.set_ylim([-0.02, 1.02])

    return fig


def create_ranking_heatmap(models_results, metric='auc'):
    """
    Create a heatmap showing pairwise metric differences.
    """
    import seaborn as sns

    model_names = list(models_results.keys())
    n = len(model_names)

    # Comparison matrix: positive = row model better than column model
    comparison_matrix = np.zeros((n, n))
    for i, m1 in enumerate(model_names):
        for j, m2 in enumerate(model_names):
            if i != j:
                comparison_matrix[i, j] = (models_results[m1][metric]
                                           - models_results[m2][metric])

    # Plot
    fig, ax = plt.subplots(figsize=(8, 6))
    mask = np.eye(n, dtype=bool)
    sns.heatmap(comparison_matrix, xticklabels=model_names, yticklabels=model_names,
                center=0, cmap='RdBu_r', annot=True, fmt='.3f', mask=mask, ax=ax)
    ax.set_title(f'{metric.upper()} Difference Matrix\n(Row - Column)', fontsize=12)
    ax.set_ylabel('Model (higher score)')
    ax.set_xlabel('Model (lower score)')
    plt.tight_layout()
    return fig
```

To establish a robust ranking across many models:
```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.utils import resample


def robust_model_ranking(scores_dict, labels, n_bootstrap=1000):
    """
    Rank models with statistical confidence.

    scores_dict: {'model_name': scores_array}
    Returns ranking with win/tie probabilities.
    """
    model_names = list(scores_dict.keys())

    # Track wins across bootstrap samples
    wins = {m: 0 for m in model_names}
    pairwise_wins = {(m1, m2): 0 for m1 in model_names
                     for m2 in model_names if m1 != m2}

    for _ in range(n_bootstrap):
        idx = resample(np.arange(len(labels)), stratify=labels)

        # Compute AUC for each model on this bootstrap sample
        boot_aucs = {}
        for name, scores in scores_dict.items():
            boot_aucs[name] = roc_auc_score(labels[idx], scores[idx])

        # Find the overall winner on this sample
        best_model = max(boot_aucs, key=boot_aucs.get)
        wins[best_model] += 1

        # Track pairwise wins
        for m1 in model_names:
            for m2 in model_names:
                if m1 != m2 and boot_aucs[m1] > boot_aucs[m2]:
                    pairwise_wins[(m1, m2)] += 1

    # Compute win probabilities
    win_probs = {m: w / n_bootstrap for m, w in wins.items()}

    # Sort by win probability
    ranking = sorted(win_probs.items(), key=lambda x: -x[1])

    print("Model Ranking (by bootstrap win probability):")
    print("-" * 50)
    print(f"{'Rank':<6} | {'Model':<20} | {'Win Prob':<10} | {'AUC':<8}")
    print("-" * 50)
    for rank, (model, prob) in enumerate(ranking, 1):
        auc = roc_auc_score(labels, scores_dict[model])
        print(f"{rank:<6} | {model:<20} | {prob:<10.1%} | {auc:<8.4f}")

    return ranking, pairwise_wins
```

For comparing many models across multiple datasets, consider the Nemenyi test with critical difference diagrams (common in the ML literature). These show which models significantly differ in average rank across datasets.
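A minimal sketch of that procedure, assuming an AUC matrix with one row per dataset and one column per model; the AUC values are made up, and the critical values are the α = 0.05 entries from Demšar (2006), listed here only for small numbers of models:

```python
import numpy as np
from scipy import stats

# q_alpha / sqrt(2) critical values for the Nemenyi test at alpha = 0.05
# (from Demsar 2006; shown only for 2-5 models)
NEMENYI_Q05 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728}


def nemenyi_cd(n_models, n_datasets):
    """Critical difference in average rank for the Nemenyi test (sketch)."""
    return NEMENYI_Q05[n_models] * np.sqrt(n_models * (n_models + 1) / (6 * n_datasets))


# Hypothetical AUCs: rows = datasets, columns = models A, B, C
auc_matrix = np.array([
    [0.91, 0.89, 0.86],
    [0.84, 0.85, 0.80],
    [0.78, 0.74, 0.75],
    [0.88, 0.86, 0.83],
    [0.93, 0.92, 0.90],
])

ranks = stats.rankdata(-auc_matrix, axis=1)   # rank 1 = best AUC on that dataset
avg_ranks = ranks.mean(axis=0)
cd = nemenyi_cd(n_models=3, n_datasets=5)

print(f"Average ranks: {avg_ranks}")
print(f"Critical difference: {cd:.2f}")
# Two models differ significantly if their average ranks differ by more than cd
```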
After selecting a model, you must choose an operating point (threshold). The curve comparison framework informs this choice.
| Method | Formula/Description | When to Use |
|---|---|---|
| Youden's J | Maximize TPR - FPR | Balanced importance of sensitivity and specificity |
| Closest to (0,1) | Minimize √(FPR² + (1-TPR)²) | Similar to Youden's J, geometric interpretation |
| Fixed Sensitivity | Find threshold for TPR ≥ target | When minimum detection rate is mandated |
| Fixed Specificity | Find threshold for TNR ≥ target | When maximum false alarm rate is constrained |
| Fixed Precision | Find threshold for Precision ≥ target | When positive prediction quality is critical |
| Cost Minimization | Minimize C_FP×FP + C_FN×FN | When misclassification costs are known |
| F-score Optimization | Maximize Fβ for chosen β | When precision-recall balance is specified |
```python
import numpy as np
from sklearn.metrics import roc_curve, precision_recall_curve


def find_optimal_threshold(y_true, y_scores, method='youden', **kwargs):
    """
    Find optimal classification threshold using various methods.

    Parameters:
    -----------
    method : str
        'youden', 'closest', 'fixed_tpr', 'fixed_fpr', 'cost', 'f_score'
    kwargs : dict
        Method-specific parameters

    Returns:
    --------
    dict with optimal threshold and operating point metrics
    """
    fpr, tpr, thresholds_roc = roc_curve(y_true, y_scores)
    precision, recall, thresholds_pr = precision_recall_curve(y_true, y_scores)

    if method == 'youden':
        # Maximize TPR - FPR
        j_scores = tpr - fpr
        best_idx = np.argmax(j_scores)
        best_threshold = thresholds_roc[best_idx] if best_idx < len(thresholds_roc) else 0.5

    elif method == 'closest':
        # Closest to (0, 1)
        distances = np.sqrt(fpr**2 + (1 - tpr)**2)
        best_idx = np.argmin(distances)
        best_threshold = thresholds_roc[best_idx] if best_idx < len(thresholds_roc) else 0.5

    elif method == 'fixed_tpr':
        # Minimum threshold for TPR >= target
        target = kwargs.get('target', 0.9)
        valid_idx = np.where(tpr >= target)[0]
        if len(valid_idx) > 0:
            # Take the one with lowest FPR among valid points
            best_idx = valid_idx[np.argmin(fpr[valid_idx])]
            best_threshold = thresholds_roc[best_idx] if best_idx < len(thresholds_roc) else 0.5
        else:
            best_threshold = 0.0  # Accept everything to achieve target

    elif method == 'fixed_fpr':
        # Maximum threshold for FPR <= target
        target = kwargs.get('target', 0.1)
        valid_idx = np.where(fpr <= target)[0]
        if len(valid_idx) > 0:
            # Take the one with highest TPR among valid points
            best_idx = valid_idx[np.argmax(tpr[valid_idx])]
            best_threshold = thresholds_roc[best_idx] if best_idx < len(thresholds_roc) else 0.5
        else:
            best_threshold = 1.0  # Reject everything

    elif method == 'cost':
        # Minimize cost: C_FP * FP + C_FN * FN
        c_fp = kwargs.get('cost_fp', 1)
        c_fn = kwargs.get('cost_fn', 1)
        n_pos = np.sum(y_true)
        n_neg = len(y_true) - n_pos

        # Convert rates to counts
        fp_counts = fpr * n_neg
        fn_counts = (1 - tpr) * n_pos
        costs = c_fp * fp_counts + c_fn * fn_counts
        best_idx = np.argmin(costs)
        best_threshold = thresholds_roc[best_idx] if best_idx < len(thresholds_roc) else 0.5

    elif method == 'f_score':
        # Maximize F-beta score
        beta = kwargs.get('beta', 1.0)
        # F_beta = (1 + beta^2) * (precision * recall) / (beta^2 * precision + recall)
        with np.errstate(divide='ignore', invalid='ignore'):
            f_scores = (1 + beta**2) * (precision * recall) / (beta**2 * precision + recall)
        f_scores = np.nan_to_num(f_scores)
        best_idx = np.argmax(f_scores)
        best_threshold = thresholds_pr[best_idx] if best_idx < len(thresholds_pr) else 0.5

    else:
        raise ValueError(f"Unknown method: {method}")

    # Compute metrics at the chosen threshold
    y_pred = (y_scores >= best_threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))

    return {
        'threshold': best_threshold,
        'tpr': tp / (tp + fn) if (tp + fn) > 0 else 0,
        'fpr': fp / (fp + tn) if (fp + tn) > 0 else 0,
        'precision': tp / (tp + fp) if (tp + fp) > 0 else 0,
        'recall': tp / (tp + fn) if (tp + fn) > 0 else 0,
        'f1': 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) > 0 else 0,
    }


# Example: Compare threshold selection methods
np.random.seed(42)
n = 1000
y_true = np.random.binomial(1, 0.15, n)
y_scores = np.where(y_true, np.random.beta(4, 2, n), np.random.beta(2, 4, n))

print("\nThreshold Selection Method Comparison:")
print("=" * 70)
print(f"{'Method':<20} | {'Threshold':>9} | {'TPR':>6} | {'FPR':>6} | {'Prec':>6} | {'F1':>6}")
print("-" * 70)

for method in ['youden', 'closest', 'fixed_tpr', 'fixed_fpr', 'cost', 'f_score']:
    if method == 'fixed_tpr':
        result = find_optimal_threshold(y_true, y_scores, method, target=0.9)
    elif method == 'fixed_fpr':
        result = find_optimal_threshold(y_true, y_scores, method, target=0.1)
    elif method == 'cost':
        result = find_optimal_threshold(y_true, y_scores, method, cost_fp=1, cost_fn=5)
    else:
        result = find_optimal_threshold(y_true, y_scores, method)

    print(f"{method:<20} | {result['threshold']:>9.4f} | "
          f"{result['tpr']:>6.3f} | {result['fpr']:>6.3f} | "
          f"{result['precision']:>6.3f} | {result['f1']:>6.3f}")
```

Never select thresholds on test data: this leaks information and inflates performance estimates. Use a separate validation set or nested cross-validation for threshold selection.
Let's synthesize everything into a practical, step-by-step workflow for curve comparison.
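As a sketch of what such a workflow can look like in code, reusing the functions defined earlier on this page (`robust_model_ranking`, `comprehensive_comparison`, `print_comparison_report`, `check_roc_dominance`, `find_optimal_threshold`); the orchestration itself is illustrative, not a prescribed recipe:

```python
def compare_and_select(scores_dict, labels, threshold_method='youden', **thr_kwargs):
    """Illustrative end-to-end workflow: rank models, test the top two, pick a threshold."""
    # 1. Rank candidates by bootstrap win probability
    ranking, _ = robust_model_ranking(scores_dict, labels)
    best, runner_up = ranking[0][0], ranking[1][0]

    # 2. Is the top model significantly better than the runner-up?
    stats_results = comprehensive_comparison(scores_dict[best], scores_dict[runner_up], labels)
    print_comparison_report(stats_results, best, runner_up)

    # 3. Do the two ROC curves cross (dominance vs. partial ordering)?
    fpr_a, tpr_a, _ = roc_curve(labels, scores_dict[best])
    fpr_b, tpr_b, _ = roc_curve(labels, scores_dict[runner_up])
    print(f"Dominance: {check_roc_dominance(fpr_a, tpr_a, fpr_b, tpr_b)}")

    # 4. Choose an operating point for the selected model (on validation data, not test!)
    operating_point = find_optimal_threshold(labels, scores_dict[best],
                                             method=threshold_method, **thr_kwargs)
    return best, operating_point
```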
The same test set cannot be used for (1) model comparison, (2) threshold selection, and (3) final performance reporting. Use nested cross-validation or separate holdout sets to prevent optimistic bias in reported results.
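A minimal sketch of that discipline, reusing `X`, `y` from the visualization example and `find_optimal_threshold` from above (the split sizes and the Logistic Regression model are arbitrary choices for illustration):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Three-way split: fit on train, compare models / choose thresholds on validation,
# report final metrics once on the untouched test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

val_scores = model.predict_proba(X_val)[:, 1]
op = find_optimal_threshold(y_val, val_scores, method='youden')   # chosen on validation only

test_scores = model.predict_proba(X_test)[:, 1]
y_pred = (test_scores >= op['threshold']).astype(int)             # reported once on test
```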
We've developed comprehensive methods for comparing ROC and PR curves between classifiers: dominance checks, overlaid and difference plots, bootstrap and DeLong significance tests, crossover analysis for crossing curves, bootstrap-based multi-model ranking, and principled threshold selection.
Module Complete:
You have now mastered ROC and Precision-Recall curves—from construction to summary metrics to rigorous comparison. These tools form the foundation for evaluating any binary classifier, whether in academic research or production ML systems.
Congratulations! You've completed the comprehensive module on ROC and Precision-Recall Curves. You can now construct, interpret, summarize, and rigorously compare classifier evaluation curves—essential skills for any machine learning practitioner.