You've trained several models, computed their ROC and PR curves, and calculated AUC and Average Precision. Now comes the critical question: Which model should you deploy?
This question seems simple when one model dominates everywhere, but reality is rarely so clean: curves cross, differences can be small enough to be statistical noise, and ROC and PR metrics may disagree about which model is best.
Curve comparison is the art and science of extracting actionable insights from these complex situations, turning visualization and statistics into defensible model selection decisions.
By the end of this page, you will master visual comparison techniques for curves, statistical tests for significant differences, strategies for handling crossing curves, multi-model comparison frameworks, and decision guides based on operational requirements.
The ideal scenario for model comparison is when one classifier unambiguously dominates another across all operating points.
Classifier A ROC-dominates classifier B if TPR_A(f) ≥ TPR_B(f) at every false positive rate f in [0, 1], with strict inequality for at least one f.
Interpretation: At every possible false positive rate, A achieves at least as high a true positive rate as B (and strictly better somewhere).
Classifier A PR-dominates classifier B if Precision_A(r) ≥ Precision_B(r) at every achievable recall level r, with strict inequality for at least one r.
Interpretation: At every desired recall level, A achieves at least as high precision as B.
A fundamental result connects ROC and PR dominance:
If classifier A dominates B in ROC space, then A also dominates B in PR space.
For a fixed dataset (fixed numbers of positives and negatives), the converse also holds: dominance in PR space implies dominance in ROC space (Davis & Goadrich, 2006).
Implication: on a given test set, checking dominance in either space is enough. The interesting differences appear when neither curve dominates; class imbalance can then make the two spaces tell very different stories, and AUC and Average Precision may even disagree on which model is better.
When neither classifier dominates, we have a partial ordering:
This is the common case in practice. Curves cross, and the 'better' model depends on the operating region that matters for your application.
```python
import numpy as np
from sklearn.metrics import roc_curve, precision_recall_curve


def check_roc_dominance(fpr_a, tpr_a, fpr_b, tpr_b, n_points=100):
    """
    Check if curve A dominates curve B in ROC space.

    Returns:
    --------
    'A_dominates' : A strictly dominates B
    'B_dominates' : B strictly dominates A
    'neither'     : Curves cross (partial ordering)
    'equal'       : Curves are identical
    """
    # Interpolate both curves at common FPR points
    fpr_common = np.linspace(0, 1, n_points)
    tpr_a_interp = np.interp(fpr_common, fpr_a, tpr_a)
    tpr_b_interp = np.interp(fpr_common, fpr_b, tpr_b)

    diff = tpr_a_interp - tpr_b_interp
    a_better = np.sum(diff > 1e-6)   # A strictly better
    b_better = np.sum(diff < -1e-6)  # B strictly better

    if a_better > 0 and b_better == 0:
        return 'A_dominates'
    elif b_better > 0 and a_better == 0:
        return 'B_dominates'
    elif a_better == 0 and b_better == 0:
        return 'equal'
    else:
        return 'neither'


def check_pr_dominance(prec_a, rec_a, prec_b, rec_b, n_points=100):
    """
    Check if curve A dominates curve B in PR space.
    """
    # Interpolate at common recall points
    rec_common = np.linspace(0, 1, n_points)

    # For PR curves, interpolate carefully (use max precision at recall >= r)
    def interpolate_pr(prec, rec, rec_query):
        result = np.zeros_like(rec_query)
        for i, r in enumerate(rec_query):
            valid = prec[rec >= r]
            result[i] = np.max(valid) if len(valid) > 0 else 0
        return result

    prec_a_interp = interpolate_pr(prec_a, rec_a, rec_common)
    prec_b_interp = interpolate_pr(prec_b, rec_b, rec_common)

    diff = prec_a_interp - prec_b_interp
    a_better = np.sum(diff > 1e-6)
    b_better = np.sum(diff < -1e-6)

    if a_better > 0 and b_better == 0:
        return 'A_dominates'
    elif b_better > 0 and a_better == 0:
        return 'B_dominates'
    elif a_better == 0 and b_better == 0:
        return 'equal'
    else:
        return 'neither'


# Example usage
np.random.seed(42)
n = 500

# Generate data where neither model dominates
signal = np.random.randn(n)
labels = (signal > 0).astype(int)

scores_a = signal + np.random.randn(n) * 0.8                                   # Model A
scores_b = 0.3 * signal + 0.7 * (signal ** 3 / 3) + np.random.randn(n) * 0.6   # Model B (non-linear)

fpr_a, tpr_a, _ = roc_curve(labels, scores_a)
fpr_b, tpr_b, _ = roc_curve(labels, scores_b)

print(f"ROC Dominance: {check_roc_dominance(fpr_a, tpr_a, fpr_b, tpr_b)}")
```

Effective visualization is often the first and most informative step in curve comparison. Let's explore best practices.
The standard approach: plot multiple curves on the same axes.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import (roc_curve, precision_recall_curve,
                             roc_auc_score, average_precision_score)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier


def create_comparison_visualization(X, y, models_dict, figsize=(14, 5)):
    """
    Create publication-quality ROC and PR curve comparisons.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42
    )

    fig, axes = plt.subplots(1, 2, figsize=figsize)

    # Color palette for accessibility
    colors = plt.cm.tab10(np.linspace(0, 1, len(models_dict)))

    results = {}
    for (name, model), color in zip(models_dict.items(), colors):
        # Train and predict
        model.fit(X_train, y_train)

        # Get probability scores
        if hasattr(model, 'predict_proba'):
            probs = model.predict_proba(X_test)[:, 1]
        else:
            probs = model.decision_function(X_test)

        # Compute curves
        fpr, tpr, _ = roc_curve(y_test, probs)
        prec, rec, _ = precision_recall_curve(y_test, probs)

        # Compute metrics
        auc = roc_auc_score(y_test, probs)
        ap = average_precision_score(y_test, probs)
        results[name] = {'auc': auc, 'ap': ap}

        # Plot ROC
        axes[0].plot(fpr, tpr, color=color, linewidth=2,
                     label=f'{name} (AUC={auc:.3f})')

        # Plot PR
        axes[1].plot(rec, prec, color=color, linewidth=2,
                     label=f'{name} (AP={ap:.3f})')

    # ROC formatting
    axes[0].plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random')
    axes[0].set_xlabel('False Positive Rate', fontsize=12)
    axes[0].set_ylabel('True Positive Rate', fontsize=12)
    axes[0].set_title('ROC Curve Comparison', fontsize=14, fontweight='bold')
    axes[0].legend(loc='lower right', fontsize=9)
    axes[0].set_xlim([-0.02, 1.02])
    axes[0].set_ylim([-0.02, 1.02])
    axes[0].grid(alpha=0.3)
    axes[0].set_aspect('equal')

    # PR formatting
    base_rate = np.mean(y_test)
    axes[1].axhline(y=base_rate, color='gray', linestyle='--', linewidth=1,
                    label=f'Random (base={base_rate:.2f})')
    axes[1].set_xlabel('Recall', fontsize=12)
    axes[1].set_ylabel('Precision', fontsize=12)
    axes[1].set_title('Precision-Recall Curve Comparison', fontsize=14, fontweight='bold')
    axes[1].legend(loc='lower left', fontsize=9)
    axes[1].set_xlim([-0.02, 1.02])
    axes[1].set_ylim([-0.02, 1.02])
    axes[1].grid(alpha=0.3)
    axes[1].set_aspect('equal')

    plt.tight_layout()
    return fig, results


# Example usage
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
}

fig, results = create_comparison_visualization(X, y, models)
plt.savefig('curve_comparison.png', dpi=150, bbox_inches='tight')
plt.show()
```

When curves are close, overlaid plots can be hard to interpret. Difference plots show the gap between curves explicitly:
```python
def plot_curve_difference(fpr_a, tpr_a, fpr_b, tpr_b,
                          name_a='Model A', name_b='Model B'):
    """
    Plot the TPR difference between two ROC curves.
    Positive values = A is better; negative = B is better.
    """
    # Common FPR points
    fpr_common = np.linspace(0, 1, 200)
    tpr_diff = np.interp(fpr_common, fpr_a, tpr_a) - np.interp(fpr_common, fpr_b, tpr_b)

    fig, axes = plt.subplots(1, 2, figsize=(12, 4))

    # Overlaid curves
    axes[0].plot(fpr_a, tpr_a, 'b-', linewidth=2, label=name_a)
    axes[0].plot(fpr_b, tpr_b, 'r-', linewidth=2, label=name_b)
    axes[0].plot([0, 1], [0, 1], 'k--')
    axes[0].legend()
    axes[0].set_xlabel('FPR')
    axes[0].set_ylabel('TPR')
    axes[0].set_title('ROC Curves')
    axes[0].grid(alpha=0.3)

    # Difference plot
    axes[1].fill_between(fpr_common, 0, tpr_diff, where=(tpr_diff >= 0),
                         color='blue', alpha=0.3, label=f'{name_a} better')
    axes[1].fill_between(fpr_common, 0, tpr_diff, where=(tpr_diff < 0),
                         color='red', alpha=0.3, label=f'{name_b} better')
    axes[1].axhline(y=0, color='black', linewidth=1)
    axes[1].set_xlabel('FPR')
    axes[1].set_ylabel('TPR Difference')
    axes[1].set_title(f'TPR({name_a}) - TPR({name_b})')
    axes[1].legend()
    axes[1].grid(alpha=0.3)

    plt.tight_layout()
    return fig
```

Visual differences might be due to chance. Statistical tests help determine if observed differences are significant.
The DeLong test (DeLong et al., 1988) is the gold standard for comparing correlated AUCs. We covered the theory in the AUC page; here's the practical application.
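As a reference, here is a minimal sketch of the DeLong test itself; the function `delong_test` and its return format are illustrative rather than taken from any library:

```python
import numpy as np
from scipy import stats


def delong_test(scores_a, scores_b, labels):
    """
    Two-sided DeLong test for the difference between two correlated AUCs.
    Illustrative sketch; memory use is O(n_pos * n_neg).
    """
    labels = np.asarray(labels).astype(bool)
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    pos_a, neg_a = scores_a[labels], scores_a[~labels]
    pos_b, neg_b = scores_b[labels], scores_b[~labels]
    m, n = len(pos_a), len(neg_a)

    def placements(pos, neg):
        # ind[i, j] = 1 if positive i is ranked above negative j (0.5 for ties)
        diff = pos[:, None] - neg[None, :]
        ind = (diff > 0).astype(float) + 0.5 * (diff == 0)
        return ind.mean(axis=1), ind.mean(axis=0)  # V10 (per positive), V01 (per negative)

    v10_a, v01_a = placements(pos_a, neg_a)
    v10_b, v01_b = placements(pos_b, neg_b)
    auc_a, auc_b = v10_a.mean(), v10_b.mean()

    # Covariance of placement values across the two models (DeLong et al., 1988)
    s10 = np.cov(np.vstack([v10_a, v10_b]))
    s01 = np.cov(np.vstack([v01_a, v01_b]))
    var_diff = (s10[0, 0] + s10[1, 1] - 2 * s10[0, 1]) / m \
             + (s01[0, 0] + s01[1, 1] - 2 * s01[0, 1]) / n

    z = (auc_a - auc_b) / np.sqrt(var_diff)
    p_value = 2 * stats.norm.sf(abs(z))
    return {'auc_a': auc_a, 'auc_b': auc_b, 'z': z, 'p_value': p_value}
```

The bootstrap comparison below is a useful complement: the same resampling yields confidence intervals and p-values for Average Precision as well as AUC.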
```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.utils import resample


def comprehensive_comparison(scores_a, scores_b, labels, n_bootstrap=2000, alpha=0.05):
    """
    Comprehensive statistical comparison between two classifiers.
    Returns significance tests for both AUC and AP.
    """
    # Point estimates
    auc_a = roc_auc_score(labels, scores_a)
    auc_b = roc_auc_score(labels, scores_b)
    ap_a = average_precision_score(labels, scores_a)
    ap_b = average_precision_score(labels, scores_b)

    # Bootstrap for confidence intervals and p-values
    auc_diffs = []
    ap_diffs = []

    for _ in range(n_bootstrap):
        idx = resample(np.arange(len(labels)), stratify=labels)

        auc_a_boot = roc_auc_score(labels[idx], scores_a[idx])
        auc_b_boot = roc_auc_score(labels[idx], scores_b[idx])
        ap_a_boot = average_precision_score(labels[idx], scores_a[idx])
        ap_b_boot = average_precision_score(labels[idx], scores_b[idx])

        auc_diffs.append(auc_a_boot - auc_b_boot)
        ap_diffs.append(ap_a_boot - ap_b_boot)

    auc_diffs = np.array(auc_diffs)
    ap_diffs = np.array(ap_diffs)

    # Compute confidence intervals
    def ci(diffs):
        return np.percentile(diffs, [2.5, 97.5])

    # Compute p-values (proportion of bootstrap samples on opposite side of 0)
    def pvalue(diffs, observed):
        if observed >= 0:
            return 2 * np.mean(diffs <= 0)
        else:
            return 2 * np.mean(diffs >= 0)

    auc_ci = ci(auc_diffs)
    ap_ci = ci(ap_diffs)
    auc_pvalue = min(1.0, pvalue(auc_diffs, auc_a - auc_b))
    ap_pvalue = min(1.0, pvalue(ap_diffs, ap_a - ap_b))

    results = {
        'auc_a': auc_a, 'auc_b': auc_b,
        'auc_diff': auc_a - auc_b,
        'auc_ci': auc_ci,
        'auc_pvalue': auc_pvalue,
        'auc_significant': auc_pvalue < alpha,
        'ap_a': ap_a, 'ap_b': ap_b,
        'ap_diff': ap_a - ap_b,
        'ap_ci': ap_ci,
        'ap_pvalue': ap_pvalue,
        'ap_significant': ap_pvalue < alpha,
    }
    return results


def print_comparison_report(results, name_a='Model A', name_b='Model B'):
    """Pretty-print comparison results."""
    print("=" * 60)
    print(f"CLASSIFIER COMPARISON: {name_a} vs {name_b}")
    print("=" * 60)
    print(f"\n{'Metric':<15} | {name_a:<10} | {name_b:<10} | {'Diff':>8} | {'p-value':>8}")
    print("-" * 60)
    print(f"{'AUC-ROC':<15} | {results['auc_a']:<10.4f} | {results['auc_b']:<10.4f} | "
          f"{results['auc_diff']:>+8.4f} | {results['auc_pvalue']:>8.4f}"
          f"{'*' if results['auc_significant'] else ''}")
    print(f"{'Avg Precision':<15} | {results['ap_a']:<10.4f} | {results['ap_b']:<10.4f} | "
          f"{results['ap_diff']:>+8.4f} | {results['ap_pvalue']:>8.4f}"
          f"{'*' if results['ap_significant'] else ''}")
    print("\n95% Confidence Intervals for differences:")
    print(f"  AUC diff: [{results['auc_ci'][0]:+.4f}, {results['auc_ci'][1]:+.4f}]")
    print(f"  AP diff:  [{results['ap_ci'][0]:+.4f}, {results['ap_ci'][1]:+.4f}]")
    print("\n* indicates p < 0.05")


# Usage example
np.random.seed(42)
n = 800
labels = np.random.binomial(1, 0.15, n)
scores_a = np.where(labels, np.random.beta(4, 2, n), np.random.beta(2, 4, n))
scores_b = np.where(labels, np.random.beta(3.5, 2.5, n), np.random.beta(2.5, 3.5, n))

results = comprehensive_comparison(scores_a, scores_b, labels)
print_comparison_report(results, 'GradBoost', 'LogisticReg')
```

When comparing many models pairwise, apply a multiple comparisons correction (Bonferroni, Holm, or FDR). With 10 models there are 45 pairwise comparisons; at α = 0.05 you'd expect about 2 false positives by chance alone.
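As a sketch of applying such a correction, here is Holm adjustment of a set of pairwise p-values; the p-values below are made-up placeholders, and the example assumes statsmodels is installed:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values from several pairwise AUC comparisons
pairwise_pvalues = [0.012, 0.048, 0.21, 0.003, 0.07]

reject, p_adjusted, _, _ = multipletests(pairwise_pvalues, alpha=0.05, method='holm')
for p_raw, p_adj, sig in zip(pairwise_pvalues, p_adjusted, reject):
    print(f"raw p={p_raw:.3f} -> Holm-adjusted p={p_adj:.3f}"
          f"{' (significant)' if sig else ''}")
```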
When curves cross, neither model dominates. The 'better' model depends on where you operate.
```python
def find_roc_crossovers(fpr_a, tpr_a, fpr_b, tpr_b):
    """
    Find FPR values where ROC curves A and B cross.
    """
    # Interpolate to common grid
    fpr_common = np.linspace(0, 1, 1000)
    tpr_a_interp = np.interp(fpr_common, fpr_a, tpr_a)
    tpr_b_interp = np.interp(fpr_common, fpr_b, tpr_b)

    # Find sign changes
    diff = tpr_a_interp - tpr_b_interp
    sign_changes = np.where(np.diff(np.sign(diff)))[0]

    crossovers = []
    for idx in sign_changes:
        crossover_fpr = (fpr_common[idx] + fpr_common[idx + 1]) / 2
        crossover_tpr = (tpr_a_interp[idx] + tpr_a_interp[idx + 1]) / 2
        crossovers.append({
            'fpr': crossover_fpr,
            'tpr': crossover_tpr,
            'a_better_before': diff[idx] > 0
        })

    return crossovers


def analyze_regions(crossovers, fpr_a, tpr_a, fpr_b, tpr_b, auc_a, auc_b):
    """
    Analyze which model is better in each FPR region.
    """
    boundaries = [0] + [c['fpr'] for c in crossovers] + [1]

    print("Regional Analysis:")
    print("-" * 50)
    for i in range(len(boundaries) - 1):
        fpr_low, fpr_high = boundaries[i], boundaries[i + 1]

        # Sample the midpoint of the region
        mid_fpr = (fpr_low + fpr_high) / 2
        tpr_a_mid = np.interp(mid_fpr, fpr_a, tpr_a)
        tpr_b_mid = np.interp(mid_fpr, fpr_b, tpr_b)

        winner = 'A' if tpr_a_mid > tpr_b_mid else 'B'
        print(f"FPR [{fpr_low:.3f} - {fpr_high:.3f}]: Model {winner} is better")

    print(f"\nOverall AUC: A={auc_a:.4f}, B={auc_b:.4f}")
    print(f"AUC Winner: {'A' if auc_a > auc_b else 'B'}")
```

When curves cross, use this decision framework:
| Your Priority | Choose Model That... | Example Scenario |
|---|---|---|
| Minimize false alarms | Wins at low FPR region | Fraud detection, medical diagnosis |
| Maximize detection | Wins at high TPR/Recall region | Safety-critical systems, spam filtering |
| Balance | Wins around Youden's J point | General classification tasks |
| Operating point unknown | Has higher overall AUC/AP | Building a general-purpose API |
| Different users, different needs | Is most robust across regions | Platform serving varied use cases |
When curves cross, the model with higher AUC might be inferior in YOUR operating region. AUC averages over all thresholds, but you'll use only one threshold. Always verify performance at your specific operating point.
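One practical check, as a sketch: compare partial AUC restricted to your operating region (here FPR ≤ 0.1), reusing `labels`, `scores_a`, and `scores_b` from the earlier synthetic examples. scikit-learn's `roc_auc_score` supports this via the `max_fpr` argument (standardized partial AUC):

```python
from sklearn.metrics import roc_auc_score

# Standardized partial AUC over FPR in [0, 0.1] vs. the full AUC
pauc_a = roc_auc_score(labels, scores_a, max_fpr=0.1)
pauc_b = roc_auc_score(labels, scores_b, max_fpr=0.1)
print(f"Partial AUC (FPR <= 0.1): A={pauc_a:.3f}, B={pauc_b:.3f}")
print(f"Full AUC:                 A={roc_auc_score(labels, scores_a):.3f}, "
      f"B={roc_auc_score(labels, scores_b):.3f}")
```

If the partial-AUC ranking disagrees with the full-AUC ranking, trust the partial one for your deployment region.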
When comparing more than two models, visualization and statistical testing require adapted approaches.
```python
import matplotlib.pyplot as plt
import numpy as np


def plot_multi_model_roc(models_results, highlight_best=True):
    """
    Plot ROC curves for multiple models with confidence bands.

    models_results: dict of {'model_name': {'fpr': [], 'tpr': [], 'auc': float, ...}}
    """
    fig, ax = plt.subplots(figsize=(10, 8))

    # Sort by AUC for legend ordering
    sorted_models = sorted(models_results.items(),
                           key=lambda x: x[1]['auc'], reverse=True)

    colors = plt.cm.tab10(np.linspace(0, 1, len(sorted_models)))

    for i, (name, data) in enumerate(sorted_models):
        color = colors[i]
        linewidth = 3 if highlight_best and i == 0 else 1.5
        alpha = 1.0 if highlight_best and i == 0 else 0.7

        ax.plot(data['fpr'], data['tpr'], color=color,
                linewidth=linewidth, alpha=alpha,
                label=f"{name} (AUC={data['auc']:.3f})")

        # Add confidence band if available
        if 'tpr_lower' in data and 'tpr_upper' in data:
            ax.fill_between(data['fpr'], data['tpr_lower'], data['tpr_upper'],
                            color=color, alpha=0.15)

    ax.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random')
    ax.set_xlabel('False Positive Rate', fontsize=12)
    ax.set_ylabel('True Positive Rate', fontsize=12)
    ax.set_title('ROC Curves: Multi-Model Comparison', fontsize=14)
    ax.legend(loc='lower right', fontsize=9)
    ax.grid(alpha=0.3)
    ax.set_xlim([-0.02, 1.02])
    ax.set_ylim([-0.02, 1.02])

    return fig


def create_ranking_heatmap(models_results, metric='auc'):
    """
    Create a heatmap showing pairwise metric differences.
    """
    import seaborn as sns

    model_names = list(models_results.keys())
    n = len(model_names)

    # Comparison matrix: positive = row model better than column model
    comparison_matrix = np.zeros((n, n))
    for i, m1 in enumerate(model_names):
        for j, m2 in enumerate(model_names):
            if i != j:
                comparison_matrix[i, j] = (models_results[m1][metric]
                                           - models_results[m2][metric])

    # Plot
    fig, ax = plt.subplots(figsize=(8, 6))
    mask = np.eye(n, dtype=bool)
    sns.heatmap(comparison_matrix, xticklabels=model_names, yticklabels=model_names,
                center=0, cmap='RdBu_r', annot=True, fmt='.3f', mask=mask, ax=ax)
    ax.set_title(f'{metric.upper()} Difference Matrix\n(Row - Column)', fontsize=12)
    ax.set_ylabel('Model (higher score)')
    ax.set_xlabel('Model (lower score)')
    plt.tight_layout()
    return fig
```

To establish a robust ranking across many models:
```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.utils import resample


def robust_model_ranking(scores_dict, labels, n_bootstrap=1000):
    """
    Rank models with statistical confidence.

    scores_dict: {'model_name': scores_array}
    Returns ranking with win/tie probabilities.
    """
    model_names = list(scores_dict.keys())

    # Track wins across bootstrap samples
    wins = {m: 0 for m in model_names}
    pairwise_wins = {(m1, m2): 0 for m1 in model_names
                     for m2 in model_names if m1 != m2}

    for _ in range(n_bootstrap):
        idx = resample(np.arange(len(labels)), stratify=labels)

        # Compute AUC for each model on this bootstrap sample
        boot_aucs = {}
        for name, scores in scores_dict.items():
            boot_aucs[name] = roc_auc_score(labels[idx], scores[idx])

        # Find the overall winner on this sample
        best_model = max(boot_aucs, key=boot_aucs.get)
        wins[best_model] += 1

        # Track pairwise wins
        for m1 in model_names:
            for m2 in model_names:
                if m1 != m2 and boot_aucs[m1] > boot_aucs[m2]:
                    pairwise_wins[(m1, m2)] += 1

    # Compute win probabilities
    win_probs = {m: w / n_bootstrap for m, w in wins.items()}

    # Sort by win probability
    ranking = sorted(win_probs.items(), key=lambda x: -x[1])

    print("Model Ranking (by bootstrap win probability):")
    print("-" * 50)
    print(f"{'Rank':<6} | {'Model':<20} | {'Win Prob':<10} | {'AUC':<8}")
    print("-" * 50)
    for rank, (model, prob) in enumerate(ranking, 1):
        auc = roc_auc_score(labels, scores_dict[model])
        print(f"{rank:<6} | {model:<20} | {prob:<10.1%} | {auc:<8.4f}")

    return ranking, pairwise_wins
```

For comparing many models across multiple datasets, consider the Nemenyi test with critical difference diagrams (common in the ML literature). These show which models significantly differ in average rank across datasets.
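A minimal sketch of that procedure, assuming an AUC matrix with one row per dataset and one column per model; the AUC values are made up, and the critical values are the α = 0.05 entries from Demšar (2006), listed here only for small numbers of models:

```python
import numpy as np
from scipy import stats

# q_alpha / sqrt(2) critical values for the Nemenyi test at alpha = 0.05
# (from Demsar 2006; shown only for 2-5 models)
NEMENYI_Q05 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728}


def nemenyi_cd(n_models, n_datasets):
    """Critical difference in average rank for the Nemenyi test (sketch)."""
    return NEMENYI_Q05[n_models] * np.sqrt(n_models * (n_models + 1) / (6 * n_datasets))


# Hypothetical AUCs: rows = datasets, columns = models A, B, C
auc_matrix = np.array([
    [0.91, 0.89, 0.86],
    [0.84, 0.85, 0.80],
    [0.78, 0.74, 0.75],
    [0.88, 0.86, 0.83],
    [0.93, 0.92, 0.90],
])

ranks = stats.rankdata(-auc_matrix, axis=1)   # rank 1 = best AUC on that dataset
avg_ranks = ranks.mean(axis=0)
cd = nemenyi_cd(n_models=3, n_datasets=5)

print(f"Average ranks: {avg_ranks}")
print(f"Critical difference: {cd:.2f}")
# Two models differ significantly if their average ranks differ by more than cd
```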
After selecting a model, you must choose an operating point (threshold). The curve comparison framework informs this choice.
| Method | Formula/Description | When to Use |
|---|---|---|
| Youden's J | Maximize TPR - FPR | Balanced importance of sensitivity and specificity |
| Closest to (0,1) | Minimize √(FPR² + (1-TPR)²) | Similar to Youden's J, geometric interpretation |
| Fixed Sensitivity | Find threshold for TPR ≥ target | When minimum detection rate is mandated |
| Fixed Specificity | Find threshold for TNR ≥ target | When maximum false alarm rate is constrained |
| Fixed Precision | Find threshold for Precision ≥ target | When positive prediction quality is critical |
| Cost Minimization | Minimize C_FP×FP + C_FN×FN | When misclassification costs are known |
| F-score Optimization | Maximize Fβ for chosen β | When precision-recall balance is specified |
```python
import numpy as np
from sklearn.metrics import roc_curve, precision_recall_curve


def find_optimal_threshold(y_true, y_scores, method='youden', **kwargs):
    """
    Find optimal classification threshold using various methods.

    Parameters:
    -----------
    method : str
        'youden', 'closest', 'fixed_tpr', 'fixed_fpr', 'cost', 'f_score'
    kwargs : dict
        Method-specific parameters

    Returns:
    --------
    dict with optimal threshold and operating point metrics
    """
    fpr, tpr, thresholds_roc = roc_curve(y_true, y_scores)
    precision, recall, thresholds_pr = precision_recall_curve(y_true, y_scores)

    if method == 'youden':
        # Maximize TPR - FPR
        j_scores = tpr - fpr
        best_idx = np.argmax(j_scores)
        best_threshold = thresholds_roc[best_idx] if best_idx < len(thresholds_roc) else 0.5

    elif method == 'closest':
        # Closest to (0, 1)
        distances = np.sqrt(fpr**2 + (1 - tpr)**2)
        best_idx = np.argmin(distances)
        best_threshold = thresholds_roc[best_idx] if best_idx < len(thresholds_roc) else 0.5

    elif method == 'fixed_tpr':
        # Minimum threshold for TPR >= target
        target = kwargs.get('target', 0.9)
        valid_idx = np.where(tpr >= target)[0]
        if len(valid_idx) > 0:
            # Take the one with lowest FPR among valid points
            best_idx = valid_idx[np.argmin(fpr[valid_idx])]
            best_threshold = thresholds_roc[best_idx] if best_idx < len(thresholds_roc) else 0.5
        else:
            best_threshold = 0.0  # Accept everything to achieve target

    elif method == 'fixed_fpr':
        # Maximum threshold for FPR <= target
        target = kwargs.get('target', 0.1)
        valid_idx = np.where(fpr <= target)[0]
        if len(valid_idx) > 0:
            # Take the one with highest TPR among valid points
            best_idx = valid_idx[np.argmax(tpr[valid_idx])]
            best_threshold = thresholds_roc[best_idx] if best_idx < len(thresholds_roc) else 0.5
        else:
            best_threshold = 1.0  # Reject everything

    elif method == 'cost':
        # Minimize cost: C_FP * FP + C_FN * FN
        c_fp = kwargs.get('cost_fp', 1)
        c_fn = kwargs.get('cost_fn', 1)
        n_pos = np.sum(y_true)
        n_neg = len(y_true) - n_pos

        # Convert rates to counts
        fp_counts = fpr * n_neg
        fn_counts = (1 - tpr) * n_pos
        costs = c_fp * fp_counts + c_fn * fn_counts
        best_idx = np.argmin(costs)
        best_threshold = thresholds_roc[best_idx] if best_idx < len(thresholds_roc) else 0.5

    elif method == 'f_score':
        # Maximize F-beta score
        beta = kwargs.get('beta', 1.0)
        # F_beta = (1 + beta^2) * (precision * recall) / (beta^2 * precision + recall)
        with np.errstate(divide='ignore', invalid='ignore'):
            f_scores = (1 + beta**2) * (precision * recall) / (beta**2 * precision + recall)
        f_scores = np.nan_to_num(f_scores)
        best_idx = np.argmax(f_scores)
        best_threshold = thresholds_pr[best_idx] if best_idx < len(thresholds_pr) else 0.5

    else:
        raise ValueError(f"Unknown method: {method}")

    # Compute metrics at the chosen threshold
    y_pred = (y_scores >= best_threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))

    return {
        'threshold': best_threshold,
        'tpr': tp / (tp + fn) if (tp + fn) > 0 else 0,
        'fpr': fp / (fp + tn) if (fp + tn) > 0 else 0,
        'precision': tp / (tp + fp) if (tp + fp) > 0 else 0,
        'recall': tp / (tp + fn) if (tp + fn) > 0 else 0,
        'f1': 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) > 0 else 0,
    }


# Example: Compare threshold selection methods
np.random.seed(42)
n = 1000
y_true = np.random.binomial(1, 0.15, n)
y_scores = np.where(y_true, np.random.beta(4, 2, n), np.random.beta(2, 4, n))

print("\nThreshold Selection Method Comparison:")
print("=" * 70)
print(f"{'Method':<20} | {'Threshold':>9} | {'TPR':>6} | {'FPR':>6} | {'Prec':>6} | {'F1':>6}")
print("-" * 70)

for method in ['youden', 'closest', 'fixed_tpr', 'fixed_fpr', 'cost', 'f_score']:
    if method == 'fixed_tpr':
        result = find_optimal_threshold(y_true, y_scores, method, target=0.9)
    elif method == 'fixed_fpr':
        result = find_optimal_threshold(y_true, y_scores, method, target=0.1)
    elif method == 'cost':
        result = find_optimal_threshold(y_true, y_scores, method, cost_fp=1, cost_fn=5)
    else:
        result = find_optimal_threshold(y_true, y_scores, method)

    print(f"{method:<20} | {result['threshold']:>9.4f} | "
          f"{result['tpr']:>6.3f} | {result['fpr']:>6.3f} | "
          f"{result['precision']:>6.3f} | {result['f1']:>6.3f}")
```

Never select thresholds on test data: this leaks information and inflates performance estimates. Use a separate validation set or nested cross-validation for threshold selection.
Let's synthesize everything into a practical, step-by-step workflow for curve comparison.
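As a sketch of what such a workflow can look like in code, reusing the functions defined earlier on this page (`robust_model_ranking`, `comprehensive_comparison`, `print_comparison_report`, `check_roc_dominance`, `find_optimal_threshold`); the orchestration itself is illustrative, not a prescribed recipe:

```python
def compare_and_select(scores_dict, labels, threshold_method='youden', **thr_kwargs):
    """Illustrative end-to-end workflow: rank models, test the top two, pick a threshold."""
    # 1. Rank candidates by bootstrap win probability
    ranking, _ = robust_model_ranking(scores_dict, labels)
    best, runner_up = ranking[0][0], ranking[1][0]

    # 2. Is the top model significantly better than the runner-up?
    stats_results = comprehensive_comparison(scores_dict[best], scores_dict[runner_up], labels)
    print_comparison_report(stats_results, best, runner_up)

    # 3. Do the two ROC curves cross (dominance vs. partial ordering)?
    fpr_a, tpr_a, _ = roc_curve(labels, scores_dict[best])
    fpr_b, tpr_b, _ = roc_curve(labels, scores_dict[runner_up])
    print(f"Dominance: {check_roc_dominance(fpr_a, tpr_a, fpr_b, tpr_b)}")

    # 4. Choose an operating point for the selected model (on validation data, not test!)
    operating_point = find_optimal_threshold(labels, scores_dict[best],
                                             method=threshold_method, **thr_kwargs)
    return best, operating_point
```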
The same test set cannot be used for (1) model comparison, (2) threshold selection, and (3) final performance reporting. Use nested cross-validation or separate holdout sets to prevent optimistic bias in reported results.
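A minimal sketch of that discipline, reusing `X`, `y` from the visualization example and `find_optimal_threshold` from above (the split sizes and the Logistic Regression model are arbitrary choices for illustration):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Three-way split: fit on train, compare models / choose thresholds on validation,
# report final metrics once on the untouched test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

val_scores = model.predict_proba(X_val)[:, 1]
op = find_optimal_threshold(y_val, val_scores, method='youden')   # chosen on validation only

test_scores = model.predict_proba(X_test)[:, 1]
y_pred = (test_scores >= op['threshold']).astype(int)             # reported once on test
```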
We've developed comprehensive methods for comparing ROC and PR curves between classifiers: dominance checks, overlaid and difference plots, bootstrap and DeLong significance tests, crossover analysis for crossing curves, bootstrap-based multi-model ranking, and principled threshold selection.
Module Complete:
You have now mastered ROC and Precision-Recall curves—from construction to summary metrics to rigorous comparison. These tools form the foundation for evaluating any binary classifier, whether in academic research or production ML systems.
Congratulations! You've completed the comprehensive module on ROC and Precision-Recall Curves. You can now construct, interpret, summarize, and rigorously compare classifier evaluation curves—essential skills for any machine learning practitioner.