ROC curves are powerful, but they have a blind spot: they can look excellent even when the classifier performs poorly in practice on imbalanced data.
Imagine a disease screening where 1 in 10,000 patients has the condition. A classifier achieves TPR = 0.80 (catches 80% of cases) at FPR = 0.01 (1% false alarm rate). On the ROC curve, this looks great—a point far into the upper-left corner.
But consider the practical implications. In a population of 10,000 screened patients, roughly 1 has the condition and 9,999 do not:
- TPR = 0.80 means the classifier catches about 0.8 of that single true case.
- FPR = 0.01 means it falsely flags about 100 of the 9,999 healthy patients.
- Precision is therefore roughly 0.8 / (0.8 + 100) ≈ 0.8%: more than 99% of positive predictions are false alarms, as the quick check below confirms.
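A quick back-of-the-envelope check, using only the numbers stated in the scenario above (the prevalence, TPR, and FPR are assumptions of the example, not measured values):

```python
# Expected counts per 10,000 screened patients (scenario numbers, not real data)
prevalence = 1 / 10_000
tpr, fpr = 0.80, 0.01

n_patients = 10_000
n_pos = n_patients * prevalence      # ~1 true case
n_neg = n_patients - n_pos           # ~9,999 healthy patients

tp = tpr * n_pos                     # ~0.8 cases caught
fp = fpr * n_neg                     # ~100 healthy patients flagged

precision = tp / (tp + fp)
print(f"Precision ≈ {precision:.3%}")  # ≈ 0.8% -- almost every alarm is false
```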
The ROC curve hid this disaster. The Precision-Recall curve would have revealed it immediately.
By the end of this page, you will understand precision and recall as axis metrics, how to construct PR curves, their graphical interpretation, when PR curves are more informative than ROC curves, the relationship between PR and ROC curves, and practical guidelines for choosing between them.
Before constructing PR curves, we must deeply understand the metrics they display. Precision and recall answer different questions about classifier behavior.
Recall measures the fraction of actual positives that are correctly identified:
$$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} = \frac{\text{TP}}{P}$$
Question answered: Of all actual positive examples, how many did we catch?
Precision measures the fraction of positive predictions that are correct:
$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$$
Question answered: Of all examples we predicted positive, how many were actually positive?
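As a concrete illustration, here is a minimal computation of both metrics from raw counts and, equivalently, with scikit-learn. The labels and predictions are invented for the example:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])   # 3 actual positives
y_pred = np.array([1, 1, 0, 1, 1, 0, 0, 0])   # 4 positive predictions

tp = np.sum((y_pred == 1) & (y_true == 1))    # 2
fp = np.sum((y_pred == 1) & (y_true == 0))    # 2
fn = np.sum((y_pred == 0) & (y_true == 1))    # 1

print(tp / (tp + fn), recall_score(y_true, y_pred))      # recall    = 0.667
print(tp / (tp + fp), precision_score(y_true, y_pred))   # precision = 0.500
```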
ROC uses FPR = FP/N, which conditions on the actual negatives. Precision instead puts FP in its denominator, TP + FP, which conditions on the positive predictions. When negatives vastly outnumber positives (imbalanced data), even a small FPR produces many false positives, devastating precision.
As we lower the classification threshold:
- Recall can only stay the same or increase, because more of the actual positives are captured.
- Precision typically decreases, because the additional predictions include more false positives (although it can tick upward whenever a true positive crosses the threshold).
This tradeoff is fundamental: to catch more positives (higher recall), we usually accept more false positives (lower precision). The PR curve captures this tradeoff completely.
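A minimal sketch of the tradeoff on a toy score vector (the same toy scores reused in the construction walkthrough later on this page):

```python
import numpy as np

scores = np.array([0.9, 0.8, 0.7, 0.5, 0.3, 0.1])
labels = np.array([1, 0, 1, 0, 1, 0])

for threshold in (0.6, 0.2):
    preds = (scores >= threshold).astype(int)
    tp = np.sum((preds == 1) & (labels == 1))
    fp = np.sum((preds == 1) & (labels == 0))
    fn = np.sum((preds == 0) & (labels == 1))
    print(f"threshold={threshold}: precision={tp/(tp+fp):.2f}, recall={tp/(tp+fn):.2f}")

# Lowering the threshold raises recall (0.67 -> 1.00) but lowers precision (0.67 -> 0.60)
```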
Unlike recall, precision is sensitive to class imbalance:
$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} = \frac{\text{Recall} \cdot P}{\text{Recall} \cdot P + \text{FPR} \cdot N}$$
When N >> P (many more negatives than positives), even small FPR produces FP >> TP, crushing precision. This is exactly why PR curves reveal problems that ROC curves hide on imbalanced data.
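Plugging representative numbers into this identity makes the effect concrete (the class-balance values below are hypothetical):

```python
# Precision = (Recall * P) / (Recall * P + FPR * N), with Recall and FPR held fixed
recall, fpr = 0.80, 0.01

for p_fraction in (0.5, 0.1, 0.001):        # fraction of positives in the data
    P, N = p_fraction, 1.0 - p_fraction      # proportions work directly in the formula
    precision = (recall * P) / (recall * P + fpr * N)
    print(f"{p_fraction:>6.1%} positives -> precision = {precision:.3f}")

# 50% positives  -> precision ≈ 0.99
# 10% positives  -> precision ≈ 0.90
# 0.1% positives -> precision ≈ 0.07, even though Recall and FPR never changed
```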
The construction algorithm for PR curves parallels ROC construction but with different metrics computed.
Given n examples with scores and labels:
1. Sort the examples by score in descending order.
2. Sweep the threshold from highest to lowest, updating the running TP and FP counts as each example becomes a positive prediction.
3. At each distinct threshold, record precision = TP / (TP + FP) and recall = TP / P.
4. Prepend the conventional starting point (recall = 0, precision = 1).
Key difference from ROC: the curve runs in the opposite 'direction'. At the highest threshold it sits near recall = 0 with the precision of the top-ranked predictions (conventionally anchored at (0, 1)); at the lowest threshold everything is predicted positive and the curve ends at (recall = 1, precision = base rate).
```python
import numpy as np

def construct_pr_curve(scores, labels):
    """
    Construct a Precision-Recall curve from scores and labels.

    Parameters
    ----------
    scores : array-like
        Classifier output scores (higher = more likely positive)
    labels : array-like
        True binary labels (0 or 1)

    Returns
    -------
    precision : array
    recall : array
    thresholds : array
    """
    scores = np.array(scores)
    labels = np.array(labels)

    # Total positives
    P = np.sum(labels == 1)
    if P == 0:
        raise ValueError("No positive examples in dataset")

    # Sort by descending score
    sorted_indices = np.argsort(-scores)
    sorted_labels = labels[sorted_indices]
    sorted_scores = scores[sorted_indices]

    # Compute precision and recall at each threshold
    tp = 0
    fp = 0
    precisions = []
    recalls = []
    thresholds = []

    for i in range(len(sorted_labels)):
        if sorted_labels[i] == 1:
            tp += 1
        else:
            fp += 1

        precision = tp / (tp + fp)   # Precision = TP / (TP + FP)
        recall = tp / P              # Recall = TP / P

        # Record a point only at each unique threshold
        if i == len(sorted_labels) - 1 or sorted_scores[i] != sorted_scores[i + 1]:
            precisions.append(precision)
            recalls.append(recall)
            thresholds.append(sorted_scores[i])

    # Convention: add the starting point (recall=0, precision=1),
    # representing the "perfect precision" achievable at recall=0
    precisions = [1.0] + precisions
    recalls = [0.0] + recalls

    return np.array(precisions), np.array(recalls), np.array(thresholds)


# Example walkthrough
scores = [0.9, 0.8, 0.7, 0.5, 0.3, 0.1]
labels = [1, 0, 1, 0, 1, 0]  # P=3, N=3

print("Score | Label | TP | FP | Precision | Recall")
print("-" * 55)

P = sum(labels)
sorted_pairs = sorted(zip(scores, labels), key=lambda x: -x[0])
tp, fp = 0, 0
for score, label in sorted_pairs:
    if label == 1:
        tp += 1
    else:
        fp += 1
    prec = tp / (tp + fp)
    rec = tp / P
    print(f" {score:.1f}  |   {label}   | {tp}  | {fp}  |   {prec:.3f}   | {rec:.3f}")
```

Using scores = [0.9, 0.8, 0.7, 0.5, 0.3, 0.1] with labels = [1, 0, 1, 0, 1, 0]:
| Score | Label | TP | FP | Precision | Recall |
|---|---|---|---|---|---|
| 0.9 | 1 | 1 | 0 | 1.000 | 0.333 |
| 0.8 | 0 | 1 | 1 | 0.500 | 0.333 |
| 0.7 | 1 | 2 | 1 | 0.667 | 0.667 |
| 0.5 | 0 | 2 | 2 | 0.500 | 0.667 |
| 0.3 | 1 | 3 | 2 | 0.600 | 1.000 |
| 0.1 | 0 | 3 | 3 | 0.500 | 1.000 |
The PR curve passes through: (0, 1.0) → (0.333, 1.0) → (0.333, 0.5) → (0.667, 0.667) → (0.667, 0.5) → (1.0, 0.6) → (1.0, 0.5)
Unlike ROC curves where TPR monotonically increases, precision in PR curves can oscillate up and down. When we add a true positive (label=1), precision often increases; when we add a false positive (label=0), precision decreases. This creates the 'sawtooth' pattern characteristic of PR curves.
PR curves have a different visual language than ROC curves. Understanding this helps extract insights from PR plots.
Axes:
- X-axis: Recall = TP / P, ranging from 0 to 1.
- Y-axis: Precision = TP / (TP + FP), ranging from 0 to 1.
Goal: We want high precision AND high recall — upper-right corner is ideal.
This is crucial: the random baseline in PR space depends on class proportion.
For a dataset with 10% positives: a random (label-independent) classifier achieves precision ≈ 0.10 at every recall level, so the baseline is a horizontal line at 0.10.
For a dataset with 50% positives: the same random classifier achieves precision ≈ 0.50, so the baseline sits at 0.50.
This means PR curves cannot be directly compared across datasets with different class distributions, unlike ROC curves where the random baseline is always the diagonal.
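A quick empirical check of the baseline claim, using random scores that carry no information about the labels (the sample size and seed are arbitrary choices for this sketch):

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)

for pos_fraction in (0.5, 0.1):
    n = 100_000
    labels = (rng.random(n) < pos_fraction).astype(int)
    scores = rng.random(n)                      # scores independent of the labels
    ap = average_precision_score(labels, scores)
    print(f"{pos_fraction:.0%} positives: AP of a random scorer ≈ {ap:.3f}")
    # AP lands near the base rate (≈0.50 and ≈0.10), matching the horizontal baseline
```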
Curve hugging the upper-right corner: the classifier keeps precision high even as recall approaches 1; this is the signature of strong performance.
Rapidly dropping precision as recall increases: only the top-ranked predictions are trustworthy; pushing for more recall drags in many false positives.
Staircase pattern: typical of small test sets, where each true positive nudges precision up and each false positive pulls it down, producing discrete sawtooth steps.
PR curves often show a final drop or plateau at high recall. This happens when the last positives (with lowest scores) are finally included, bringing in many false positives with similar scores. This 'tail' represents the classifier's difficulty with ambiguous cases.
Unlike ROC curves where linear interpolation between points is straightforward, PR curve interpolation requires careful treatment. Naive linear interpolation can produce misleading areas.
Between two observed points (R₁, P₁) and (R₂, P₂), what precision is achievable at recall R where R₁ < R < R₂?
Linear interpolation assumes: $$P(R) = P_1 + \frac{P_2 - P_1}{R_2 - R_1}(R - R_1)$$
But this is incorrect for PR curves. The true interpolation depends on how examples are distributed between the thresholds.
Two common conventions:
- Interpolated (maximum) precision: at recall level r, use the maximum precision observed at any recall ≥ r. This traces the upper envelope of the sawtooth.
- Step-function treatment: hold precision constant between observed points and sum the resulting rectangles, weighting each precision by the change in recall (the approach scikit-learn takes for average precision).
The maximum precision interpolation is widely used for calculating Average Precision.
```python
import numpy as np

def interpolate_pr_curve(precision, recall, num_points=101):
    """
    Interpolate a PR curve using the 'maximum precision' method.

    At each recall level r, precision is the maximum precision observed at
    any recall >= r. This gives the 'optimistic' upper-envelope interpolation
    used for interpolated Average Precision.
    """
    precision = np.asarray(precision, dtype=float)
    recall = np.asarray(recall, dtype=float)

    # Ensure recall is increasing
    sorted_indices = np.argsort(recall)
    recall = recall[sorted_indices]
    precision = precision[sorted_indices]

    # Maximum precision at each recall level, looking forward:
    # precision_interp[i] = max(precision[i:])
    precision_interp = np.maximum.accumulate(precision[::-1])[::-1]

    # Sample at regular recall intervals
    recall_levels = np.linspace(0, 1, num_points)
    interpolated_precision = np.zeros_like(recall_levels)

    for i, r in enumerate(recall_levels):
        # First observed recall >= r
        idx = np.searchsorted(recall, r)
        if idx >= len(recall):
            # Beyond observed data: fall back to the last envelope value
            interpolated_precision[i] = precision_interp[-1]
        else:
            interpolated_precision[i] = precision_interp[idx]

    return recall_levels, interpolated_precision


def eleven_point_interpolation(precision, recall):
    """
    Interpolated precision at 11 standard recall levels (0.0, 0.1, ..., 1.0),
    as used in classic TREC information-retrieval evaluation.
    """
    precision = np.asarray(precision, dtype=float)
    recall = np.asarray(recall, dtype=float)

    recall_levels = np.linspace(0, 1, 11)
    interp_precision = []
    for r in recall_levels:
        # Maximum precision at any recall >= r (0 if no such point exists)
        valid = precision[recall >= r]
        interp_precision.append(valid.max() if valid.size > 0 else 0.0)

    return recall_levels, np.array(interp_precision)
```

scikit-learn's precision_recall_curve returns its points ordered by increasing threshold and appends a final (recall = 0, precision = 1) point. Its average_precision_score does not interpolate at all: it computes the sum of (R_n - R_{n-1}) * P_n over the step function, weighting each precision by the change in recall at that point, which is deliberately different from (and less optimistic than) trapezoidal integration with linear interpolation.
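A small check of that identity on the toy data from the walkthrough above; this is a sketch that simply re-implements the step-function sum and compares it with scikit-learn's own result:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

scores = np.array([0.9, 0.8, 0.7, 0.5, 0.3, 0.1])
labels = np.array([1, 0, 1, 0, 1, 0])

precision, recall, _ = precision_recall_curve(labels, scores)

# Step-function area: each precision weighted by the change in recall.
# recall is returned in decreasing order here, hence the leading minus sign.
ap_manual = -np.sum(np.diff(recall) * precision[:-1])

print(ap_manual, average_precision_score(labels, scores))  # both ≈ 0.756
```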
PR and ROC curves provide complementary views of classifier performance. Understanding their relationship helps choose the right tool.
Both curves use the same underlying confusion matrix, just focusing on different ratios:
| Metric | ROC Curve | PR Curve |
|---|---|---|
| X-axis | FPR = FP / N | Recall = TP / P |
| Y-axis | TPR = TP / P | Precision = TP / (TP+FP) |
| Conditions on | Actual class | Predicted class (for precision) |
Key insight: Recall appears in both curves (it is simply named TPR in ROC). The metric paired with it differs:
- ROC pairs recall with FPR = FP / N, conditioned on the actual negatives.
- PR pairs recall with precision = TP / (TP + FP), conditioned on the positive predictions.
Precision involves FP in the denominator, making it sensitive to class imbalance in a way TPR isn't.
```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def compare_roc_pr_under_imbalance():
    """
    Demonstrate how the same classifier behavior looks different
    on ROC vs PR metrics under class imbalance.
    """
    np.random.seed(42)

    # Scenario: same underlying separation, different class proportions.
    # Scores are drawn so that ROC-AUC is roughly 0.85 in every scenario.
    for pos_fraction in [0.5, 0.1, 0.01]:
        n_total = 10000
        n_pos = int(n_total * pos_fraction)
        n_neg = n_total - n_pos

        # Positive scores: mean 0.6, std 0.15; negative scores: mean 0.4, std 0.15
        pos_scores = np.clip(np.random.normal(0.6, 0.15, n_pos), 0, 1)
        neg_scores = np.clip(np.random.normal(0.4, 0.15, n_neg), 0, 1)

        scores = np.concatenate([pos_scores, neg_scores])
        labels = np.array([1] * n_pos + [0] * n_neg)

        auc = roc_auc_score(labels, scores)
        ap = average_precision_score(labels, scores)

        print(f"\nClass balance: {pos_fraction:.0%} positive")
        print(f"  ROC-AUC:      {auc:.3f}")
        print(f"  Avg Prec:     {ap:.3f}")
        print(f"  Ratio AP/AUC: {ap/auc:.2f}")

compare_roc_pr_under_imbalance()

# The output demonstrates that ROC-AUC stays ~0.85 across all imbalance levels,
# while Average Precision (PR-AUC) drops dramatically as imbalance increases.
```

| Class Balance | ROC-AUC | Average Precision | Interpretation |
|---|---|---|---|
| 50% positive | 0.85 | 0.85 | Balanced: both metrics similar |
| 10% positive | 0.85 | 0.52 | Moderate imbalance: PR shows difficulty |
| 1% positive | 0.85 | 0.15 | Severe imbalance: PR reveals poor precision |
The same classifier, with identical ranking behavior, shows dramatically different PR performance as imbalance increases. ROC-AUC stays the same because it doesn't see the flood of false positives that drowns precision.
If you're building a disease detector and report AUC = 0.85, stakeholders might expect good performance. But if the disease is rare (1% prevalence), Average Precision = 0.15 reveals the truth: most positive predictions will be false alarms.
Reporting only ROC-AUC for imbalanced problems is misleading. A classifier might achieve AUC = 0.95 while having precision < 0.10 at practical recall levels. Always report both ROC and PR metrics for imbalanced data, or at minimum report PR metrics if positive predictions drive decisions.
Comparing classifiers via their PR curves requires understanding dominance relationships.
PR curve A dominates PR curve B if A's precision is at least as high as B's at every recall level, and strictly higher at some level; in other words, curve A lies on or above curve B everywhere.
If A dominates B, then A is unambiguously better—no threshold choice makes B preferable.
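A minimal sketch of a dominance check, under the assumption that both curves have already been resampled onto a common recall grid (for example with the interpolate_pr_curve helper defined earlier on this page):

```python
import numpy as np

def dominates(precision_a, precision_b, tol=1e-12):
    """
    True if curve A lies on or above curve B at every shared recall level
    and strictly above it somewhere. Both arrays must be sampled on the
    same recall grid.
    """
    precision_a = np.asarray(precision_a)
    precision_b = np.asarray(precision_b)
    never_below = np.all(precision_a >= precision_b - tol)
    somewhere_above = np.any(precision_a > precision_b + tol)
    return never_below and somewhere_above
```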
A remarkable theoretical result (Davis & Goadrich, 2006):
A curve dominates in ROC space if and only if it dominates in PR space. So if classifier A is uniformly better than B on the ROC plot, it is also uniformly better on the PR plot, and vice versa.
This equivalence does not carry over to the areas under the curves: a model with the higher ROC-AUC can still have the lower PR-AUC, because the two areas weight the same ranking mistakes very differently. Optimizing one does not guarantee optimizing the other.
When neither curve dominates (the curves cross), neither classifier is universally better. The choice depends on the operating point: identify the recall (or precision) your application must achieve, and prefer the model that performs better there, as the comparison code below illustrates.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

def compare_pr_curves(scores_list, labels, model_names, target_recalls=(0.7, 0.8, 0.9)):
    """
    Compare multiple classifiers via PR curves and report
    precision at target recall levels.
    """
    plt.figure(figsize=(10, 6))
    results = {}

    for scores, name in zip(scores_list, model_names):
        precision, recall, thresholds = precision_recall_curve(labels, scores)

        # Plot
        plt.plot(recall, precision, label=name, linewidth=2)

        # Precision achievable at each target recall level
        results[name] = {}
        for target in target_recalls:
            # Interpolated precision: max precision at any recall >= target
            valid_idx = np.where(recall >= target)[0]
            prec_at_target = np.max(precision[valid_idx]) if len(valid_idx) > 0 else 0.0
            results[name][f'precision@recall={target}'] = prec_at_target

    # Random baseline: horizontal line at the positive base rate
    base_rate = np.mean(labels)
    plt.axhline(y=base_rate, color='gray', linestyle='--',
                label=f'Random (baseline={base_rate:.2f})')

    plt.xlabel('Recall', fontsize=12)
    plt.ylabel('Precision', fontsize=12)
    plt.title('Precision-Recall Curve Comparison', fontsize=14)
    plt.legend(loc='best')
    plt.grid(alpha=0.3)
    plt.xlim([0, 1.02])
    plt.ylim([0, 1.02])
    plt.tight_layout()
    plt.show()

    # Print comparison table
    print("\nPrecision at Target Recall Levels:")
    print("-" * 50)
    header = "Model".ljust(20) + " | ".join([f"R={r}" for r in target_recalls])
    print(header)
    print("-" * 50)
    for name in model_names:
        row = name.ljust(20)
        for target in target_recalls:
            row += f"{results[name][f'precision@recall={target}']:6.3f} "
        print(row)

    return results
```

Instead of comparing entire curves, compare at specific operating points relevant to your use case. If you need to achieve recall ≥ 0.8, compare precision at recall = 0.8. This gives actionable insights rather than abstract curve comparisons.
Effective PR curve visualization requires attention to several practical details.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import (precision_recall_curve, average_precision_score,
                             roc_curve, roc_auc_score)

def create_comprehensive_pr_visualization():
    """
    Create a publication-quality PR/ROC curve comparison with all relevant context.
    """
    # Create an imbalanced dataset (5% positive)
    X, y = make_classification(
        n_samples=5000,
        n_features=20,
        n_informative=10,
        weights=[0.95, 0.05],
        random_state=42
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42
    )

    # Train models
    models = {
        'Logistic Regression': LogisticRegression(max_iter=1000),
        'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
        'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42)
    }

    fig, axes = plt.subplots(1, 2, figsize=(14, 6))
    colors = ['#2563eb', '#dc2626', '#16a34a']

    for ax, curve_type in zip(axes, ['PR', 'ROC']):
        for (name, model), color in zip(models.items(), colors):
            model.fit(X_train, y_train)
            probs = model.predict_proba(X_test)[:, 1]

            if curve_type == 'PR':
                precision, recall, _ = precision_recall_curve(y_test, probs)
                ap = average_precision_score(y_test, probs)
                ax.plot(recall, precision, color=color, linewidth=2,
                        label=f'{name} (AP={ap:.3f})')
            else:  # ROC
                fpr, tpr, _ = roc_curve(y_test, probs)
                auc = roc_auc_score(y_test, probs)
                ax.plot(fpr, tpr, color=color, linewidth=2,
                        label=f'{name} (AUC={auc:.3f})')

        if curve_type == 'PR':
            # Random baseline for PR: horizontal line at the base rate
            base_rate = np.mean(y_test)
            ax.axhline(y=base_rate, color='gray', linestyle='--',
                       label=f'Random (P={base_rate:.3f})')
            ax.set_xlabel('Recall', fontsize=12)
            ax.set_ylabel('Precision', fontsize=12)
            ax.set_title('Precision-Recall Curves\n(5% positive class)', fontsize=14)
        else:
            # Random baseline for ROC: the diagonal
            ax.plot([0, 1], [0, 1], 'k--', label='Random')
            ax.set_xlabel('False Positive Rate', fontsize=12)
            ax.set_ylabel('True Positive Rate', fontsize=12)
            ax.set_title('ROC Curves\n(5% positive class)', fontsize=14)

        ax.set_xlim([-0.02, 1.02])
        ax.set_ylim([-0.02, 1.02])
        ax.legend(loc='lower left' if curve_type == 'PR' else 'lower right')
        ax.grid(alpha=0.3)

    plt.tight_layout()
    plt.savefig('pr_roc_comparison_imbalanced.png', dpi=150, bbox_inches='tight')
    plt.show()

    # Print summary
    print("\nSummary (5% positive class):")
    print("-" * 50)
    print(f"{'Model':<25} {'AUC':>8} {'Avg Prec':>10}")
    print("-" * 50)
    for name, model in models.items():
        probs = model.predict_proba(X_test)[:, 1]
        auc = roc_auc_score(y_test, probs)
        ap = average_precision_score(y_test, probs)
        print(f"{name:<25} {auc:>8.3f} {ap:>10.3f}")

    print("\nNote: AUC values are similar (~0.9), but Average Precision")
    print("reveals true performance differences on imbalanced data.")

create_comprehensive_pr_visualization()
```

Choosing between PR and ROC curves isn't about which is 'better'; it's about which better matches your evaluation needs.
| Factor | Prefer ROC Curves | Prefer PR Curves |
|---|---|---|
| Class balance | Balanced or moderate imbalance | Severe imbalance (rare positives) |
| What matters | Overall discrimination | Positive predictions quality |
| Use case | Compare ranking power | Evaluate predicted positive reliability |
| Negatives count | True negatives are important | True negatives are 'background' |
| Baseline | Fixed (diagonal) | Varies with class proportion |
| Cross-dataset comparison | Yes (same baseline) | No (different baselines) |
| Statistical tools | Well-developed (DeLong, etc.) | Less standardized |
For imbalanced problems, report BOTH ROC-AUC and Average Precision. This provides a complete picture: AUC shows ranking quality independent of class distribution; AP shows precision you can expect when acting on positive predictions. Stakeholders benefit from seeing both perspectives.
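A minimal reporting sketch along these lines; it assumes you already have held-out labels y_test and predicted scores probs, as in the examples above:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def report_ranking_metrics(y_test, probs):
    """Report both views: ranking quality (ROC-AUC) and positive-prediction quality (AP)."""
    print(f"ROC-AUC:            {roc_auc_score(y_test, probs):.3f}")
    print(f"Average Precision:  {average_precision_score(y_test, probs):.3f}")
    print(f"Positive base rate: {np.mean(y_test):.3f}  (PR-curve baseline)")
```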
We've developed comprehensive knowledge of Precision-Recall curves, from construction to interpretation to practical application. Let's consolidate the key insights:
- Precision conditions on positive predictions; recall conditions on actual positives.
- PR curves are built by the same threshold sweep as ROC curves, but precision can oscillate, producing the characteristic sawtooth shape.
- The random baseline in PR space is the positive base rate, so PR curves are not directly comparable across datasets with different class distributions.
- Linear interpolation between PR points is misleading; use the maximum-precision (interpolated) or step-function conventions instead.
- ROC dominance and PR dominance coincide, but ROC-AUC and PR-AUC can rank models differently; on imbalanced data, report both.
What's next:
The next page covers Average Precision (AP)—the area under the PR curve and the standard scalar summary for PR analysis. We'll explore its exact computation, relationship to other metrics, and interpretation for model comparison.
You now understand PR curves deeply: their construction, graphical interpretation, relationship to ROC curves, and when each is appropriate. You can confidently evaluate classifiers on imbalanced data and communicate performance to stakeholders accurately.