ROC curves are powerful, but they have a blind spot: they can look excellent even when the classifier performs poorly in practice on imbalanced data.
Imagine a disease screening where 1 in 10,000 patients has the condition. A classifier achieves TPR = 0.80 (catches 80% of cases) at FPR = 0.01 (1% false alarm rate). On the ROC curve, this looks great—a point far into the upper-left corner.
But consider the practical implications. In a population of 10,000 screened patients, roughly 1 has the condition and 9,999 do not:
- TPR = 0.80 means the classifier catches about 0.8 of that single true case.
- FPR = 0.01 means it falsely flags about 100 of the 9,999 healthy patients.
- Precision is therefore roughly 0.8 / (0.8 + 100) ≈ 0.8%: more than 99% of positive predictions are false alarms, as the quick check below confirms.
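A quick back-of-the-envelope check, using only the numbers stated in the scenario above (the prevalence, TPR, and FPR are assumptions of the example, not measured values):

```python
# Expected counts per 10,000 screened patients (scenario numbers, not real data)
prevalence = 1 / 10_000
tpr, fpr = 0.80, 0.01

n_patients = 10_000
n_pos = n_patients * prevalence      # ~1 true case
n_neg = n_patients - n_pos           # ~9,999 healthy patients

tp = tpr * n_pos                     # ~0.8 cases caught
fp = fpr * n_neg                     # ~100 healthy patients flagged

precision = tp / (tp + fp)
print(f"Precision ≈ {precision:.3%}")  # ≈ 0.8% -- almost every alarm is false
```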
The ROC curve hid this disaster. The Precision-Recall curve would have revealed it immediately.
By the end of this page, you will understand precision and recall as axis metrics, how to construct PR curves, their graphical interpretation, when PR curves are more informative than ROC curves, the relationship between PR and ROC curves, and practical guidelines for choosing between them.
Before constructing PR curves, we must deeply understand the metrics they display. Precision and recall answer different questions about classifier behavior.
Recall measures the fraction of actual positives that are correctly identified:
$$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} = \frac{\text{TP}}{P}$$
Question answered: Of all actual positive examples, how many did we catch?
Precision measures the fraction of positive predictions that are correct:
$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$$
Question answered: Of all examples we predicted positive, how many were actually positive?
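As a concrete illustration, here is a minimal computation of both metrics from raw counts and, equivalently, with scikit-learn. The labels and predictions are invented for the example:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])   # 3 actual positives
y_pred = np.array([1, 1, 0, 1, 1, 0, 0, 0])   # 4 positive predictions

tp = np.sum((y_pred == 1) & (y_true == 1))    # 2
fp = np.sum((y_pred == 1) & (y_true == 0))    # 2
fn = np.sum((y_pred == 0) & (y_true == 1))    # 1

print(tp / (tp + fn), recall_score(y_true, y_pred))      # recall    = 0.667
print(tp / (tp + fp), precision_score(y_true, y_pred))   # precision = 0.500
```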
ROC uses FPR = FP/N, which conditions on the actual negatives. Precision instead puts FP in its denominator, TP + FP, which conditions on the positive predictions. When negatives vastly outnumber positives (imbalanced data), even a small FPR produces many false positives, devastating precision.
As we lower the classification threshold:
- Recall can only stay the same or increase, because more of the actual positives are captured.
- Precision typically decreases, because the additional predictions include more false positives (although it can tick upward whenever a true positive crosses the threshold).
This tradeoff is fundamental: to catch more positives (higher recall), we usually accept more false positives (lower precision). The PR curve captures this tradeoff completely.
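A minimal sketch of the tradeoff on a toy score vector (the same toy scores reused in the construction walkthrough later on this page):

```python
import numpy as np

scores = np.array([0.9, 0.8, 0.7, 0.5, 0.3, 0.1])
labels = np.array([1, 0, 1, 0, 1, 0])

for threshold in (0.6, 0.2):
    preds = (scores >= threshold).astype(int)
    tp = np.sum((preds == 1) & (labels == 1))
    fp = np.sum((preds == 1) & (labels == 0))
    fn = np.sum((preds == 0) & (labels == 1))
    print(f"threshold={threshold}: precision={tp/(tp+fp):.2f}, recall={tp/(tp+fn):.2f}")

# Lowering the threshold raises recall (0.67 -> 1.00) but lowers precision (0.67 -> 0.60)
```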
Unlike recall, precision is sensitive to class imbalance:
$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} = \frac{\text{Recall} \cdot P}{\text{Recall} \cdot P + \text{FPR} \cdot N}$$
When N >> P (many more negatives than positives), even small FPR produces FP >> TP, crushing precision. This is exactly why PR curves reveal problems that ROC curves hide on imbalanced data.
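Plugging representative numbers into this identity makes the effect concrete (the class-balance values below are hypothetical):

```python
# Precision = (Recall * P) / (Recall * P + FPR * N), with Recall and FPR held fixed
recall, fpr = 0.80, 0.01

for p_fraction in (0.5, 0.1, 0.001):        # fraction of positives in the data
    P, N = p_fraction, 1.0 - p_fraction      # proportions work directly in the formula
    precision = (recall * P) / (recall * P + fpr * N)
    print(f"{p_fraction:>6.1%} positives -> precision = {precision:.3f}")

# 50% positives  -> precision ≈ 0.99
# 10% positives  -> precision ≈ 0.90
# 0.1% positives -> precision ≈ 0.07, even though Recall and FPR never changed
```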
The construction algorithm for PR curves parallels ROC construction but with different metrics computed.
Given n examples with scores and labels:
1. Sort the examples by score in descending order.
2. Sweep the threshold from highest to lowest, updating the running TP and FP counts as each example becomes a positive prediction.
3. At each distinct threshold, record precision = TP / (TP + FP) and recall = TP / P.
4. Prepend the conventional starting point (recall = 0, precision = 1).
Key difference from ROC: the curve runs in the opposite 'direction'. At the highest threshold it sits near recall = 0 with the precision of the top-ranked predictions (conventionally anchored at (0, 1)); at the lowest threshold everything is predicted positive and the curve ends at (recall = 1, precision = base rate).
```python
import numpy as np

def construct_pr_curve(scores, labels):
    """
    Construct a Precision-Recall curve from scores and labels.

    Parameters
    ----------
    scores : array-like
        Classifier output scores (higher = more likely positive)
    labels : array-like
        True binary labels (0 or 1)

    Returns
    -------
    precision : array
    recall : array
    thresholds : array
    """
    scores = np.array(scores)
    labels = np.array(labels)

    # Total positives
    P = np.sum(labels == 1)
    if P == 0:
        raise ValueError("No positive examples in dataset")

    # Sort by descending score
    sorted_indices = np.argsort(-scores)
    sorted_labels = labels[sorted_indices]
    sorted_scores = scores[sorted_indices]

    # Compute precision and recall at each threshold
    tp = 0
    fp = 0
    precisions = []
    recalls = []
    thresholds = []

    for i in range(len(sorted_labels)):
        if sorted_labels[i] == 1:
            tp += 1
        else:
            fp += 1

        precision = tp / (tp + fp)   # Precision = TP / (TP + FP)
        recall = tp / P              # Recall = TP / P

        # Record a point only at each unique threshold
        if i == len(sorted_labels) - 1 or sorted_scores[i] != sorted_scores[i + 1]:
            precisions.append(precision)
            recalls.append(recall)
            thresholds.append(sorted_scores[i])

    # Convention: add the starting point (recall=0, precision=1),
    # representing the "perfect precision" achievable at recall=0
    precisions = [1.0] + precisions
    recalls = [0.0] + recalls

    return np.array(precisions), np.array(recalls), np.array(thresholds)


# Example walkthrough
scores = [0.9, 0.8, 0.7, 0.5, 0.3, 0.1]
labels = [1, 0, 1, 0, 1, 0]  # P=3, N=3

print("Score | Label | TP | FP | Precision | Recall")
print("-" * 55)

P = sum(labels)
sorted_pairs = sorted(zip(scores, labels), key=lambda x: -x[0])
tp, fp = 0, 0
for score, label in sorted_pairs:
    if label == 1:
        tp += 1
    else:
        fp += 1
    prec = tp / (tp + fp)
    rec = tp / P
    print(f" {score:.1f}  |   {label}   | {tp}  | {fp}  |   {prec:.3f}   | {rec:.3f}")
```

Using scores = [0.9, 0.8, 0.7, 0.5, 0.3, 0.1] with labels = [1, 0, 1, 0, 1, 0]:
| Score | Label | TP | FP | Precision | Recall |
|---|---|---|---|---|---|
| 0.9 | 1 | 1 | 0 | 1.000 | 0.333 |
| 0.8 | 0 | 1 | 1 | 0.500 | 0.333 |
| 0.7 | 1 | 2 | 1 | 0.667 | 0.667 |
| 0.5 | 0 | 2 | 2 | 0.500 | 0.667 |
| 0.3 | 1 | 3 | 2 | 0.600 | 1.000 |
| 0.1 | 0 | 3 | 3 | 0.500 | 1.000 |
The PR curve passes through: (0, 1.0) → (0.333, 1.0) → (0.333, 0.5) → (0.667, 0.667) → (0.667, 0.5) → (1.0, 0.6) → (1.0, 0.5)
Unlike ROC curves where TPR monotonically increases, precision in PR curves can oscillate up and down. When we add a true positive (label=1), precision often increases; when we add a false positive (label=0), precision decreases. This creates the 'sawtooth' pattern characteristic of PR curves.
PR curves have a different visual language than ROC curves. Understanding this helps extract insights from PR plots.
Axes:
- X-axis: Recall = TP / P, ranging from 0 to 1.
- Y-axis: Precision = TP / (TP + FP), ranging from 0 to 1.
Goal: We want high precision AND high recall — upper-right corner is ideal.
This is crucial: the random baseline in PR space depends on class proportion.
For a dataset with 10% positives: a random (label-independent) classifier achieves precision ≈ 0.10 at every recall level, so the baseline is a horizontal line at 0.10.
For a dataset with 50% positives: the same random classifier achieves precision ≈ 0.50, so the baseline sits at 0.50.
This means PR curves cannot be directly compared across datasets with different class distributions, unlike ROC curves where the random baseline is always the diagonal.
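A quick empirical check of the baseline claim, using random scores that carry no information about the labels (the sample size and seed are arbitrary choices for this sketch):

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)

for pos_fraction in (0.5, 0.1):
    n = 100_000
    labels = (rng.random(n) < pos_fraction).astype(int)
    scores = rng.random(n)                      # scores independent of the labels
    ap = average_precision_score(labels, scores)
    print(f"{pos_fraction:.0%} positives: AP of a random scorer ≈ {ap:.3f}")
    # AP lands near the base rate (≈0.50 and ≈0.10), matching the horizontal baseline
```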
Curve hugging the upper-right corner: the classifier keeps precision high even as recall approaches 1; this is the signature of strong performance.
Rapidly dropping precision as recall increases: only the top-ranked predictions are trustworthy; pushing for more recall drags in many false positives.
Staircase pattern: typical of small test sets, where each true positive nudges precision up and each false positive pulls it down, producing discrete sawtooth steps.
PR curves often show a final drop or plateau at high recall. This happens when the last positives (with lowest scores) are finally included, bringing in many false positives with similar scores. This 'tail' represents the classifier's difficulty with ambiguous cases.
Unlike ROC curves where linear interpolation between points is straightforward, PR curve interpolation requires careful treatment. Naive linear interpolation can produce misleading areas.
Between two observed points (R₁, P₁) and (R₂, P₂), what precision is achievable at recall R where R₁ < R < R₂?
Linear interpolation assumes: $$P(R) = P_1 + \frac{P_2 - P_1}{R_2 - R_1}(R - R_1)$$
But this is incorrect for PR curves. The true interpolation depends on how examples are distributed between the thresholds.
Two common conventions:
- Interpolated (maximum) precision: at recall level r, use the maximum precision observed at any recall ≥ r. This traces the upper envelope of the sawtooth.
- Step-function treatment: hold precision constant between observed points and sum the resulting rectangles, weighting each precision by the change in recall (the approach scikit-learn takes for average precision).
The maximum precision interpolation is widely used for calculating Average Precision.
```python
import numpy as np

def interpolate_pr_curve(precision, recall, num_points=101):
    """
    Interpolate a PR curve using the 'maximum precision' method.

    At each recall level r, precision is the maximum precision observed at
    any recall >= r. This gives the 'optimistic' upper-envelope interpolation
    used for interpolated Average Precision.
    """
    precision = np.asarray(precision, dtype=float)
    recall = np.asarray(recall, dtype=float)

    # Ensure recall is increasing
    sorted_indices = np.argsort(recall)
    recall = recall[sorted_indices]
    precision = precision[sorted_indices]

    # Maximum precision at each recall level, looking forward:
    # precision_interp[i] = max(precision[i:])
    precision_interp = np.maximum.accumulate(precision[::-1])[::-1]

    # Sample at regular recall intervals
    recall_levels = np.linspace(0, 1, num_points)
    interpolated_precision = np.zeros_like(recall_levels)

    for i, r in enumerate(recall_levels):
        # First observed recall >= r
        idx = np.searchsorted(recall, r)
        if idx >= len(recall):
            # Beyond observed data: fall back to the last envelope value
            interpolated_precision[i] = precision_interp[-1]
        else:
            interpolated_precision[i] = precision_interp[idx]

    return recall_levels, interpolated_precision


def eleven_point_interpolation(precision, recall):
    """
    Interpolated precision at 11 standard recall levels (0.0, 0.1, ..., 1.0),
    as used in classic TREC information-retrieval evaluation.
    """
    precision = np.asarray(precision, dtype=float)
    recall = np.asarray(recall, dtype=float)

    recall_levels = np.linspace(0, 1, 11)
    interp_precision = []
    for r in recall_levels:
        # Maximum precision at any recall >= r (0 if no such point exists)
        valid = precision[recall >= r]
        interp_precision.append(valid.max() if valid.size > 0 else 0.0)

    return recall_levels, np.array(interp_precision)
```

scikit-learn's precision_recall_curve returns its points ordered by increasing threshold and appends a final (recall = 0, precision = 1) point. Its average_precision_score does not interpolate at all: it computes the sum of (R_n - R_{n-1}) * P_n over the step function, weighting each precision by the change in recall at that point, which is deliberately different from (and less optimistic than) trapezoidal integration with linear interpolation.
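A small check of that identity on the toy data from the walkthrough above; this is a sketch that simply re-implements the step-function sum and compares it with scikit-learn's own result:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

scores = np.array([0.9, 0.8, 0.7, 0.5, 0.3, 0.1])
labels = np.array([1, 0, 1, 0, 1, 0])

precision, recall, _ = precision_recall_curve(labels, scores)

# Step-function area: each precision weighted by the change in recall.
# recall is returned in decreasing order here, hence the leading minus sign.
ap_manual = -np.sum(np.diff(recall) * precision[:-1])

print(ap_manual, average_precision_score(labels, scores))  # both ≈ 0.756
```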
PR and ROC curves provide complementary views of classifier performance. Understanding their relationship helps choose the right tool.
Both curves use the same underlying confusion matrix, just focusing on different ratios:
| Metric | ROC Curve | PR Curve |
|---|---|---|
| X-axis | FPR = FP / N | Recall = TP / P |
| Y-axis | TPR = TP / P | Precision = TP / (TP+FP) |
| Conditions on | Actual class | Predicted class (for precision) |
Key insight: Recall appears in both curves (it is simply named TPR in ROC). The metric paired with it differs:
- ROC pairs recall with FPR = FP / N, conditioned on the actual negatives.
- PR pairs recall with precision = TP / (TP + FP), conditioned on the positive predictions.
Precision involves FP in the denominator, making it sensitive to class imbalance in a way TPR isn't.
```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def compare_roc_pr_under_imbalance():
    """
    Demonstrate how the same classifier behavior looks different
    on ROC vs PR metrics under class imbalance.
    """
    np.random.seed(42)

    # Scenario: same underlying separation, different class proportions.
    # Scores are drawn so that ROC-AUC is roughly 0.85 in every scenario.
    for pos_fraction in [0.5, 0.1, 0.01]:
        n_total = 10000
        n_pos = int(n_total * pos_fraction)
        n_neg = n_total - n_pos

        # Positive scores: mean 0.6, std 0.15; negative scores: mean 0.4, std 0.15
        pos_scores = np.clip(np.random.normal(0.6, 0.15, n_pos), 0, 1)
        neg_scores = np.clip(np.random.normal(0.4, 0.15, n_neg), 0, 1)

        scores = np.concatenate([pos_scores, neg_scores])
        labels = np.array([1] * n_pos + [0] * n_neg)

        auc = roc_auc_score(labels, scores)
        ap = average_precision_score(labels, scores)

        print(f"\nClass balance: {pos_fraction:.0%} positive")
        print(f"  ROC-AUC:      {auc:.3f}")
        print(f"  Avg Prec:     {ap:.3f}")
        print(f"  Ratio AP/AUC: {ap/auc:.2f}")

compare_roc_pr_under_imbalance()

# The output demonstrates that ROC-AUC stays ~0.85 across all imbalance levels,
# while Average Precision (PR-AUC) drops dramatically as imbalance increases.
```

| Class Balance | ROC-AUC | Average Precision | Interpretation |
|---|---|---|---|
| 50% positive | 0.85 | 0.85 | Balanced: both metrics similar |
| 10% positive | 0.85 | 0.52 | Moderate imbalance: PR shows difficulty |
| 1% positive | 0.85 | 0.15 | Severe imbalance: PR reveals poor precision |
The same classifier, with identical ranking behavior, shows dramatically different PR performance as imbalance increases. ROC-AUC stays the same because it doesn't see the flood of false positives that drowns precision.
If you're building a disease detector and report AUC = 0.85, stakeholders might expect good performance. But if the disease is rare (1% prevalence), Average Precision = 0.15 reveals the truth: most positive predictions will be false alarms.
Reporting only ROC-AUC for imbalanced problems is misleading. A classifier might achieve AUC = 0.95 while having precision < 0.10 at practical recall levels. Always report both ROC and PR metrics for imbalanced data, or at minimum report PR metrics if positive predictions drive decisions.
Comparing classifiers via their PR curves requires understanding dominance relationships.
PR curve A dominates PR curve B if A's precision is at least as high as B's at every recall level, and strictly higher at some level; in other words, curve A lies on or above curve B everywhere.
If A dominates B, then A is unambiguously better—no threshold choice makes B preferable.
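A minimal sketch of a dominance check, under the assumption that both curves have already been resampled onto a common recall grid (for example with the interpolate_pr_curve helper defined earlier on this page):

```python
import numpy as np

def dominates(precision_a, precision_b, tol=1e-12):
    """
    True if curve A lies on or above curve B at every shared recall level
    and strictly above it somewhere. Both arrays must be sampled on the
    same recall grid.
    """
    precision_a = np.asarray(precision_a)
    precision_b = np.asarray(precision_b)
    never_below = np.all(precision_a >= precision_b - tol)
    somewhere_above = np.any(precision_a > precision_b + tol)
    return never_below and somewhere_above
```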
A remarkable theoretical result (Davis & Goadrich, 2006):
A curve dominates in ROC space if and only if it dominates in PR space. So if classifier A is uniformly better than B on the ROC plot, it is also uniformly better on the PR plot, and vice versa.
This equivalence does not carry over to the areas under the curves: a model with the higher ROC-AUC can still have the lower PR-AUC, because the two areas weight the same ranking mistakes very differently. Optimizing one does not guarantee optimizing the other.
When neither curve dominates (the curves cross), neither classifier is universally better. The choice depends on the operating point: identify the recall (or precision) your application must achieve, and prefer the model that performs better there, as the comparison code below illustrates.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

def compare_pr_curves(scores_list, labels, model_names, target_recalls=(0.7, 0.8, 0.9)):
    """
    Compare multiple classifiers via PR curves and report
    precision at target recall levels.
    """
    plt.figure(figsize=(10, 6))
    results = {}

    for scores, name in zip(scores_list, model_names):
        precision, recall, thresholds = precision_recall_curve(labels, scores)

        # Plot
        plt.plot(recall, precision, label=name, linewidth=2)

        # Precision achievable at each target recall level
        results[name] = {}
        for target in target_recalls:
            # Interpolated precision: max precision at any recall >= target
            valid_idx = np.where(recall >= target)[0]
            prec_at_target = np.max(precision[valid_idx]) if len(valid_idx) > 0 else 0.0
            results[name][f'precision@recall={target}'] = prec_at_target

    # Random baseline: horizontal line at the positive base rate
    base_rate = np.mean(labels)
    plt.axhline(y=base_rate, color='gray', linestyle='--',
                label=f'Random (baseline={base_rate:.2f})')

    plt.xlabel('Recall', fontsize=12)
    plt.ylabel('Precision', fontsize=12)
    plt.title('Precision-Recall Curve Comparison', fontsize=14)
    plt.legend(loc='best')
    plt.grid(alpha=0.3)
    plt.xlim([0, 1.02])
    plt.ylim([0, 1.02])
    plt.tight_layout()
    plt.show()

    # Print comparison table
    print("\nPrecision at Target Recall Levels:")
    print("-" * 50)
    header = "Model".ljust(20) + " | ".join([f"R={r}" for r in target_recalls])
    print(header)
    print("-" * 50)
    for name in model_names:
        row = name.ljust(20)
        for target in target_recalls:
            row += f"{results[name][f'precision@recall={target}']:6.3f} "
        print(row)

    return results
```

Instead of comparing entire curves, compare at specific operating points relevant to your use case. If you need to achieve recall ≥ 0.8, compare precision at recall = 0.8. This gives actionable insights rather than abstract curve comparisons.
Effective PR curve visualization requires attention to several practical details.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import (precision_recall_curve, average_precision_score,
                             roc_curve, roc_auc_score)

def create_comprehensive_pr_visualization():
    """
    Create a publication-quality PR/ROC curve comparison with all relevant context.
    """
    # Create an imbalanced dataset (5% positive)
    X, y = make_classification(
        n_samples=5000,
        n_features=20,
        n_informative=10,
        weights=[0.95, 0.05],
        random_state=42
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42
    )

    # Train models
    models = {
        'Logistic Regression': LogisticRegression(max_iter=1000),
        'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
        'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42)
    }

    fig, axes = plt.subplots(1, 2, figsize=(14, 6))
    colors = ['#2563eb', '#dc2626', '#16a34a']

    for ax, curve_type in zip(axes, ['PR', 'ROC']):
        for (name, model), color in zip(models.items(), colors):
            model.fit(X_train, y_train)
            probs = model.predict_proba(X_test)[:, 1]

            if curve_type == 'PR':
                precision, recall, _ = precision_recall_curve(y_test, probs)
                ap = average_precision_score(y_test, probs)
                ax.plot(recall, precision, color=color, linewidth=2,
                        label=f'{name} (AP={ap:.3f})')
            else:  # ROC
                fpr, tpr, _ = roc_curve(y_test, probs)
                auc = roc_auc_score(y_test, probs)
                ax.plot(fpr, tpr, color=color, linewidth=2,
                        label=f'{name} (AUC={auc:.3f})')

        if curve_type == 'PR':
            # Random baseline for PR: horizontal line at the base rate
            base_rate = np.mean(y_test)
            ax.axhline(y=base_rate, color='gray', linestyle='--',
                       label=f'Random (P={base_rate:.3f})')
            ax.set_xlabel('Recall', fontsize=12)
            ax.set_ylabel('Precision', fontsize=12)
            ax.set_title('Precision-Recall Curves\n(5% positive class)', fontsize=14)
        else:
            # Random baseline for ROC: the diagonal
            ax.plot([0, 1], [0, 1], 'k--', label='Random')
            ax.set_xlabel('False Positive Rate', fontsize=12)
            ax.set_ylabel('True Positive Rate', fontsize=12)
            ax.set_title('ROC Curves\n(5% positive class)', fontsize=14)

        ax.set_xlim([-0.02, 1.02])
        ax.set_ylim([-0.02, 1.02])
        ax.legend(loc='lower left' if curve_type == 'PR' else 'lower right')
        ax.grid(alpha=0.3)

    plt.tight_layout()
    plt.savefig('pr_roc_comparison_imbalanced.png', dpi=150, bbox_inches='tight')
    plt.show()

    # Print summary
    print("\nSummary (5% positive class):")
    print("-" * 50)
    print(f"{'Model':<25} {'AUC':>8} {'Avg Prec':>10}")
    print("-" * 50)
    for name, model in models.items():
        probs = model.predict_proba(X_test)[:, 1]
        auc = roc_auc_score(y_test, probs)
        ap = average_precision_score(y_test, probs)
        print(f"{name:<25} {auc:>8.3f} {ap:>10.3f}")

    print("\nNote: AUC values are similar (~0.9), but Average Precision")
    print("reveals true performance differences on imbalanced data.")

create_comprehensive_pr_visualization()
```

Choosing between PR and ROC curves isn't about which is 'better'; it's about which better matches your evaluation needs.
| Factor | Prefer ROC Curves | Prefer PR Curves |
|---|---|---|
| Class balance | Balanced or moderate imbalance | Severe imbalance (rare positives) |
| What matters | Overall discrimination | Positive predictions quality |
| Use case | Compare ranking power | Evaluate predicted positive reliability |
| Negatives count | True negatives are important | True negatives are 'background' |
| Baseline | Fixed (diagonal) | Varies with class proportion |
| Cross-dataset comparison | Yes (same baseline) | No (different baselines) |
| Statistical tools | Well-developed (DeLong, etc.) | Less standardized |
For imbalanced problems, report BOTH ROC-AUC and Average Precision. This provides a complete picture: AUC shows ranking quality independent of class distribution; AP shows precision you can expect when acting on positive predictions. Stakeholders benefit from seeing both perspectives.
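A minimal reporting sketch along these lines; it assumes you already have held-out labels y_test and predicted scores probs, as in the examples above:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def report_ranking_metrics(y_test, probs):
    """Report both views: ranking quality (ROC-AUC) and positive-prediction quality (AP)."""
    print(f"ROC-AUC:            {roc_auc_score(y_test, probs):.3f}")
    print(f"Average Precision:  {average_precision_score(y_test, probs):.3f}")
    print(f"Positive base rate: {np.mean(y_test):.3f}  (PR-curve baseline)")
```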
We've developed comprehensive knowledge of Precision-Recall curves, from construction to interpretation to practical application. Let's consolidate the key insights:
- Precision conditions on positive predictions; recall conditions on actual positives.
- PR curves are built by the same threshold sweep as ROC curves, but precision can oscillate, producing the characteristic sawtooth shape.
- The random baseline in PR space is the positive base rate, so PR curves are not directly comparable across datasets with different class distributions.
- Linear interpolation between PR points is misleading; use the maximum-precision (interpolated) or step-function conventions instead.
- ROC dominance and PR dominance coincide, but ROC-AUC and PR-AUC can rank models differently; on imbalanced data, report both.
What's next:
The next page covers Average Precision (AP)—the area under the PR curve and the standard scalar summary for PR analysis. We'll explore its exact computation, relationship to other metrics, and interpretation for model comparison.
You now understand PR curves deeply: their construction, graphical interpretation, relationship to ROC curves, and when each is appropriate. You can confidently evaluate classifiers on imbalanced data and communicate performance to stakeholders accurately.