We established that accuracy conflates two fundamentally different types of errors—False Positives and False Negatives—into a single number, obscuring critical distinctions. Precision and Recall resolve this problem by measuring each error type separately.
These two metrics answer distinct questions about classifier behavior:

- **Precision:** Of the instances the model flagged as positive, how many are actually positive?
- **Recall:** Of all the instances that are actually positive, how many did the model flag?
This separation is not merely a mathematical convenience—it reflects a fundamental asymmetry in how errors impact real-world systems. Mastering precision and recall, their trade-offs, and their domain-specific interpretation is essential for building classifiers that actually work in production.
By the end of this page, you will understand the formal definitions of precision and recall, their geometric interpretation, the fundamental trade-off between them, how the classification threshold affects both metrics, and how to choose the right metric emphasis for your specific application domain.
Precision (Positive Predictive Value):
Precision measures the accuracy of positive predictions—what proportion of predicted positives are actually positive:
$$\text{Precision} = \frac{TP}{TP + FP} = \frac{\text{True Positives}}{\text{Total Predicted Positives}}$$
Recall (Sensitivity, True Positive Rate, Hit Rate):
Recall measures the completeness of positive identification—what proportion of actual positives were correctly identified:
$$\text{Recall} = \frac{TP}{TP + FN} = \frac{\text{True Positives}}{\text{Total Actual Positives}}$$
Both precision and recall share TP in the numerator—they differ in which type of error appears in the denominator. Precision penalizes False Positives (FP), while Recall penalizes False Negatives (FN). This is why improving one often comes at the cost of the other.
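A tiny worked example makes the shared numerator concrete (the counts are invented for illustration):

```python
# Hypothetical counts for illustration
TP, FP, FN = 8, 2, 4

precision = TP / (TP + FP)  # penalizes false positives: 8 / 10
recall = TP / (TP + FN)     # penalizes false negatives: 8 / 12

print(f"Precision: {precision:.2f}")  # 0.80
print(f"Recall:    {recall:.2f}")     # 0.67
```

With the same 8 true positives, the two metrics diverge because each divides by a different error count.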
```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, confusion_matrix

def precision_recall_from_confusion_matrix(y_true, y_pred):
    """
    Calculate precision and recall directly from the
    confusion matrix to illustrate their derivation.
    """
    cm = confusion_matrix(y_true, y_pred)
    TN, FP, FN, TP = cm.ravel()

    # Precision: Of all predicted positives, how many are correct?
    precision = TP / (TP + FP) if (TP + FP) > 0 else 0

    # Recall: Of all actual positives, how many did we find?
    recall = TP / (TP + FN) if (TP + FN) > 0 else 0

    return {
        'confusion_matrix': cm,
        'TP': TP, 'FP': FP, 'FN': FN, 'TN': TN,
        'precision': precision,
        'recall': recall,
        'predicted_positives': TP + FP,
        'actual_positives': TP + FN,
    }

# Example: Disease screening
# Positive = Disease Present, Negative = Healthy
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1,   # 10 actual patients with disease
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0,   # 10 healthy individuals
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # 10 more healthy individuals

y_pred = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0,   # Found 7, missed 3
          0, 0, 0, 0, 0, 0, 0, 0, 1, 1,   # 8 correct, 2 false alarms
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # all 10 correct

result = precision_recall_from_confusion_matrix(y_true, y_pred)

print("Disease Screening Example")
print("=" * 50)
print("Confusion Matrix:")
print("              Predicted")
print("              Neg   Pos")
print(f"Actual Neg    {result['TN']:3d}   {result['FP']:3d}")
print(f"Actual Pos    {result['FN']:3d}   {result['TP']:3d}")

print(f"Precision: {result['precision']:.2%}")
print(f"  Interpretation: {result['precision']:.0%} of patients we flagged as sick actually have the disease")
print(f"  Calculation: {result['TP']} / ({result['TP']} + {result['FP']}) = {result['TP']} / {result['predicted_positives']}")

print(f"Recall: {result['recall']:.2%}")
print(f"  Interpretation: We found {result['recall']:.0%} of all patients who actually have the disease")
print(f"  Calculation: {result['TP']} / ({result['TP']} + {result['FN']}) = {result['TP']} / {result['actual_positives']}")
```

To build deep intuition, consider several helpful mental models:
The Search Engine Analogy:
Imagine a search engine returning results for a query:
Precision = What fraction of returned results are actually relevant?
Recall = What fraction of all relevant documents did the engine return?
The Fishing Net Analogy:
A fine-mesh net catches more fish (high recall) but also more debris (low precision). A coarse-mesh net keeps out debris (high precision) but lets fish escape (low recall).
Geometric Interpretation:
Visualize precision and recall as proportions of different sets:
All Predictions
┌─────────────────────────────────────┐
│ │
│ Predicted Negative │
│ (TN + FN) │
┌───────────────────┼─────────────────────────────────────┤
│ │ │
│ Actual │ │
│ Positive │ True Positives │
│ (TP + FN) │ (TP) │
│ │ │ Predicted Positive
│ ─────────►│◄─────── Recall = TP/(TP+FN) ───────│ (TP + FP)
│ │ │
└───────────────────┼─────────────────────────────────────┤
│ │
│ False Positives │
│ (FP) │
│ │
└─────────────────────────────────────┘
▲
│
Precision = TP/(TP+FP)
Precision and recall exist in fundamental tension—optimizing for one typically degrades the other. This is the precision-recall trade-off, arguably the most important concept in classification evaluation.
Why the Trade-off Exists:
Most classifiers produce a continuous score or probability $P(y=1|x)$ that is then thresholded to produce a binary prediction:
$$\hat{y} = \begin{cases} 1 & \text{if } P(y=1|x) \geq \tau \\ 0 & \text{otherwise} \end{cases}$$
The threshold $\tau$ controls the trade-off:
High threshold (e.g., $\tau = 0.9$): Predict positive only when very confident, which typically raises precision but lowers recall

Low threshold (e.g., $\tau = 0.1$): Predict positive even with modest confidence, which typically raises recall but lowers precision
The Extreme Cases:
$\tau \to 1$: Predict almost nothing as positive; recall falls toward 0 while precision stays high on the few predictions made

$\tau \to 0$: Predict almost everything as positive; recall reaches 1 while precision falls to the positive class prevalence
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

def demonstrate_threshold_tradeoff():
    """
    Demonstrate how changing the decision threshold
    affects precision and recall.
    """
    np.random.seed(42)

    # Generate synthetic probability scores
    # Actual positives have higher scores on average
    n_pos, n_neg = 100, 400

    # Positive class: scores centered around 0.7
    scores_pos = np.clip(np.random.normal(0.7, 0.2, n_pos), 0, 1)
    # Negative class: scores centered around 0.3
    scores_neg = np.clip(np.random.normal(0.3, 0.2, n_neg), 0, 1)

    y_true = np.array([1]*n_pos + [0]*n_neg)
    y_scores = np.concatenate([scores_pos, scores_neg])

    # Calculate precision and recall at various thresholds
    thresholds = np.arange(0.1, 0.95, 0.05)

    print("Threshold Effects on Precision and Recall")
    print("=" * 55)
    print(f"{'Threshold':>10} {'Precision':>12} {'Recall':>10} {'Predicted+':>12}")
    print("-" * 55)

    for thresh in thresholds:
        y_pred = (y_scores >= thresh).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))

        precision = tp / (tp + fp) if (tp + fp) > 0 else 0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0
        pred_positive = tp + fp

        print(f"{thresh:>10.2f} {precision:>12.1%} {recall:>10.1%} {pred_positive:>12}")

    # Visualization
    precision, recall, thresh = precision_recall_curve(y_true, y_scores)

    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Left: Precision and Recall vs Threshold
    axes[0].plot(thresh, precision[:-1], 'b-', linewidth=2, label='Precision')
    axes[0].plot(thresh, recall[:-1], 'r-', linewidth=2, label='Recall')
    axes[0].set_xlabel('Decision Threshold', fontsize=12)
    axes[0].set_ylabel('Score', fontsize=12)
    axes[0].set_title('Precision-Recall Trade-off vs Threshold', fontsize=14)
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)

    # Right: Precision-Recall Curve
    axes[1].plot(recall, precision, 'g-', linewidth=2)
    axes[1].set_xlabel('Recall', fontsize=12)
    axes[1].set_ylabel('Precision', fontsize=12)
    axes[1].set_title('Precision-Recall Curve', fontsize=14)
    axes[1].grid(True, alpha=0.3)
    axes[1].set_xlim([0, 1])
    axes[1].set_ylim([0, 1])

    # Annotate selected thresholds on the curve
    for thresh_mark in [0.3, 0.5, 0.7]:
        idx = np.argmin(np.abs(thresh - thresh_mark))
        axes[1].scatter([recall[idx]], [precision[idx]], s=100, zorder=5)
        axes[1].annotate(f'τ={thresh_mark}', (recall[idx], precision[idx]),
                         textcoords='offset points', xytext=(10, 5))

    plt.tight_layout()
    plt.savefig('precision_recall_tradeoff.png', dpi=150)
    plt.show()

demonstrate_threshold_tradeoff()
```

You cannot simultaneously maximize both precision and recall for a given model. The trade-off is inherent to the classification problem—improving one requires accepting degradation in the other. The art is in finding the right balance for your specific application.
The choice between emphasizing precision or recall depends critically on the application domain and the relative costs of False Positives vs. False Negatives.
Framework for Decision:
| Domain | Positive Class | Emphasize | Rationale |
|---|---|---|---|
| Medical Screening | Disease Present | Recall | Missing a disease (FN) can be fatal; follow-up tests filter FPs |
| Spam Email Filter | Spam | Precision | Losing important email (FP) is worse than seeing spam (FN) |
| Fraud Detection (Banking) | Fraud | Recall | Missed fraud (FN) causes direct financial loss; FPs inconvenience customers |
| Search Engines | Relevant Document | Balanced | Users want relevant results (precision) but also completeness (recall) |
| Criminal Justice (Conviction) | Guilty | Precision | Convicting innocent (FP) is worse than letting guilty go free (FN) |
| Autonomous Driving (Obstacle) | Obstacle Present | Recall | Missing an obstacle (FN) causes collisions; FPs cause unnecessary stops |
| Product Recommendations | Product Liked | Precision | Irrelevant recommendations (FP) annoy users; missing some items is okay |
| Content Moderation | Harmful Content | Context-dependent | Platforms balance user safety (recall) against over-censorship (precision) |
The interpretation of precision and recall depends entirely on which class is designated 'positive.' Swapping the positive class inverts the metric interpretations. Always be explicit about positive class definition when reporting these metrics.
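A quick sketch of this pitfall using scikit-learn's `pos_label` parameter (the labels below are invented for illustration):

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

# Class 1 as positive: TP=2, FP=1, FN=1
p1 = precision_score(y_true, y_pred, pos_label=1)
r1 = recall_score(y_true, y_pred, pos_label=1)

# Class 0 as positive: the same predictions, a different story (TP=4, FP=1, FN=1)
p0 = precision_score(y_true, y_pred, pos_label=0)
r0 = recall_score(y_true, y_pred, pos_label=0)

print(f"pos_label=1: precision={p1:.2f}, recall={r1:.2f}")  # 0.67, 0.67
print(f"pos_label=0: precision={p0:.2f}, recall={r0:.2f}")  # 0.80, 0.80
```

Identical predictions yield different numbers depending on which class is treated as positive, so always state the positive class alongside the metrics.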
Precision and recall have boundary conditions that can produce undefined or misleading values:
Undefined Precision (TP + FP = 0):
When the classifier predicts no positives at all:
$$\text{Precision} = \frac{0}{0} = \text{undefined}$$
This occurs with very high thresholds or extremely cautious models. Libraries handle the zero division differently: scikit-learn, for example, returns 0 and emits an `UndefinedMetricWarning` by default, and its `zero_division` parameter lets you substitute a fixed value instead.
Undefined Recall (TP + FN = 0):
When there are no actual positives in the dataset:
$$\text{Recall} = \frac{TP}{0} = \text{undefined}$$
This indicates a problem with your evaluation dataset—you cannot measure positive class detection without positive examples.
Perfect Precision with Zero Recall:
A classifier that predicts positive only once, and happens to be correct, scores Precision = 1/1 = 100% while its recall, 1 divided by the total number of actual positives, is near zero:
Perfect precision is trivially achievable and meaningless without considering recall.
Perfect Recall with Low Precision:
A classifier that predicts positive for everything attains Recall = 100%, while its precision collapses to the positive class prevalence:
Perfect recall is trivially achievable and meaningless without considering precision.
```python
import numpy as np
from sklearn.metrics import precision_score, recall_score
import warnings

def explore_edge_cases():
    """
    Demonstrate edge cases in precision and recall calculation.
    """
    print("Edge Cases in Precision and Recall")
    print("=" * 50)

    # Case 1: No positive predictions
    print("Case 1: Model predicts all negative (TP + FP = 0)")
    y_true_1 = [1, 1, 1, 0, 0, 0]
    y_pred_1 = [0, 0, 0, 0, 0, 0]
    with warnings.catch_warnings(record=True):
        warnings.simplefilter("always")
        prec = precision_score(y_true_1, y_pred_1, zero_division=0)
        rec = recall_score(y_true_1, y_pred_1, zero_division=0)
    print(f"  y_true: {y_true_1}")
    print(f"  y_pred: {y_pred_1}")
    print(f"  Precision: {prec} (undefined, set to 0)")
    print(f"  Recall: {rec}")

    # Case 2: No actual positives in test set
    print("Case 2: No actual positives in dataset (TP + FN = 0)")
    y_true_2 = [0, 0, 0, 0, 0, 0]
    y_pred_2 = [0, 0, 0, 1, 1, 0]
    prec = precision_score(y_true_2, y_pred_2, zero_division=0)
    rec = recall_score(y_true_2, y_pred_2, zero_division=0)
    print(f"  y_true: {y_true_2}")
    print(f"  y_pred: {y_pred_2}")
    print(f"  Precision: {prec}")
    print(f"  Recall: {rec} (undefined, set to 0)")

    # Case 3: Perfect precision, low recall
    print("Case 3: Conservative model - perfect precision, low recall")
    y_true_3 = [1]*100 + [0]*900           # 100 positives, 900 negatives
    y_pred_3 = [1]*5 + [0]*95 + [0]*900    # Only 5 positive predictions (all correct)
    prec = precision_score(y_true_3, y_pred_3)
    rec = recall_score(y_true_3, y_pred_3)
    print("  100 actual positives, 5 predicted positives (all correct)")
    print(f"  Precision: {prec:.1%} (perfect!)")
    print(f"  Recall: {rec:.1%} (terrible - missed 95% of positives)")

    # Case 4: Perfect recall, low precision
    print("Case 4: Aggressive model - perfect recall, low precision")
    y_true_4 = [1]*100 + [0]*900
    y_pred_4 = [1]*1000                    # Predict everything as positive
    prec = precision_score(y_true_4, y_pred_4)
    rec = recall_score(y_true_4, y_pred_4)
    print("  100 actual positives, predicted all 1000 as positive")
    print(f"  Precision: {prec:.1%} (equals class proportion)")
    print(f"  Recall: {rec:.1%} (perfect - found all positives)")

    print("Key Insight: Neither metric is meaningful in isolation!")

explore_edge_cases()
```

For multi-class classification ($K > 2$), precision and recall are computed per class using a one-vs-rest (OvR) approach. For class $c$:
$$\text{Precision}_c = \frac{TP_c}{TP_c + FP_c} = \frac{C_{cc}}{\sum_{i=1}^{K} C_{ic}}$$

$$\text{Recall}_c = \frac{TP_c}{TP_c + FN_c} = \frac{C_{cc}}{\sum_{j=1}^{K} C_{cj}}$$
where $C$ is the $K \times K$ confusion matrix, $C_{ij}$ counts samples of actual class $i$ predicted as class $j$, $C_{cc}$ is the diagonal entry for class $c$, column sums give predicted totals, and row sums give actual totals.
Aggregation Methods:
To obtain a single precision/recall value for the entire classifier, we must aggregate per-class values. Three standard approaches exist:
| Method | Formula | Properties |
|---|---|---|
| Macro Average | $\frac{1}{K} \sum_{c=1}^{K} \text{Metric}_c$ | Treats all classes equally; sensitive to rare class performance |
| Weighted Average | $\sum_{c=1}^{K} \frac{n_c}{n} \cdot \text{Metric}_c$ | Weights by class frequency; reflects overall prediction quality |
| Micro Average | $\frac{\sum_c TP_c}{\sum_c (TP_c + FP_c)}$ | Globally aggregates counts; equivalent to accuracy for multi-class |
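The table's claim that micro averaging is equivalent to accuracy can be checked directly (labels below are invented for illustration):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, accuracy_score

# Small 3-class example, invented for illustration
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 2, 2, 2, 2, 0, 1])

micro_p = precision_score(y_true, y_pred, average='micro')
micro_r = recall_score(y_true, y_pred, average='micro')
acc = accuracy_score(y_true, y_pred)

# In single-label multi-class problems, every FP for one class is an FN
# for another, so sum(TP)/sum(TP+FP) = sum(TP)/sum(TP+FN) = correct/total
print(np.isclose(micro_p, acc) and np.isclose(micro_r, acc))  # True
```

This is why micro-averaged precision, recall, and F1 all coincide for single-label multi-class data, and why macro averaging is usually the more informative choice under class imbalance.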
```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, confusion_matrix
from sklearn.metrics import classification_report

def multiclass_precision_recall_analysis():
    """
    Demonstrate per-class and aggregated precision/recall for multi-class.
    """
    # 4-class classification example with imbalance
    classes = ['Cat', 'Dog', 'Bird', 'Fish']
    y_true = [0]*50 + [1]*30 + [2]*15 + [3]*5       # Imbalanced: 50, 30, 15, 5
    y_pred = ([0]*40 + [1]*5 + [2]*3 + [3]*2 +      # Cat predictions
              [1]*25 + [0]*3 + [2]*2 +              # Dog predictions
              [2]*12 + [0]*2 + [1]*1 +              # Bird predictions
              [3]*3 + [0]*1 + [1]*1)                # Fish predictions

    cm = confusion_matrix(y_true, y_pred)

    print("Multi-Class Precision and Recall Analysis")
    print("=" * 60)
    print("Confusion Matrix:")
    print("             Predicted")
    print("           Cat  Dog Bird Fish")
    for i, cls in enumerate(classes):
        print(f"{cls:>8}  {cm[i, 0]:3d}  {cm[i, 1]:3d}  {cm[i, 2]:3d}  {cm[i, 3]:3d}")

    # Per-class metrics
    print("Per-Class Metrics:")
    print("-" * 50)
    for i, cls in enumerate(classes):
        tp = cm[i, i]
        fp = cm[:, i].sum() - tp  # Column sum minus diagonal
        fn = cm[i, :].sum() - tp  # Row sum minus diagonal
        prec = tp / (tp + fp) if (tp + fp) > 0 else 0
        rec = tp / (tp + fn) if (tp + fn) > 0 else 0
        print(f"{cls}: Precision={prec:.3f}, Recall={rec:.3f}")
        print(f"     TP={tp}, FP={fp}, FN={fn}")

    # Aggregated metrics
    print("Aggregated Metrics Comparison:")
    print("-" * 50)
    for avg in ['macro', 'weighted', 'micro']:
        prec = precision_score(y_true, y_pred, average=avg)
        rec = recall_score(y_true, y_pred, average=avg)
        print(f"{avg.capitalize():>10} Average: Precision={prec:.3f}, Recall={rec:.3f}")

    # Full classification report
    print("Full Classification Report:")
    print(classification_report(y_true, y_pred, target_names=classes))

multiclass_precision_recall_analysis()
```

Precision and recall are known by many names across different fields. Understanding this terminology prevents confusion when reading literature from medicine, information retrieval, or statistics.
Precision Aliases:

- Positive Predictive Value (PPV), the standard term in medicine and diagnostics

Recall Aliases:

- Sensitivity, common in medicine and epidemiology
- True Positive Rate (TPR), used in ROC analysis
- Hit Rate, used in signal detection theory

Related Metrics:
| Metric | Formula | Relationship |
|---|---|---|
| Specificity (TNR) | $TN / (TN + FP)$ | Recall for the negative class |
| Negative Predictive Value (NPV) | $TN / (TN + FN)$ | Precision for the negative class |
| False Discovery Rate (FDR) | $FP / (FP + TP) = 1 - Precision$ | Complement of precision |
| False Omission Rate (FOR) | $FN / (FN + TN) = 1 - NPV$ | Complement of NPV |
| Positive Likelihood Ratio (LR+) | $TPR / FPR$ | How much more likely a positive result is for actual positives than for actual negatives |
| Negative Likelihood Ratio (LR-) | $FNR / TNR$ | How much more likely a negative result is for actual positives than for actual negatives |
Precision and recall for the positive class, combined with specificity and NPV (precision/recall for the negative class), provide a complete picture of classifier performance. Together, they account for all four cells of the confusion matrix.
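A short sketch computing all four quadrant metrics from one confusion matrix (the counts are invented for illustration):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical screening data: 20 positives, 80 negatives
y_true = [1]*20 + [0]*80
y_pred = [1]*15 + [0]*5 + [1]*10 + [0]*70

TN, FP, FN, TP = confusion_matrix(y_true, y_pred).ravel()

precision = TP / (TP + FP)    # PPV: 15/25
recall = TP / (TP + FN)       # TPR: 15/20
specificity = TN / (TN + FP)  # recall for the negative class: 70/80
npv = TN / (TN + FN)          # precision for the negative class: 70/75
fdr = FP / (FP + TP)          # 1 - precision
for_ = FN / (FN + TN)         # 1 - NPV

print(f"Precision (PPV):   {precision:.3f}")
print(f"Recall (TPR):      {recall:.3f}")
print(f"Specificity (TNR): {specificity:.3f}")
print(f"NPV:               {npv:.3f}")
print(f"FDR = 1-Precision: {fdr:.3f}")
print(f"FOR = 1-NPV:       {for_:.3f}")
```

Each metric uses exactly two of the four confusion-matrix cells, and together the four pairs cover every cell from both the prediction and the ground-truth perspective.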
Effective use of precision and recall requires operational considerations beyond their mathematical definitions: state the positive class explicitly, report the decision threshold at which the metrics were computed, and evaluate on data whose class balance reflects deployment conditions.
Precision and recall provide a richer view of classifier performance than accuracy by separating two distinct aspects of quality.
What's Next:
Reporting precision and recall as two separate numbers is informative but sometimes inconvenient. The next page introduces the F1 score and F-beta family—metrics that combine precision and recall into a single number while allowing control over their relative importance.
You now understand precision and recall at a deep level—their definitions, intuition, trade-offs, domain-specific application, and practical considerations. These metrics form the foundation for understanding all subsequent classification metrics.