We established that accuracy conflates two fundamentally different types of errors—False Positives and False Negatives—into a single number, obscuring critical distinctions. Precision and Recall resolve this problem by measuring each error type separately.
These two metrics answer distinct questions about classifier behavior:

- **Precision:** Of the instances the model flagged as positive, how many are actually positive?
- **Recall:** Of all the instances that are actually positive, how many did the model flag?
This separation is not merely a mathematical convenience—it reflects a fundamental asymmetry in how errors impact real-world systems. Mastering precision and recall, their trade-offs, and their domain-specific interpretation is essential for building classifiers that actually work in production.
By the end of this page, you will understand the formal definitions of precision and recall, their geometric interpretation, the fundamental trade-off between them, how the classification threshold affects both metrics, and how to choose the right metric emphasis for your specific application domain.
Precision (Positive Predictive Value):
Precision measures the accuracy of positive predictions—what proportion of predicted positives are actually positive:
$$\text{Precision} = \frac{TP}{TP + FP} = \frac{\text{True Positives}}{\text{Total Predicted Positives}}$$
Recall (Sensitivity, True Positive Rate, Hit Rate):
Recall measures the completeness of positive identification—what proportion of actual positives were correctly identified:
$$\text{Recall} = \frac{TP}{TP + FN} = \frac{\text{True Positives}}{\text{Total Actual Positives}}$$
Both precision and recall share TP in the numerator—they differ in which type of error appears in the denominator. Precision penalizes False Positives (FP), while Recall penalizes False Negatives (FN). This is why improving one often comes at the cost of the other.
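A tiny worked example makes the shared numerator concrete (the counts are invented for illustration):

```python
# Hypothetical counts for illustration
TP, FP, FN = 8, 2, 4

precision = TP / (TP + FP)  # penalizes false positives: 8 / 10
recall = TP / (TP + FN)     # penalizes false negatives: 8 / 12

print(f"Precision: {precision:.2f}")  # 0.80
print(f"Recall:    {recall:.2f}")     # 0.67
```

With the same 8 true positives, the two metrics diverge because each divides by a different error count.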
```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, confusion_matrix

def precision_recall_from_confusion_matrix(y_true, y_pred):
    """
    Calculate precision and recall directly from the
    confusion matrix to illustrate their derivation.
    """
    cm = confusion_matrix(y_true, y_pred)
    TN, FP, FN, TP = cm.ravel()

    # Precision: Of all predicted positives, how many are correct?
    precision = TP / (TP + FP) if (TP + FP) > 0 else 0

    # Recall: Of all actual positives, how many did we find?
    recall = TP / (TP + FN) if (TP + FN) > 0 else 0

    return {
        'confusion_matrix': cm,
        'TP': TP, 'FP': FP, 'FN': FN, 'TN': TN,
        'precision': precision,
        'recall': recall,
        'predicted_positives': TP + FP,
        'actual_positives': TP + FN,
    }

# Example: Disease screening
# Positive = Disease Present, Negative = Healthy
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1,   # 10 actual patients with disease
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0,   # 10 healthy individuals
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # 10 more healthy individuals

y_pred = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0,   # Found 7, missed 3
          0, 0, 0, 0, 0, 0, 0, 0, 1, 1,   # 8 correct, 2 false alarms
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # all 10 correct

result = precision_recall_from_confusion_matrix(y_true, y_pred)

print("Disease Screening Example")
print("=" * 50)
print("Confusion Matrix:")
print("              Predicted")
print("              Neg   Pos")
print(f"Actual Neg    {result['TN']:3d}   {result['FP']:3d}")
print(f"Actual Pos    {result['FN']:3d}   {result['TP']:3d}")

print(f"Precision: {result['precision']:.2%}")
print(f"  Interpretation: {result['precision']:.0%} of patients we flagged as sick actually have the disease")
print(f"  Calculation: {result['TP']} / ({result['TP']} + {result['FP']}) = {result['TP']} / {result['predicted_positives']}")

print(f"Recall: {result['recall']:.2%}")
print(f"  Interpretation: We found {result['recall']:.0%} of all patients who actually have the disease")
print(f"  Calculation: {result['TP']} / ({result['TP']} + {result['FN']}) = {result['TP']} / {result['actual_positives']}")
```

To build deep intuition, consider several helpful mental models:
The Search Engine Analogy:
Imagine a search engine returning results for a query:
Precision = What fraction of returned results are actually relevant?
Recall = What fraction of all relevant documents did the engine return?
The Fishing Net Analogy:
A fine-mesh net catches more fish (high recall) but also more debris (low precision). A coarse-mesh net keeps out debris (high precision) but lets fish escape (low recall).
Geometric Interpretation:
Visualize precision and recall as proportions of different sets:
All Predictions
┌─────────────────────────────────────┐
│ │
│ Predicted Negative │
│ (TN + FN) │
┌───────────────────┼─────────────────────────────────────┤
│ │ │
│ Actual │ │
│ Positive │ True Positives │
│ (TP + FN) │ (TP) │
│ │ │ Predicted Positive
│ ─────────►│◄─────── Recall = TP/(TP+FN) ───────│ (TP + FP)
│ │ │
└───────────────────┼─────────────────────────────────────┤
│ │
│ False Positives │
│ (FP) │
│ │
└─────────────────────────────────────┘
▲
│
Precision = TP/(TP+FP)
Precision and recall exist in fundamental tension—optimizing for one typically degrades the other. This is the precision-recall trade-off, arguably the most important concept in classification evaluation.
Why the Trade-off Exists:
Most classifiers produce a continuous score or probability $P(y=1|x)$ that is then thresholded to produce a binary prediction:
$$\hat{y} = \begin{cases} 1 & \text{if } P(y=1|x) \geq \tau \\ 0 & \text{otherwise} \end{cases}$$
The threshold $\tau$ controls the trade-off:
High threshold (e.g., $\tau = 0.9$): Predict positive only when very confident, which typically raises precision but lowers recall

Low threshold (e.g., $\tau = 0.1$): Predict positive even with modest confidence, which typically raises recall but lowers precision
The Extreme Cases:
$\tau \to 1$: Predict almost nothing as positive; recall falls toward 0 while precision stays high on the few predictions made

$\tau \to 0$: Predict almost everything as positive; recall reaches 1 while precision falls to the positive class prevalence
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

def demonstrate_threshold_tradeoff():
    """
    Demonstrate how changing the decision threshold
    affects precision and recall.
    """
    np.random.seed(42)

    # Generate synthetic probability scores
    # Actual positives have higher scores on average
    n_pos, n_neg = 100, 400

    # Positive class: scores centered around 0.7
    scores_pos = np.clip(np.random.normal(0.7, 0.2, n_pos), 0, 1)
    # Negative class: scores centered around 0.3
    scores_neg = np.clip(np.random.normal(0.3, 0.2, n_neg), 0, 1)

    y_true = np.array([1]*n_pos + [0]*n_neg)
    y_scores = np.concatenate([scores_pos, scores_neg])

    # Calculate precision and recall at various thresholds
    thresholds = np.arange(0.1, 0.95, 0.05)

    print("Threshold Effects on Precision and Recall")
    print("=" * 55)
    print(f"{'Threshold':>10} {'Precision':>12} {'Recall':>10} {'Predicted+':>12}")
    print("-" * 55)

    for thresh in thresholds:
        y_pred = (y_scores >= thresh).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))

        precision = tp / (tp + fp) if (tp + fp) > 0 else 0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0
        pred_positive = tp + fp

        print(f"{thresh:>10.2f} {precision:>12.1%} {recall:>10.1%} {pred_positive:>12}")

    # Visualization
    precision, recall, thresh = precision_recall_curve(y_true, y_scores)

    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Left: Precision and Recall vs Threshold
    axes[0].plot(thresh, precision[:-1], 'b-', linewidth=2, label='Precision')
    axes[0].plot(thresh, recall[:-1], 'r-', linewidth=2, label='Recall')
    axes[0].set_xlabel('Decision Threshold', fontsize=12)
    axes[0].set_ylabel('Score', fontsize=12)
    axes[0].set_title('Precision-Recall Trade-off vs Threshold', fontsize=14)
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)

    # Right: Precision-Recall Curve
    axes[1].plot(recall, precision, 'g-', linewidth=2)
    axes[1].set_xlabel('Recall', fontsize=12)
    axes[1].set_ylabel('Precision', fontsize=12)
    axes[1].set_title('Precision-Recall Curve', fontsize=14)
    axes[1].grid(True, alpha=0.3)
    axes[1].set_xlim([0, 1])
    axes[1].set_ylim([0, 1])

    # Annotate selected thresholds on the curve
    for thresh_mark in [0.3, 0.5, 0.7]:
        idx = np.argmin(np.abs(thresh - thresh_mark))
        axes[1].scatter([recall[idx]], [precision[idx]], s=100, zorder=5)
        axes[1].annotate(f'τ={thresh_mark}', (recall[idx], precision[idx]),
                         textcoords='offset points', xytext=(10, 5))

    plt.tight_layout()
    plt.savefig('precision_recall_tradeoff.png', dpi=150)
    plt.show()

demonstrate_threshold_tradeoff()
```

You cannot simultaneously maximize both precision and recall for a given model. The trade-off is inherent to the classification problem—improving one requires accepting degradation in the other. The art is in finding the right balance for your specific application.
The choice between emphasizing precision or recall depends critically on the application domain and the relative costs of False Positives vs. False Negatives.
Framework for Decision:
| Domain | Positive Class | Emphasize | Rationale |
|---|---|---|---|
| Medical Screening | Disease Present | Recall | Missing a disease (FN) can be fatal; follow-up tests filter FPs |
| Spam Email Filter | Spam | Precision | Losing important email (FP) is worse than seeing spam (FN) |
| Fraud Detection (Banking) | Fraud | Recall | Missed fraud (FN) causes direct financial loss; FPs inconvenience customers |
| Search Engines | Relevant Document | Balanced | Users want relevant results (precision) but also completeness (recall) |
| Criminal Justice (Conviction) | Guilty | Precision | Convicting innocent (FP) is worse than letting guilty go free (FN) |
| Autonomous Driving (Obstacle) | Obstacle Present | Recall | Missing an obstacle (FN) causes collisions; FPs cause unnecessary stops |
| Product Recommendations | Product Liked | Precision | Irrelevant recommendations (FP) annoy users; missing some items is okay |
| Content Moderation | Harmful Content | Context-dependent | Platforms balance user safety (recall) against over-censorship (precision) |
The interpretation of precision and recall depends entirely on which class is designated 'positive.' Swapping the positive class inverts the metric interpretations. Always be explicit about positive class definition when reporting these metrics.
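A quick sketch of this pitfall using scikit-learn's `pos_label` parameter (the labels below are invented for illustration):

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

# Class 1 as positive: TP=2, FP=1, FN=1
p1 = precision_score(y_true, y_pred, pos_label=1)
r1 = recall_score(y_true, y_pred, pos_label=1)

# Class 0 as positive: the same predictions, a different story (TP=4, FP=1, FN=1)
p0 = precision_score(y_true, y_pred, pos_label=0)
r0 = recall_score(y_true, y_pred, pos_label=0)

print(f"pos_label=1: precision={p1:.2f}, recall={r1:.2f}")  # 0.67, 0.67
print(f"pos_label=0: precision={p0:.2f}, recall={r0:.2f}")  # 0.80, 0.80
```

Identical predictions yield different numbers depending on which class is treated as positive, so always state the positive class alongside the metrics.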
Precision and recall have boundary conditions that can produce undefined or misleading values:
Undefined Precision (TP + FP = 0):
When the classifier predicts no positives at all:
$$\text{Precision} = \frac{0}{0} = \text{undefined}$$
This occurs with very high thresholds or extremely cautious models. Libraries handle the zero division differently: scikit-learn, for example, returns 0 and emits an `UndefinedMetricWarning` by default, and its `zero_division` parameter lets you substitute a fixed value instead.
Undefined Recall (TP + FN = 0):
When there are no actual positives in the dataset:
$$\text{Recall} = \frac{TP}{0} = \text{undefined}$$
This indicates a problem with your evaluation dataset—you cannot measure positive class detection without positive examples.
Perfect Precision with Zero Recall:
A classifier that predicts positive only once, and happens to be correct, scores Precision = 1/1 = 100% while its recall, 1 divided by the total number of actual positives, is near zero:
Perfect precision is trivially achievable and meaningless without considering recall.
Perfect Recall with Low Precision:
A classifier that predicts positive for everything attains Recall = 100%, while its precision collapses to the positive class prevalence:
Perfect recall is trivially achievable and meaningless without considering precision.
```python
import numpy as np
from sklearn.metrics import precision_score, recall_score
import warnings

def explore_edge_cases():
    """
    Demonstrate edge cases in precision and recall calculation.
    """
    print("Edge Cases in Precision and Recall")
    print("=" * 50)

    # Case 1: No positive predictions
    print("Case 1: Model predicts all negative (TP + FP = 0)")
    y_true_1 = [1, 1, 1, 0, 0, 0]
    y_pred_1 = [0, 0, 0, 0, 0, 0]
    with warnings.catch_warnings(record=True):
        warnings.simplefilter("always")
        prec = precision_score(y_true_1, y_pred_1, zero_division=0)
        rec = recall_score(y_true_1, y_pred_1, zero_division=0)
    print(f"  y_true: {y_true_1}")
    print(f"  y_pred: {y_pred_1}")
    print(f"  Precision: {prec} (undefined, set to 0)")
    print(f"  Recall: {rec}")

    # Case 2: No actual positives in test set
    print("Case 2: No actual positives in dataset (TP + FN = 0)")
    y_true_2 = [0, 0, 0, 0, 0, 0]
    y_pred_2 = [0, 0, 0, 1, 1, 0]
    prec = precision_score(y_true_2, y_pred_2, zero_division=0)
    rec = recall_score(y_true_2, y_pred_2, zero_division=0)
    print(f"  y_true: {y_true_2}")
    print(f"  y_pred: {y_pred_2}")
    print(f"  Precision: {prec}")
    print(f"  Recall: {rec} (undefined, set to 0)")

    # Case 3: Perfect precision, low recall
    print("Case 3: Conservative model - perfect precision, low recall")
    y_true_3 = [1]*100 + [0]*900           # 100 positives, 900 negatives
    y_pred_3 = [1]*5 + [0]*95 + [0]*900    # Only 5 positive predictions (all correct)
    prec = precision_score(y_true_3, y_pred_3)
    rec = recall_score(y_true_3, y_pred_3)
    print("  100 actual positives, 5 predicted positives (all correct)")
    print(f"  Precision: {prec:.1%} (perfect!)")
    print(f"  Recall: {rec:.1%} (terrible - missed 95% of positives)")

    # Case 4: Perfect recall, low precision
    print("Case 4: Aggressive model - perfect recall, low precision")
    y_true_4 = [1]*100 + [0]*900
    y_pred_4 = [1]*1000                    # Predict everything as positive
    prec = precision_score(y_true_4, y_pred_4)
    rec = recall_score(y_true_4, y_pred_4)
    print("  100 actual positives, predicted all 1000 as positive")
    print(f"  Precision: {prec:.1%} (equals class proportion)")
    print(f"  Recall: {rec:.1%} (perfect - found all positives)")

    print("Key Insight: Neither metric is meaningful in isolation!")

explore_edge_cases()
```

For multi-class classification ($K > 2$), precision and recall are computed per class using a one-vs-rest (OvR) approach. For class $c$:
$$\text{Precision}_c = \frac{TP_c}{TP_c + FP_c} = \frac{C_{cc}}{\sum_{i=1}^{K} C_{ic}}$$

$$\text{Recall}_c = \frac{TP_c}{TP_c + FN_c} = \frac{C_{cc}}{\sum_{j=1}^{K} C_{cj}}$$
where $C$ is the $K \times K$ confusion matrix, $C_{ij}$ counts samples of actual class $i$ predicted as class $j$, $C_{cc}$ is the diagonal entry for class $c$, column sums give predicted totals, and row sums give actual totals.
Aggregation Methods:
To obtain a single precision/recall value for the entire classifier, we must aggregate per-class values. Three standard approaches exist:
| Method | Formula | Properties |
|---|---|---|
| Macro Average | $\frac{1}{K} \sum_{c=1}^{K} \text{Metric}_c$ | Treats all classes equally; sensitive to rare class performance |
| Weighted Average | $\sum_{c=1}^{K} \frac{n_c}{n} \cdot \text{Metric}_c$ | Weights by class frequency; reflects overall prediction quality |
| Micro Average | $\frac{\sum_c TP_c}{\sum_c (TP_c + FP_c)}$ | Globally aggregates counts; equivalent to accuracy for multi-class |
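The table's claim that micro averaging is equivalent to accuracy can be checked directly (labels below are invented for illustration):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, accuracy_score

# Small 3-class example, invented for illustration
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 2, 2, 2, 2, 0, 1])

micro_p = precision_score(y_true, y_pred, average='micro')
micro_r = recall_score(y_true, y_pred, average='micro')
acc = accuracy_score(y_true, y_pred)

# In single-label multi-class problems, every FP for one class is an FN
# for another, so sum(TP)/sum(TP+FP) = sum(TP)/sum(TP+FN) = correct/total
print(np.isclose(micro_p, acc) and np.isclose(micro_r, acc))  # True
```

This is why micro-averaged precision, recall, and F1 all coincide for single-label multi-class data, and why macro averaging is usually the more informative choice under class imbalance.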
```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, confusion_matrix
from sklearn.metrics import classification_report

def multiclass_precision_recall_analysis():
    """
    Demonstrate per-class and aggregated precision/recall for multi-class.
    """
    # 4-class classification example with imbalance
    classes = ['Cat', 'Dog', 'Bird', 'Fish']
    y_true = [0]*50 + [1]*30 + [2]*15 + [3]*5       # Imbalanced: 50, 30, 15, 5
    y_pred = ([0]*40 + [1]*5 + [2]*3 + [3]*2 +      # Cat predictions
              [1]*25 + [0]*3 + [2]*2 +              # Dog predictions
              [2]*12 + [0]*2 + [1]*1 +              # Bird predictions
              [3]*3 + [0]*1 + [1]*1)                # Fish predictions

    cm = confusion_matrix(y_true, y_pred)

    print("Multi-Class Precision and Recall Analysis")
    print("=" * 60)
    print("Confusion Matrix:")
    print("             Predicted")
    print("           Cat  Dog Bird Fish")
    for i, cls in enumerate(classes):
        print(f"{cls:>8}  {cm[i, 0]:3d}  {cm[i, 1]:3d}  {cm[i, 2]:3d}  {cm[i, 3]:3d}")

    # Per-class metrics
    print("Per-Class Metrics:")
    print("-" * 50)
    for i, cls in enumerate(classes):
        tp = cm[i, i]
        fp = cm[:, i].sum() - tp  # Column sum minus diagonal
        fn = cm[i, :].sum() - tp  # Row sum minus diagonal
        prec = tp / (tp + fp) if (tp + fp) > 0 else 0
        rec = tp / (tp + fn) if (tp + fn) > 0 else 0
        print(f"{cls}: Precision={prec:.3f}, Recall={rec:.3f}")
        print(f"     TP={tp}, FP={fp}, FN={fn}")

    # Aggregated metrics
    print("Aggregated Metrics Comparison:")
    print("-" * 50)
    for avg in ['macro', 'weighted', 'micro']:
        prec = precision_score(y_true, y_pred, average=avg)
        rec = recall_score(y_true, y_pred, average=avg)
        print(f"{avg.capitalize():>10} Average: Precision={prec:.3f}, Recall={rec:.3f}")

    # Full classification report
    print("Full Classification Report:")
    print(classification_report(y_true, y_pred, target_names=classes))

multiclass_precision_recall_analysis()
```

Precision and recall are known by many names across different fields. Understanding this terminology prevents confusion when reading literature from medicine, information retrieval, or statistics.
Precision Aliases:

- Positive Predictive Value (PPV), the standard term in medicine and diagnostics

Recall Aliases:

- Sensitivity, common in medicine and epidemiology
- True Positive Rate (TPR), used in ROC analysis
- Hit Rate, used in signal detection theory

Related Metrics:
| Metric | Formula | Relationship |
|---|---|---|
| Specificity (TNR) | $TN / (TN + FP)$ | Recall for the negative class |
| Negative Predictive Value (NPV) | $TN / (TN + FN)$ | Precision for the negative class |
| False Discovery Rate (FDR) | $FP / (FP + TP) = 1 - Precision$ | Complement of precision |
| False Omission Rate (FOR) | $FN / (FN + TN) = 1 - NPV$ | Complement of NPV |
| Positive Likelihood Ratio (LR+) | $TPR / FPR$ | How much more likely a positive result is for actual positives than for actual negatives |
| Negative Likelihood Ratio (LR-) | $FNR / TNR$ | How much more likely a negative result is for actual positives than for actual negatives |
Precision and recall for the positive class, combined with specificity and NPV (precision/recall for the negative class), provide a complete picture of classifier performance. Together, they account for all four cells of the confusion matrix.
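A short sketch computing all four quadrant metrics from one confusion matrix (the counts are invented for illustration):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical screening data: 20 positives, 80 negatives
y_true = [1]*20 + [0]*80
y_pred = [1]*15 + [0]*5 + [1]*10 + [0]*70

TN, FP, FN, TP = confusion_matrix(y_true, y_pred).ravel()

precision = TP / (TP + FP)    # PPV: 15/25
recall = TP / (TP + FN)       # TPR: 15/20
specificity = TN / (TN + FP)  # recall for the negative class: 70/80
npv = TN / (TN + FN)          # precision for the negative class: 70/75
fdr = FP / (FP + TP)          # 1 - precision
for_ = FN / (FN + TN)         # 1 - NPV

print(f"Precision (PPV):   {precision:.3f}")
print(f"Recall (TPR):      {recall:.3f}")
print(f"Specificity (TNR): {specificity:.3f}")
print(f"NPV:               {npv:.3f}")
print(f"FDR = 1-Precision: {fdr:.3f}")
print(f"FOR = 1-NPV:       {for_:.3f}")
```

Each metric uses exactly two of the four confusion-matrix cells, and together the four pairs cover every cell from both the prediction and the ground-truth perspective.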
Effective use of precision and recall requires operational considerations beyond their mathematical definitions: state the positive class explicitly, report the decision threshold at which the metrics were computed, and evaluate on data whose class balance reflects deployment conditions.
Precision and recall provide a richer view of classifier performance than accuracy by separating two distinct aspects of quality.
What's Next:
Reporting precision and recall as two separate numbers is informative but sometimes inconvenient. The next page introduces the F1 score and F-beta family—metrics that combine precision and recall into a single number while allowing control over their relative importance.
You now understand precision and recall at a deep level—their definitions, intuition, trade-offs, domain-specific application, and practical considerations. These metrics form the foundation for understanding all subsequent classification metrics.