In medical diagnostics, where classification models have perhaps their most consequential applications, practitioners rarely speak of 'precision' or 'recall.' Instead, they use sensitivity and specificity—a complementary pair of metrics that describe classifier performance from the perspective of each actual class.
While sensitivity is identical to recall (just different terminology), specificity provides crucial information that precision and recall don't capture: how well the classifier identifies true negatives. This perspective is essential in domains where correct identification of the 'normal' or 'healthy' case is as important as identifying the 'positive' case.
This page will thoroughly examine sensitivity and specificity, their relationship to other metrics, their role in ROC analysis, and their practical interpretation in real-world classification systems.
By the end of this page, you will understand sensitivity and specificity in depth: their complementary relationship, their connection to ROC curves, their use in diagnostic testing via likelihood ratios, and when to prioritize each metric in practical applications.
Sensitivity (True Positive Rate):
Sensitivity measures the proportion of actual positives correctly identified:
$$\text{Sensitivity} = \frac{TP}{TP + FN} = \frac{TP}{P} = P(\hat{y} = 1 | y = 1)$$
Specificity (True Negative Rate):
Specificity measures the proportion of actual negatives correctly identified:
$$\text{Specificity} = \frac{TN}{TN + FP} = \frac{TN}{N} = P(\hat{y} = 0 | y = 0)$$
Sensitivity and specificity are symmetric counterparts: sensitivity is 'recall for the positive class' while specificity is 'recall for the negative class.' Together, they describe how well the classifier performs on each of the two ground-truth classes, independently of how the classes are balanced in the data.
| Metric | Formula | Complement | Error Penalized |
|---|---|---|---|
| Sensitivity (TPR) | $TP / (TP + FN)$ | False Negative Rate = $1 - \text{Sensitivity}$ | False Negatives |
| Specificity (TNR) | $TN / (TN + FP)$ | False Positive Rate = $1 - \text{Specificity}$ | False Positives |
Sensitivity and specificity are conditional probabilities conditioned on the true class: sensitivity is $P(\hat{y} = 1 \mid y = 1)$ and specificity is $P(\hat{y} = 0 \mid y = 0)$.
This conditioning is fundamentally different from precision and NPV, which condition on the prediction instead: precision is $P(y = 1 \mid \hat{y} = 1)$ and NPV is $P(y = 0 \mid \hat{y} = 0)$.
The Crucial Distinction:
Sensitivity and specificity are intrinsic properties of the test/classifier—they measure what the classifier does given certain types of inputs. They are (theoretically) independent of the class distribution in the population.
Precision and NPV are population-dependent—they depend not only on the classifier but also on the prevalence of the positive class in the population being tested.
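This claim is easy to check numerically. The sketch below (with hypothetical numbers) holds a classifier's behavior fixed at 90% sensitivity and 90% specificity while varying prevalence: sensitivity and specificity stay put, but precision collapses as positives become rare.

```python
def rates_at_prevalence(n_pos, n_neg, sens=0.9, spec=0.9):
    """Confusion-matrix counts implied by fixed sensitivity/specificity."""
    TP = sens * n_pos
    FN = n_pos - TP
    TN = spec * n_neg
    FP = n_neg - TN
    sensitivity = TP / (TP + FN)
    specificity = TN / (TN + FP)
    precision = TP / (TP + FP)
    return sensitivity, specificity, precision

# Same test characteristics, two prevalences: 50% vs 1%
for n_pos, n_neg in [(500, 500), (10, 990)]:
    sens, spec, prec = rates_at_prevalence(n_pos, n_neg)
    print(f"prevalence={n_pos/(n_pos+n_neg):.0%}: "
          f"sensitivity={sens:.2f}, specificity={spec:.2f}, precision={prec:.2f}")
```

At 50% prevalence all three rates are 0.90; at 1% prevalence sensitivity and specificity are unchanged while precision falls to 9/108 ≈ 0.08, because the 99 false positives from the large healthy pool swamp the 9 true positives.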
```python
from sklearn.metrics import confusion_matrix

def compute_sensitivity_specificity(y_true, y_pred):
    """
    Calculate sensitivity and specificity from predictions.
    """
    cm = confusion_matrix(y_true, y_pred)
    TN, FP, FN, TP = cm.ravel()

    # Sensitivity (True Positive Rate)
    sensitivity = TP / (TP + FN) if (TP + FN) > 0 else 0
    # Specificity (True Negative Rate)
    specificity = TN / (TN + FP) if (TN + FP) > 0 else 0
    # Complements
    fnr = FN / (TP + FN) if (TP + FN) > 0 else 0  # False Negative Rate
    fpr = FP / (TN + FP) if (TN + FP) > 0 else 0  # False Positive Rate

    return {
        'sensitivity': sensitivity,
        'specificity': specificity,
        'fnr': fnr,  # = 1 - sensitivity
        'fpr': fpr,  # = 1 - specificity
        'TP': TP, 'TN': TN, 'FP': FP, 'FN': FN,
    }

# Example: Medical diagnostic test
# True: 1 = Disease, 0 = Healthy
y_true = [1]*100 + [0]*900  # 10% prevalence

# Predictions: a test with good sensitivity, moderate specificity
# Catches 95% of disease cases, but 10% false alarm rate
y_pred = [1]*95 + [0]*5 + [0]*810 + [1]*90

result = compute_sensitivity_specificity(y_true, y_pred)

print("Medical Diagnostic Test Performance")
print("=" * 50)
print("Confusion Matrix:")
print("                   Predicted")
print("                 Healthy  Disease")
print(f"Actual Healthy    {result['TN']:4d}    {result['FP']:4d}")
print(f"Actual Disease    {result['FN']:4d}    {result['TP']:4d}")

print(f"Sensitivity (TPR): {result['sensitivity']:.1%}")
print(f"  Interpretation: {result['sensitivity']:.0%} of patients WITH disease test positive")
print(f"Specificity (TNR): {result['specificity']:.1%}")
print(f"  Interpretation: {result['specificity']:.0%} of HEALTHY individuals test negative")

print("Complementary rates:")
print(f"  False Negative Rate: {result['fnr']:.1%} (= 1 - Sensitivity)")
print(f"  False Positive Rate: {result['fpr']:.1%} (= 1 - Specificity)")
```

Medical diagnostics provides the canonical application of sensitivity and specificity.
Understanding the clinical interpretation helps build intuition that transfers to all classification problems.
Screening vs. Confirmatory Tests:
Different medical contexts prioritize different metrics:
Screening Tests (used on the general population): prioritize sensitivity. A missed case forfeits the chance of early treatment, while a false positive merely triggers a follow-up test.
Confirmatory Tests (used after a positive screen): prioritize specificity. Before committing a patient to treatment, you need confidence that the positive result is real.
In medical practice, screening tests with high sensitivity are often followed by confirmatory tests with high specificity. This cascade approach combines the benefits of both: catching most cases initially (sensitivity) and confirming before treatment (specificity).
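The arithmetic of this cascade can be sketched under a simplifying assumption of conditional independence between the two tests (real tests often share failure modes, so treat these hypothetical numbers as illustrative). A case is treated only if both tests come back positive:

```python
def serial_cascade(sens1, spec1, sens2, spec2):
    """Combined characteristics of two tests applied in series.

    Assumes the tests' errors are conditionally independent given
    the true disease state. Overall positive = positive on BOTH tests.
    """
    sens = sens1 * sens2                # both tests must catch the case
    spec = spec1 + (1 - spec1) * spec2  # a negative on either test clears the person
    return sens, spec

# Hypothetical numbers: a sensitive screen followed by a specific confirmation
sens, spec = serial_cascade(sens1=0.98, spec1=0.90, sens2=0.95, spec2=0.99)
print(f"Cascade sensitivity: {sens:.3f}")   # 0.98 * 0.95
print(f"Cascade specificity: {spec:.4f}")   # 0.90 + 0.10 * 0.99
```

The cascade trades a little sensitivity (0.98 × 0.95 ≈ 0.931) for a large gain in specificity (0.999), which is usually the right trade before an invasive or costly treatment.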
ROC space (Receiver Operating Characteristic space) is the 2D plane where classifiers are represented by their (False Positive Rate, True Positive Rate) = (1 - Specificity, Sensitivity) coordinates.
ROC Space Geometry:
- The point (0, 1) is the perfect classifier: 100% sensitivity and 100% specificity.
- The diagonal from (0, 0) to (1, 1) represents random guessing.
- (0, 0) is the classifier that predicts everything negative; (1, 1) predicts everything positive.
- Points above the diagonal are better than chance; points below are worse (inverting their predictions would move them above).
The ROC Curve:
For a probabilistic classifier, sweeping the decision threshold traces out a curve in ROC space. Each point on the curve corresponds to one threshold and one sensitivity-specificity trade-off: lowering the threshold moves up and to the right (more sensitive, less specific), while raising it moves down and to the left.
The Area Under the ROC Curve (AUC) summarizes overall discriminative ability across all thresholds.
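One way to build intuition for AUC is its rank interpretation: AUC equals the probability that a randomly chosen positive is scored above a randomly chosen negative (ties counted as one half). The sketch below, using illustrative synthetic score distributions, checks this equivalence against sklearn's curve-based computation:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
pos = rng.normal(0.7, 0.15, 200)   # scores assigned to positives
neg = rng.normal(0.4, 0.15, 300)   # scores assigned to negatives
y = np.r_[np.ones(200), np.zeros(300)]
s = np.r_[pos, neg]

auc = roc_auc_score(y, s)
# Pairwise view: fraction of (positive, negative) pairs ranked correctly
pairwise = (np.mean(pos[:, None] > neg[None, :])
            + 0.5 * np.mean(pos[:, None] == neg[None, :]))
print(f"AUC via ROC integration:  {auc:.4f}")
print(f"AUC via pairwise ranking: {pairwise:.4f}")
```

The two numbers agree up to floating-point error; AUC is exactly the Mann-Whitney U statistic normalized by the number of positive-negative pairs.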
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

def visualize_roc_space():
    """
    Visualize ROC space and the sensitivity-specificity trade-off.
    """
    np.random.seed(42)

    # Generate data for three models of different quality
    y_true = np.array([1]*500 + [0]*500)

    # Good model: clear separation
    scores_good = np.concatenate([
        np.clip(np.random.normal(0.7, 0.15, 500), 0, 1),  # Positives
        np.clip(np.random.normal(0.3, 0.15, 500), 0, 1),  # Negatives
    ])
    # Medium model: moderate separation
    scores_medium = np.concatenate([
        np.clip(np.random.normal(0.6, 0.2, 500), 0, 1),
        np.clip(np.random.normal(0.4, 0.2, 500), 0, 1),
    ])
    # Poor model: little separation
    scores_poor = np.concatenate([
        np.clip(np.random.normal(0.55, 0.25, 500), 0, 1),
        np.clip(np.random.normal(0.45, 0.25, 500), 0, 1),
    ])

    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Left: ROC curves
    for scores, label, color in [(scores_good, 'Good Model', 'green'),
                                 (scores_medium, 'Medium Model', 'orange'),
                                 (scores_poor, 'Poor Model', 'red')]:
        fpr, tpr, thresholds = roc_curve(y_true, scores)
        auc = roc_auc_score(y_true, scores)
        axes[0].plot(fpr, tpr, color=color, linewidth=2,
                     label=f'{label} (AUC={auc:.3f})')
    axes[0].plot([0, 1], [0, 1], 'k--', label='Random (AUC=0.5)')
    axes[0].scatter([0], [1], color='blue', s=100, zorder=5, label='Perfect Classifier')
    axes[0].set_xlabel('False Positive Rate (1 - Specificity)', fontsize=12)
    axes[0].set_ylabel('True Positive Rate (Sensitivity)', fontsize=12)
    axes[0].set_title('ROC Curves', fontsize=14)
    axes[0].legend(loc='lower right')
    axes[0].grid(True, alpha=0.3)

    # Right: sensitivity vs specificity at different thresholds
    fpr, tpr, thresholds = roc_curve(y_true, scores_good)
    specificity = 1 - fpr
    # Skip index 0: sklearn prepends a sentinel threshold above all scores
    axes[1].plot(thresholds[1:], tpr[1:], 'b-', linewidth=2, label='Sensitivity')
    axes[1].plot(thresholds[1:], specificity[1:], 'r-', linewidth=2, label='Specificity')

    # Find balanced point (where the curves cross)
    balanced_idx = np.argmin(np.abs(tpr[1:] - specificity[1:])) + 1
    axes[1].axvline(x=thresholds[balanced_idx], color='gray', linestyle='--',
                    label=f'Balanced (τ={thresholds[balanced_idx]:.2f})')
    axes[1].set_xlabel('Decision Threshold', fontsize=12)
    axes[1].set_ylabel('Rate', fontsize=12)
    axes[1].set_title('Sensitivity-Specificity Trade-off vs Threshold', fontsize=14)
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.savefig('roc_space.png', dpi=150)
    plt.show()

    # Numerical summary
    print("Sensitivity-Specificity Trade-off Analysis")
    print("=" * 55)
    print("At different thresholds for the Good Model:")
    print(f"{'Threshold':>10} {'Sensitivity':>12} {'Specificity':>12} {'Sum':>10}")
    print("-" * 55)
    n = len(thresholds)
    for idx in [1, n//4, n//2, 3*n//4, n-1]:
        print(f"{thresholds[idx]:>10.2f} {tpr[idx]:>12.3f} "
              f"{specificity[idx]:>12.3f} {tpr[idx]+specificity[idx]:>10.3f}")

visualize_roc_space()
```

Youden's J = Sensitivity + Specificity - 1 = TPR - FPR measures how far the classifier is from the random diagonal. When sensitivity and specificity are equally important, the threshold that maximizes J often provides a good operating point.
Likelihood ratios combine sensitivity and specificity into measures of diagnostic utility that indicate how much a test result changes the odds of disease.
Positive Likelihood Ratio (LR+):
$$LR^+ = \frac{\text{Sensitivity}}{1 - \text{Specificity}} = \frac{TPR}{FPR}$$
Interpretation: How many times more likely is a positive result in someone with the disease compared to someone without?
Negative Likelihood Ratio (LR-):
$$LR^- = \frac{1 - \text{Sensitivity}}{\text{Specificity}} = \frac{FNR}{TNR}$$
Interpretation: How many times more likely is a negative result in someone with the disease compared to someone without?
```python
def compute_likelihood_ratios(sensitivity, specificity):
    """
    Calculate likelihood ratios from sensitivity and specificity.
    """
    # Positive Likelihood Ratio
    lr_pos = sensitivity / (1 - specificity) if specificity < 1 else float('inf')
    # Negative Likelihood Ratio
    lr_neg = (1 - sensitivity) / specificity if specificity > 0 else float('inf')
    return lr_pos, lr_neg

def interpret_lr(lr_pos, lr_neg):
    """
    Provide clinical interpretation of likelihood ratios.
    """
    # LR+ interpretation
    if lr_pos > 10:
        pos_interp = "Large, conclusive increase in disease probability"
    elif lr_pos > 5:
        pos_interp = "Moderate increase in disease probability"
    elif lr_pos > 2:
        pos_interp = "Small increase in disease probability"
    else:
        pos_interp = "Minimal/no diagnostic value"

    # LR- interpretation
    if lr_neg < 0.1:
        neg_interp = "Large, conclusive decrease in disease probability"
    elif lr_neg < 0.2:
        neg_interp = "Moderate decrease in disease probability"
    elif lr_neg < 0.5:
        neg_interp = "Small decrease in disease probability"
    else:
        neg_interp = "Minimal/no diagnostic value"

    return pos_interp, neg_interp

def bayes_post_test_probability(pre_test_prob, lr):
    """
    Calculate post-test probability using a likelihood ratio.

    Uses: Post-test odds = Pre-test odds × LR
    """
    pre_test_odds = pre_test_prob / (1 - pre_test_prob)
    post_test_odds = pre_test_odds * lr
    return post_test_odds / (1 + post_test_odds)

# Example: HIV screening test
print("Likelihood Ratio Analysis: HIV Screening Test")
print("=" * 60)

# Typical HIV screening: 99.9% sensitivity, 99.5% specificity
sensitivity = 0.999
specificity = 0.995

lr_pos, lr_neg = compute_likelihood_ratios(sensitivity, specificity)
pos_interp, neg_interp = interpret_lr(lr_pos, lr_neg)

print("Test characteristics:")
print(f"  Sensitivity: {sensitivity:.1%}")
print(f"  Specificity: {specificity:.1%}")
print("Likelihood Ratios:")
print(f"  LR+: {lr_pos:.1f} - {pos_interp}")
print(f"  LR-: {lr_neg:.4f} - {neg_interp}")

# Demonstrate the Bayes' theorem application
print("Impact on Disease Probability:")
print("-" * 60)

pre_test_probs = [0.001, 0.01, 0.10, 0.50]  # Various pre-test probabilities
print(f"{'Pre-test prob':>14} | {'Post-test (+)':>14} | {'Post-test (-)':>14}")
print("-" * 60)
for pre_prob in pre_test_probs:
    post_prob_positive = bayes_post_test_probability(pre_prob, lr_pos)
    post_prob_negative = bayes_post_test_probability(pre_prob, lr_neg)
    print(f"{pre_prob:>14.1%} | {post_prob_positive:>14.1%} | {post_prob_negative:>14.4%}")

print("Key Insight: In low-prevalence settings, even excellent tests")
print("produce many false positives. At 0.1% prevalence, a positive test")
print(f"only raises probability to ~17% (PPV), despite an LR+ of {lr_pos:.0f}!")
```

Even tests with very high sensitivity and specificity can have low positive predictive value when the condition is rare. This is why likelihood ratios are more useful than sensitivity/specificity alone: they show how the test changes probability rather than just reporting rates.
Sensitivity and specificity exist in a web of related metrics. Understanding these relationships provides a unified view of classification evaluation.
The Complete Metric Family:
From the 2×2 confusion matrix, we can derive eight fundamental rates by dividing each cell by various marginals:
| Denominator | TP-based Rate | TN-based Rate | FP-based Rate | FN-based Rate |
|---|---|---|---|---|
| Actual Positive (P) | Sensitivity (TPR) | — | — | False Negative Rate (FNR) |
| Actual Negative (N) | — | Specificity (TNR) | False Positive Rate (FPR) | — |
| Predicted Positive (PP) | Precision (PPV) | — | False Discovery Rate (FDR) | — |
| Predicted Negative (PN) | — | Negative Predictive Value (NPV) | — | False Omission Rate (FOR) |
Key Relationships:
Sensitivity = Recall = TPR: Same metric, different names from different fields
Specificity = TNR: The 'recall' of the negative class
FPR = 1 - Specificity: Complement relationship
FNR = 1 - Sensitivity: Complement relationship
Balanced Accuracy = (Sensitivity + Specificity) / 2: Gives equal weight to both classes
Youden's J = Sensitivity + Specificity - 1: Distance from random diagonal in ROC space
Informedness = Sensitivity + Specificity - 1: another name for Youden's J, used in the informedness/markedness framework
Diagnostic Odds Ratio = LR+ / LR- = (TP × TN) / (FP × FN): Single number summarizing diagnostic accuracy
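The last identity can be sanity-checked numerically. A quick sketch with hypothetical counts (100 actual positives and 400 actual negatives at 80% sensitivity and 90% specificity):

```python
# Hypothetical confusion-matrix counts: 80% sensitivity, 90% specificity
TP, FN = 80, 20    # 100 actual positives
TN, FP = 360, 40   # 400 actual negatives

sens = TP / (TP + FN)
spec = TN / (TN + FP)

lr_pos = sens / (1 - spec)   # positive likelihood ratio
lr_neg = (1 - sens) / spec   # negative likelihood ratio

dor_from_lr = lr_pos / lr_neg            # LR+ / LR-
dor_from_counts = (TP * TN) / (FP * FN)  # cross-product of the confusion matrix

print(f"DOR from likelihood ratios: {dor_from_lr:.1f}")
print(f"DOR from raw counts:        {dor_from_counts:.1f}")
```

Both routes give 36.0, confirming that the DOR is simply the cross-product ratio of the confusion matrix.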
```python
from sklearn.metrics import confusion_matrix

def complete_metrics_from_confusion_matrix(y_true, y_pred):
    """
    Derive all the fundamental rates and aggregates from the confusion matrix.
    """
    cm = confusion_matrix(y_true, y_pred)
    TN, FP, FN, TP = cm.ravel()

    # Marginal totals
    P = TP + FN    # Actual positives
    N = TN + FP    # Actual negatives
    PP = TP + FP   # Predicted positives
    PN = TN + FN   # Predicted negatives
    n = P + N      # Total

    # Rates conditioned on actual class
    sensitivity = TP / P if P > 0 else 0  # TPR
    specificity = TN / N if N > 0 else 0  # TNR
    fnr = FN / P if P > 0 else 0          # False Negative Rate
    fpr = FP / N if N > 0 else 0          # False Positive Rate

    # Rates conditioned on predicted class
    precision = TP / PP if PP > 0 else 0  # PPV
    npv = TN / PN if PN > 0 else 0        # Negative Predictive Value
    fdr = FP / PP if PP > 0 else 0        # False Discovery Rate
    for_ = FN / PN if PN > 0 else 0       # False Omission Rate

    # Aggregate metrics
    accuracy = (TP + TN) / n
    balanced_accuracy = (sensitivity + specificity) / 2
    youden_j = sensitivity + specificity - 1  # = TPR - FPR

    # Likelihood ratios
    lr_pos = sensitivity / fpr if fpr > 0 else float('inf')
    lr_neg = fnr / specificity if specificity > 0 else float('inf')

    # Diagnostic odds ratio
    dor = (TP * TN) / (FP * FN) if (FP * FN) > 0 else float('inf')

    return {
        # Actual-class conditioned
        'Sensitivity (TPR)': sensitivity,
        'Specificity (TNR)': specificity,
        'False Negative Rate': fnr,
        'False Positive Rate': fpr,
        # Predicted-class conditioned
        'Precision (PPV)': precision,
        'NPV': npv,
        'FDR': fdr,
        'FOR': for_,
        # Aggregates
        'Accuracy': accuracy,
        'Balanced Accuracy': balanced_accuracy,
        "Youden's J (Informedness)": youden_j,
        'LR+': lr_pos,
        'LR-': lr_neg,
        'DOR': dor,
    }

# Example
y_true = [1]*100 + [0]*400
y_pred = [1]*80 + [0]*20 + [0]*360 + [1]*40  # 80% sens, 90% spec

print("Complete Metrics Derivation")
print("=" * 55)

metrics = complete_metrics_from_confusion_matrix(y_true, y_pred)

print("Actual-Class Conditioned Rates:")
for name in ['Sensitivity (TPR)', 'Specificity (TNR)',
             'False Negative Rate', 'False Positive Rate']:
    print(f"  {name}: {metrics[name]:.3f}")

print("Predicted-Class Conditioned Rates:")
for name in ['Precision (PPV)', 'NPV', 'FDR', 'FOR']:
    print(f"  {name}: {metrics[name]:.3f}")

print("Aggregate Metrics:")
for name in ['Accuracy', 'Balanced Accuracy', "Youden's J (Informedness)"]:
    print(f"  {name}: {metrics[name]:.3f}")

print("Diagnostic Utility:")
print(f"  LR+: {metrics['LR+']:.1f}")
print(f"  LR-: {metrics['LR-']:.3f}")
print(f"  DOR: {metrics['DOR']:.1f}")
```

Both pairs (sensitivity/specificity and precision/recall) provide valuable but different perspectives. The choice depends on your evaluation context:
Use Sensitivity and Specificity when:
- Performance on both classes matters, and you want metrics that transfer across populations with different prevalence
- You are characterizing the test or classifier itself, as in medical diagnostics or signal detection
- You plan ROC analysis or threshold selection (e.g., via Youden's J)
Use Precision and Recall when:
- The positive class is rare and the negative class is abundant or uninteresting (retrieval, anomaly detection)
- You care about the cost of acting on positive predictions, i.e., what fraction of flagged items are real
- True negatives are ill-defined or effectively unbounded (e.g., all documents not retrieved)
| Aspect | Sensitivity/Specificity | Precision/Recall |
|---|---|---|
| Conditioning | Conditioned on actual class | Mixed conditioning (classes and predictions) |
| Stability | Stable across different test set compositions | Precision depends on class distribution |
| Origin | Medical diagnostics, signal detection | Information retrieval, search engines |
| Best for imbalance | Less informative for rare positives | More informative for rare positives |
| ROC vs PR | Maps to ROC curves | Maps to Precision-Recall curves |
| Negative class info | Specificity captures negative class performance | No direct negative class metric |
In practice, report both pairs when feasible. Since Sensitivity = Recall, reporting Sensitivity, Specificity, and Precision covers three of the four perspectives; add NPV for the predicted-negative view. Together, they provide a complete picture of classifier performance from both the actual-class and predicted-class viewpoints.
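scikit-learn makes this convenient: `recall_score` computes sensitivity, and the same function with `pos_label=0` computes specificity, since specificity is simply recall of the negative class; `precision_score` with `pos_label=0` likewise yields NPV. A sketch using hypothetical counts:

```python
from sklearn.metrics import recall_score, precision_score

# Hypothetical data: 100 positives at 80% sensitivity, 400 negatives at 90% specificity
y_true = [1]*100 + [0]*400
y_pred = [1]*80 + [0]*20 + [0]*360 + [1]*40

sensitivity = recall_score(y_true, y_pred)                  # recall of class 1
specificity = recall_score(y_true, y_pred, pos_label=0)     # recall of class 0
precision   = precision_score(y_true, y_pred)               # PPV
npv         = precision_score(y_true, y_pred, pos_label=0)  # 'precision' of class 0

print(f"Sensitivity: {sensitivity:.2f}")  # 80/100
print(f"Specificity: {specificity:.2f}")  # 360/400
print(f"Precision:   {precision:.2f}")    # 80/120
print(f"NPV:         {npv:.2f}")          # 360/380
```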
Effective use of sensitivity and specificity requires attention to practical details: report the decision threshold at which they were measured, quantify their uncertainty (each is estimated from a limited number of positives or negatives, so confidence intervals matter), and state the prevalence of the evaluation population whenever predictive values are also reported.
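Uncertainty in particular is easy to overlook: sensitivity is estimated from only the positive cases, which may be few. A minimal percentile-bootstrap sketch (the resampling scheme and interval level are illustrative choices, not the only valid ones):

```python
import numpy as np

def bootstrap_sensitivity_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for sensitivity."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)          # resample cases with replacement
        yt, yp = y_true[idx], y_pred[idx]
        pos = yt == 1
        if pos.sum() == 0:
            continue                         # skip resamples with no positives
        stats.append((yp[pos] == 1).mean())  # TP / (TP + FN)
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Same example as earlier: 95% sensitivity measured on only 100 positives
y_true = [1]*100 + [0]*900
y_pred = [1]*95 + [0]*5 + [0]*810 + [1]*90

lo, hi = bootstrap_sensitivity_ci(y_true, y_pred)
print(f"Sensitivity 95% CI: [{lo:.3f}, {hi:.3f}]")
```

With only 100 positives, the interval around the 0.95 point estimate spans several percentage points, which can matter when comparing two diagnostic tests.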
Sensitivity and specificity provide a symmetric view of classifier performance from the perspective of each actual class, completing our understanding of confusion-matrix-derived metrics.
Module Complete:
You have now completed the Classification Metrics module. You understand the confusion matrix foundation, accuracy's limitations, precision and recall's complementary perspectives, F-scores as combined metrics, and sensitivity/specificity from the medical/diagnostic viewpoint. This comprehensive understanding enables you to select, calculate, and interpret the right metrics for any classification problem.
You now have a complete and deep understanding of classification metrics. You can derive any metric from the confusion matrix, understand the trade-offs each metric makes, and select appropriate metrics for specific application domains. This knowledge is foundational for the next module on ROC and Precision-Recall curves.