In medical diagnostics, where classification models have perhaps their most consequential applications, practitioners rarely speak of 'precision' or 'recall.' Instead, they use sensitivity and specificity—a complementary pair of metrics that describe classifier performance from the perspective of each actual class.
While sensitivity is identical to recall (just different terminology), specificity provides crucial information that precision and recall don't capture: how well the classifier identifies true negatives. This perspective is essential in domains where correct identification of the 'normal' or 'healthy' case is as important as identifying the 'positive' case.
This page will thoroughly examine sensitivity and specificity, their relationship to other metrics, their role in ROC analysis, and their practical interpretation in real-world classification systems.
By the end of this page, you will understand sensitivity and specificity in depth: their complementary relationship, their connection to ROC curves, their use in diagnostic testing via likelihood ratios, and when to prioritize each metric in practical applications.
Sensitivity (True Positive Rate):
Sensitivity measures the proportion of actual positives correctly identified:
$$\text{Sensitivity} = \frac{TP}{TP + FN} = \frac{TP}{P} = P(\hat{y} = 1 | y = 1)$$
Specificity (True Negative Rate):
Specificity measures the proportion of actual negatives correctly identified:
$$\text{Specificity} = \frac{TN}{TN + FP} = \frac{TN}{N} = P(\hat{y} = 0 | y = 0)$$
Sensitivity and specificity are symmetric counterparts: sensitivity is 'recall for the positive class' while specificity is 'recall for the negative class.' Together, they describe how well the classifier performs on each of the two ground-truth classes, independently of how the classes are balanced in the data.
| Metric | Formula | Complement | Error Penalized |
|---|---|---|---|
| Sensitivity (TPR) | $TP / (TP + FN)$ | False Negative Rate = $1 - \text{Sensitivity}$ | False Negatives |
| Specificity (TNR) | $TN / (TN + FP)$ | False Positive Rate = $1 - \text{Specificity}$ | False Positives |
Sensitivity and specificity are conditional probabilities conditioned on the true class: sensitivity is $P(\hat{y} = 1 \mid y = 1)$ and specificity is $P(\hat{y} = 0 \mid y = 0)$.
This conditioning is fundamentally different from precision and NPV, which condition on the prediction instead: precision is $P(y = 1 \mid \hat{y} = 1)$ and NPV is $P(y = 0 \mid \hat{y} = 0)$.
The Crucial Distinction:
Sensitivity and specificity are intrinsic properties of the test/classifier—they measure what the classifier does given certain types of inputs. They are (theoretically) independent of the class distribution in the population.
Precision and NPV are population-dependent—they depend not only on the classifier but also on the prevalence of the positive class in the population being tested.
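This claim is easy to check numerically. The sketch below (with hypothetical numbers) holds a classifier's behavior fixed at 90% sensitivity and 90% specificity while varying prevalence: sensitivity and specificity stay put, but precision collapses as positives become rare.

```python
def rates_at_prevalence(n_pos, n_neg, sens=0.9, spec=0.9):
    """Confusion-matrix counts implied by fixed sensitivity/specificity."""
    TP = sens * n_pos
    FN = n_pos - TP
    TN = spec * n_neg
    FP = n_neg - TN
    sensitivity = TP / (TP + FN)
    specificity = TN / (TN + FP)
    precision = TP / (TP + FP)
    return sensitivity, specificity, precision

# Same test characteristics, two prevalences: 50% vs 1%
for n_pos, n_neg in [(500, 500), (10, 990)]:
    sens, spec, prec = rates_at_prevalence(n_pos, n_neg)
    print(f"prevalence={n_pos/(n_pos+n_neg):.0%}: "
          f"sensitivity={sens:.2f}, specificity={spec:.2f}, precision={prec:.2f}")
```

At 50% prevalence all three rates are 0.90; at 1% prevalence sensitivity and specificity are unchanged while precision falls to 9/108 ≈ 0.08, because the 99 false positives from the large healthy pool swamp the 9 true positives.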
```python
from sklearn.metrics import confusion_matrix

def compute_sensitivity_specificity(y_true, y_pred):
    """
    Calculate sensitivity and specificity from predictions.
    """
    cm = confusion_matrix(y_true, y_pred)
    TN, FP, FN, TP = cm.ravel()

    # Sensitivity (True Positive Rate)
    sensitivity = TP / (TP + FN) if (TP + FN) > 0 else 0
    # Specificity (True Negative Rate)
    specificity = TN / (TN + FP) if (TN + FP) > 0 else 0
    # Complements
    fnr = FN / (TP + FN) if (TP + FN) > 0 else 0  # False Negative Rate
    fpr = FP / (TN + FP) if (TN + FP) > 0 else 0  # False Positive Rate

    return {
        'sensitivity': sensitivity,
        'specificity': specificity,
        'fnr': fnr,  # = 1 - sensitivity
        'fpr': fpr,  # = 1 - specificity
        'TP': TP, 'TN': TN, 'FP': FP, 'FN': FN,
    }

# Example: Medical diagnostic test
# True: 1 = Disease, 0 = Healthy
y_true = [1]*100 + [0]*900  # 10% prevalence

# Predictions: a test with good sensitivity, moderate specificity
# Catches 95% of disease cases, but 10% false alarm rate
y_pred = [1]*95 + [0]*5 + [0]*810 + [1]*90

result = compute_sensitivity_specificity(y_true, y_pred)

print("Medical Diagnostic Test Performance")
print("=" * 50)
print("Confusion Matrix:")
print("                   Predicted")
print("                 Healthy  Disease")
print(f"Actual Healthy    {result['TN']:4d}    {result['FP']:4d}")
print(f"Actual Disease    {result['FN']:4d}    {result['TP']:4d}")

print(f"Sensitivity (TPR): {result['sensitivity']:.1%}")
print(f"  Interpretation: {result['sensitivity']:.0%} of patients WITH disease test positive")
print(f"Specificity (TNR): {result['specificity']:.1%}")
print(f"  Interpretation: {result['specificity']:.0%} of HEALTHY individuals test negative")

print("Complementary rates:")
print(f"  False Negative Rate: {result['fnr']:.1%} (= 1 - Sensitivity)")
print(f"  False Positive Rate: {result['fpr']:.1%} (= 1 - Specificity)")
```

Medical diagnostics provides the canonical application of sensitivity and specificity.
Understanding the clinical interpretation helps build intuition that transfers to all classification problems.
Screening vs. Confirmatory Tests:
Different medical contexts prioritize different metrics:
Screening Tests (used on the general population): prioritize sensitivity. A missed case forfeits the chance of early treatment, while a false positive merely triggers a follow-up test.
Confirmatory Tests (used after a positive screen): prioritize specificity. Before committing a patient to treatment, you need confidence that the positive result is real.
In medical practice, screening tests with high sensitivity are often followed by confirmatory tests with high specificity. This cascade approach combines the benefits of both: catching most cases initially (sensitivity) and confirming before treatment (specificity).
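The arithmetic of this cascade can be sketched under a simplifying assumption of conditional independence between the two tests (real tests often share failure modes, so treat these hypothetical numbers as illustrative). A case is treated only if both tests come back positive:

```python
def serial_cascade(sens1, spec1, sens2, spec2):
    """Combined characteristics of two tests applied in series.

    Assumes the tests' errors are conditionally independent given
    the true disease state. Overall positive = positive on BOTH tests.
    """
    sens = sens1 * sens2                # both tests must catch the case
    spec = spec1 + (1 - spec1) * spec2  # a negative on either test clears the person
    return sens, spec

# Hypothetical numbers: a sensitive screen followed by a specific confirmation
sens, spec = serial_cascade(sens1=0.98, spec1=0.90, sens2=0.95, spec2=0.99)
print(f"Cascade sensitivity: {sens:.3f}")   # 0.98 * 0.95
print(f"Cascade specificity: {spec:.4f}")   # 0.90 + 0.10 * 0.99
```

The cascade trades a little sensitivity (0.98 × 0.95 ≈ 0.931) for a large gain in specificity (0.999), which is usually the right trade before an invasive or costly treatment.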
ROC space (Receiver Operating Characteristic space) is the 2D plane where classifiers are represented by their (False Positive Rate, True Positive Rate) = (1 - Specificity, Sensitivity) coordinates.
ROC Space Geometry:
- The point (0, 1) is the perfect classifier: 100% sensitivity and 100% specificity.
- The diagonal from (0, 0) to (1, 1) represents random guessing.
- (0, 0) is the classifier that predicts everything negative; (1, 1) predicts everything positive.
- Points above the diagonal are better than chance; points below are worse (inverting their predictions would move them above).
The ROC Curve:
For a probabilistic classifier, sweeping the decision threshold traces out a curve in ROC space. Each point on the curve corresponds to one threshold and one sensitivity-specificity trade-off: lowering the threshold moves up and to the right (more sensitive, less specific), while raising it moves down and to the left.
The Area Under the ROC Curve (AUC) summarizes overall discriminative ability across all thresholds.
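One way to build intuition for AUC is its rank interpretation: AUC equals the probability that a randomly chosen positive is scored above a randomly chosen negative (ties counted as one half). The sketch below, using illustrative synthetic score distributions, checks this equivalence against sklearn's curve-based computation:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
pos = rng.normal(0.7, 0.15, 200)   # scores assigned to positives
neg = rng.normal(0.4, 0.15, 300)   # scores assigned to negatives
y = np.r_[np.ones(200), np.zeros(300)]
s = np.r_[pos, neg]

auc = roc_auc_score(y, s)
# Pairwise view: fraction of (positive, negative) pairs ranked correctly
pairwise = (np.mean(pos[:, None] > neg[None, :])
            + 0.5 * np.mean(pos[:, None] == neg[None, :]))
print(f"AUC via ROC integration:  {auc:.4f}")
print(f"AUC via pairwise ranking: {pairwise:.4f}")
```

The two numbers agree up to floating-point error; AUC is exactly the Mann-Whitney U statistic normalized by the number of positive-negative pairs.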
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

def visualize_roc_space():
    """
    Visualize ROC space and the sensitivity-specificity trade-off.
    """
    np.random.seed(42)

    # Generate data for three models of different quality
    y_true = np.array([1]*500 + [0]*500)

    # Good model: clear separation
    scores_good = np.concatenate([
        np.clip(np.random.normal(0.7, 0.15, 500), 0, 1),  # Positives
        np.clip(np.random.normal(0.3, 0.15, 500), 0, 1),  # Negatives
    ])
    # Medium model: moderate separation
    scores_medium = np.concatenate([
        np.clip(np.random.normal(0.6, 0.2, 500), 0, 1),
        np.clip(np.random.normal(0.4, 0.2, 500), 0, 1),
    ])
    # Poor model: little separation
    scores_poor = np.concatenate([
        np.clip(np.random.normal(0.55, 0.25, 500), 0, 1),
        np.clip(np.random.normal(0.45, 0.25, 500), 0, 1),
    ])

    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Left: ROC curves
    for scores, label, color in [(scores_good, 'Good Model', 'green'),
                                 (scores_medium, 'Medium Model', 'orange'),
                                 (scores_poor, 'Poor Model', 'red')]:
        fpr, tpr, thresholds = roc_curve(y_true, scores)
        auc = roc_auc_score(y_true, scores)
        axes[0].plot(fpr, tpr, color=color, linewidth=2,
                     label=f'{label} (AUC={auc:.3f})')
    axes[0].plot([0, 1], [0, 1], 'k--', label='Random (AUC=0.5)')
    axes[0].scatter([0], [1], color='blue', s=100, zorder=5, label='Perfect Classifier')
    axes[0].set_xlabel('False Positive Rate (1 - Specificity)', fontsize=12)
    axes[0].set_ylabel('True Positive Rate (Sensitivity)', fontsize=12)
    axes[0].set_title('ROC Curves', fontsize=14)
    axes[0].legend(loc='lower right')
    axes[0].grid(True, alpha=0.3)

    # Right: sensitivity vs specificity at different thresholds
    fpr, tpr, thresholds = roc_curve(y_true, scores_good)
    specificity = 1 - fpr
    # Skip index 0: sklearn prepends a sentinel threshold above all scores
    axes[1].plot(thresholds[1:], tpr[1:], 'b-', linewidth=2, label='Sensitivity')
    axes[1].plot(thresholds[1:], specificity[1:], 'r-', linewidth=2, label='Specificity')

    # Find balanced point (where the curves cross)
    balanced_idx = np.argmin(np.abs(tpr[1:] - specificity[1:])) + 1
    axes[1].axvline(x=thresholds[balanced_idx], color='gray', linestyle='--',
                    label=f'Balanced (τ={thresholds[balanced_idx]:.2f})')
    axes[1].set_xlabel('Decision Threshold', fontsize=12)
    axes[1].set_ylabel('Rate', fontsize=12)
    axes[1].set_title('Sensitivity-Specificity Trade-off vs Threshold', fontsize=14)
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.savefig('roc_space.png', dpi=150)
    plt.show()

    # Numerical summary
    print("Sensitivity-Specificity Trade-off Analysis")
    print("=" * 55)
    print("At different thresholds for the Good Model:")
    print(f"{'Threshold':>10} {'Sensitivity':>12} {'Specificity':>12} {'Sum':>10}")
    print("-" * 55)
    n = len(thresholds)
    for idx in [1, n//4, n//2, 3*n//4, n-1]:
        print(f"{thresholds[idx]:>10.2f} {tpr[idx]:>12.3f} "
              f"{specificity[idx]:>12.3f} {tpr[idx]+specificity[idx]:>10.3f}")

visualize_roc_space()
```

Youden's J = Sensitivity + Specificity - 1 = TPR - FPR measures how far the classifier is from the random diagonal. When sensitivity and specificity are equally important, the threshold that maximizes J often provides a good operating point.
Likelihood ratios combine sensitivity and specificity into measures of diagnostic utility that indicate how much a test result changes the odds of disease.
Positive Likelihood Ratio (LR+):
$$LR^+ = \frac{\text{Sensitivity}}{1 - \text{Specificity}} = \frac{TPR}{FPR}$$
Interpretation: How many times more likely is a positive result in someone with the disease compared to someone without?
Negative Likelihood Ratio (LR-):
$$LR^- = \frac{1 - \text{Sensitivity}}{\text{Specificity}} = \frac{FNR}{TNR}$$
Interpretation: How many times more likely is a negative result in someone with the disease compared to someone without?
```python
def compute_likelihood_ratios(sensitivity, specificity):
    """
    Calculate likelihood ratios from sensitivity and specificity.
    """
    # Positive Likelihood Ratio
    lr_pos = sensitivity / (1 - specificity) if specificity < 1 else float('inf')
    # Negative Likelihood Ratio
    lr_neg = (1 - sensitivity) / specificity if specificity > 0 else float('inf')
    return lr_pos, lr_neg

def interpret_lr(lr_pos, lr_neg):
    """
    Provide clinical interpretation of likelihood ratios.
    """
    # LR+ interpretation
    if lr_pos > 10:
        pos_interp = "Large, conclusive increase in disease probability"
    elif lr_pos > 5:
        pos_interp = "Moderate increase in disease probability"
    elif lr_pos > 2:
        pos_interp = "Small increase in disease probability"
    else:
        pos_interp = "Minimal/no diagnostic value"

    # LR- interpretation
    if lr_neg < 0.1:
        neg_interp = "Large, conclusive decrease in disease probability"
    elif lr_neg < 0.2:
        neg_interp = "Moderate decrease in disease probability"
    elif lr_neg < 0.5:
        neg_interp = "Small decrease in disease probability"
    else:
        neg_interp = "Minimal/no diagnostic value"

    return pos_interp, neg_interp

def bayes_post_test_probability(pre_test_prob, lr):
    """
    Calculate post-test probability using a likelihood ratio.

    Uses: Post-test odds = Pre-test odds × LR
    """
    pre_test_odds = pre_test_prob / (1 - pre_test_prob)
    post_test_odds = pre_test_odds * lr
    return post_test_odds / (1 + post_test_odds)

# Example: HIV screening test
print("Likelihood Ratio Analysis: HIV Screening Test")
print("=" * 60)

# Typical HIV screening: 99.9% sensitivity, 99.5% specificity
sensitivity = 0.999
specificity = 0.995

lr_pos, lr_neg = compute_likelihood_ratios(sensitivity, specificity)
pos_interp, neg_interp = interpret_lr(lr_pos, lr_neg)

print("Test characteristics:")
print(f"  Sensitivity: {sensitivity:.1%}")
print(f"  Specificity: {specificity:.1%}")
print("Likelihood Ratios:")
print(f"  LR+: {lr_pos:.1f} - {pos_interp}")
print(f"  LR-: {lr_neg:.4f} - {neg_interp}")

# Demonstrate the Bayes' theorem application
print("Impact on Disease Probability:")
print("-" * 60)

pre_test_probs = [0.001, 0.01, 0.10, 0.50]  # Various pre-test probabilities
print(f"{'Pre-test prob':>14} | {'Post-test (+)':>14} | {'Post-test (-)':>14}")
print("-" * 60)
for pre_prob in pre_test_probs:
    post_prob_positive = bayes_post_test_probability(pre_prob, lr_pos)
    post_prob_negative = bayes_post_test_probability(pre_prob, lr_neg)
    print(f"{pre_prob:>14.1%} | {post_prob_positive:>14.1%} | {post_prob_negative:>14.4%}")

print("Key Insight: In low-prevalence settings, even excellent tests")
print("produce many false positives. At 0.1% prevalence, a positive test")
print(f"only raises probability to ~17% (PPV), despite an LR+ of {lr_pos:.0f}!")
```

Even tests with very high sensitivity and specificity can have low positive predictive value when the condition is rare. This is why likelihood ratios are more useful than sensitivity/specificity alone: they show how the test changes probability rather than just reporting rates.
Sensitivity and specificity exist in a web of related metrics. Understanding these relationships provides a unified view of classification evaluation.
The Complete Metric Family:
From the 2×2 confusion matrix, we can derive eight fundamental rates by dividing each cell by various marginals:
| Denominator | TP-based Rate | TN-based Rate | FP-based Rate | FN-based Rate |
|---|---|---|---|---|
| Actual Positive (P) | Sensitivity (TPR) | — | — | False Negative Rate (FNR) |
| Actual Negative (N) | — | Specificity (TNR) | False Positive Rate (FPR) | — |
| Predicted Positive (PP) | Precision (PPV) | — | False Discovery Rate (FDR) | — |
| Predicted Negative (PN) | — | Negative Predictive Value (NPV) | — | False Omission Rate (FOR) |
Key Relationships:
Sensitivity = Recall = TPR: Same metric, different names from different fields
Specificity = TNR: The 'recall' of the negative class
FPR = 1 - Specificity: Complement relationship
FNR = 1 - Sensitivity: Complement relationship
Balanced Accuracy = (Sensitivity + Specificity) / 2: Gives equal weight to both classes
Youden's J = Sensitivity + Specificity - 1: Distance from random diagonal in ROC space
Informedness = Sensitivity + Specificity - 1: another name for Youden's J, used in the informedness/markedness framework
Diagnostic Odds Ratio = LR+ / LR- = (TP × TN) / (FP × FN): Single number summarizing diagnostic accuracy
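The last identity can be sanity-checked numerically. A quick sketch with hypothetical counts (100 actual positives and 400 actual negatives at 80% sensitivity and 90% specificity):

```python
# Hypothetical confusion-matrix counts: 80% sensitivity, 90% specificity
TP, FN = 80, 20    # 100 actual positives
TN, FP = 360, 40   # 400 actual negatives

sens = TP / (TP + FN)
spec = TN / (TN + FP)

lr_pos = sens / (1 - spec)   # positive likelihood ratio
lr_neg = (1 - sens) / spec   # negative likelihood ratio

dor_from_lr = lr_pos / lr_neg            # LR+ / LR-
dor_from_counts = (TP * TN) / (FP * FN)  # cross-product of the confusion matrix

print(f"DOR from likelihood ratios: {dor_from_lr:.1f}")
print(f"DOR from raw counts:        {dor_from_counts:.1f}")
```

Both routes give 36.0, confirming that the DOR is simply the cross-product ratio of the confusion matrix.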
```python
from sklearn.metrics import confusion_matrix

def complete_metrics_from_confusion_matrix(y_true, y_pred):
    """
    Derive all the fundamental rates and aggregates from the confusion matrix.
    """
    cm = confusion_matrix(y_true, y_pred)
    TN, FP, FN, TP = cm.ravel()

    # Marginal totals
    P = TP + FN    # Actual positives
    N = TN + FP    # Actual negatives
    PP = TP + FP   # Predicted positives
    PN = TN + FN   # Predicted negatives
    n = P + N      # Total

    # Rates conditioned on actual class
    sensitivity = TP / P if P > 0 else 0  # TPR
    specificity = TN / N if N > 0 else 0  # TNR
    fnr = FN / P if P > 0 else 0          # False Negative Rate
    fpr = FP / N if N > 0 else 0          # False Positive Rate

    # Rates conditioned on predicted class
    precision = TP / PP if PP > 0 else 0  # PPV
    npv = TN / PN if PN > 0 else 0        # Negative Predictive Value
    fdr = FP / PP if PP > 0 else 0        # False Discovery Rate
    for_ = FN / PN if PN > 0 else 0       # False Omission Rate

    # Aggregate metrics
    accuracy = (TP + TN) / n
    balanced_accuracy = (sensitivity + specificity) / 2
    youden_j = sensitivity + specificity - 1  # = TPR - FPR

    # Likelihood ratios
    lr_pos = sensitivity / fpr if fpr > 0 else float('inf')
    lr_neg = fnr / specificity if specificity > 0 else float('inf')

    # Diagnostic odds ratio
    dor = (TP * TN) / (FP * FN) if (FP * FN) > 0 else float('inf')

    return {
        # Actual-class conditioned
        'Sensitivity (TPR)': sensitivity,
        'Specificity (TNR)': specificity,
        'False Negative Rate': fnr,
        'False Positive Rate': fpr,
        # Predicted-class conditioned
        'Precision (PPV)': precision,
        'NPV': npv,
        'FDR': fdr,
        'FOR': for_,
        # Aggregates
        'Accuracy': accuracy,
        'Balanced Accuracy': balanced_accuracy,
        "Youden's J (Informedness)": youden_j,
        'LR+': lr_pos,
        'LR-': lr_neg,
        'DOR': dor,
    }

# Example
y_true = [1]*100 + [0]*400
y_pred = [1]*80 + [0]*20 + [0]*360 + [1]*40  # 80% sens, 90% spec

print("Complete Metrics Derivation")
print("=" * 55)

metrics = complete_metrics_from_confusion_matrix(y_true, y_pred)

print("Actual-Class Conditioned Rates:")
for name in ['Sensitivity (TPR)', 'Specificity (TNR)',
             'False Negative Rate', 'False Positive Rate']:
    print(f"  {name}: {metrics[name]:.3f}")

print("Predicted-Class Conditioned Rates:")
for name in ['Precision (PPV)', 'NPV', 'FDR', 'FOR']:
    print(f"  {name}: {metrics[name]:.3f}")

print("Aggregate Metrics:")
for name in ['Accuracy', 'Balanced Accuracy', "Youden's J (Informedness)"]:
    print(f"  {name}: {metrics[name]:.3f}")

print("Diagnostic Utility:")
print(f"  LR+: {metrics['LR+']:.1f}")
print(f"  LR-: {metrics['LR-']:.3f}")
print(f"  DOR: {metrics['DOR']:.1f}")
```

Both pairs (sensitivity/specificity and precision/recall) provide valuable but different perspectives. The choice depends on your evaluation context:
Use Sensitivity and Specificity when:
- Performance on both classes matters, and you want metrics that transfer across populations with different prevalence
- You are characterizing the test or classifier itself, as in medical diagnostics or signal detection
- You plan ROC analysis or threshold selection (e.g., via Youden's J)
Use Precision and Recall when:
- The positive class is rare and the negative class is abundant or uninteresting (retrieval, anomaly detection)
- You care about the cost of acting on positive predictions, i.e., what fraction of flagged items are real
- True negatives are ill-defined or effectively unbounded (e.g., all documents not retrieved)
| Aspect | Sensitivity/Specificity | Precision/Recall |
|---|---|---|
| Conditioning | Conditioned on actual class | Mixed conditioning (classes and predictions) |
| Stability | Stable across different test set compositions | Precision depends on class distribution |
| Origin | Medical diagnostics, signal detection | Information retrieval, search engines |
| Best for imbalance | Less informative for rare positives | More informative for rare positives |
| ROC vs PR | Maps to ROC curves | Maps to Precision-Recall curves |
| Negative class info | Specificity captures negative class performance | No direct negative class metric |
In practice, report both pairs when feasible. Since Sensitivity = Recall, reporting Sensitivity, Specificity, and Precision covers three of the four perspectives; add NPV for the predicted-negative view. Together, they provide a complete picture of classifier performance from both the actual-class and predicted-class viewpoints.
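scikit-learn makes this convenient: `recall_score` computes sensitivity, and the same function with `pos_label=0` computes specificity, since specificity is simply recall of the negative class; `precision_score` with `pos_label=0` likewise yields NPV. A sketch using hypothetical counts:

```python
from sklearn.metrics import recall_score, precision_score

# Hypothetical data: 100 positives at 80% sensitivity, 400 negatives at 90% specificity
y_true = [1]*100 + [0]*400
y_pred = [1]*80 + [0]*20 + [0]*360 + [1]*40

sensitivity = recall_score(y_true, y_pred)                  # recall of class 1
specificity = recall_score(y_true, y_pred, pos_label=0)     # recall of class 0
precision   = precision_score(y_true, y_pred)               # PPV
npv         = precision_score(y_true, y_pred, pos_label=0)  # 'precision' of class 0

print(f"Sensitivity: {sensitivity:.2f}")  # 80/100
print(f"Specificity: {specificity:.2f}")  # 360/400
print(f"Precision:   {precision:.2f}")    # 80/120
print(f"NPV:         {npv:.2f}")          # 360/380
```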
Effective use of sensitivity and specificity requires attention to practical details: report the decision threshold at which they were measured, quantify their uncertainty (each is estimated from a limited number of positives or negatives, so confidence intervals matter), and state the prevalence of the evaluation population whenever predictive values are also reported.
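Uncertainty in particular is easy to overlook: sensitivity is estimated from only the positive cases, which may be few. A minimal percentile-bootstrap sketch (the resampling scheme and interval level are illustrative choices, not the only valid ones):

```python
import numpy as np

def bootstrap_sensitivity_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for sensitivity."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)          # resample cases with replacement
        yt, yp = y_true[idx], y_pred[idx]
        pos = yt == 1
        if pos.sum() == 0:
            continue                         # skip resamples with no positives
        stats.append((yp[pos] == 1).mean())  # TP / (TP + FN)
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Same example as earlier: 95% sensitivity measured on only 100 positives
y_true = [1]*100 + [0]*900
y_pred = [1]*95 + [0]*5 + [0]*810 + [1]*90

lo, hi = bootstrap_sensitivity_ci(y_true, y_pred)
print(f"Sensitivity 95% CI: [{lo:.3f}, {hi:.3f}]")
```

With only 100 positives, the interval around the 0.95 point estimate spans several percentage points, which can matter when comparing two diagnostic tests.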
Sensitivity and specificity provide a symmetric view of classifier performance from the perspective of each actual class, completing our understanding of confusion-matrix-derived metrics.
Module Complete:
You have now completed the Classification Metrics module. You understand the confusion matrix foundation, accuracy's limitations, precision and recall's complementary perspectives, F-scores as combined metrics, and sensitivity/specificity from the medical/diagnostic viewpoint. This comprehensive understanding enables you to select, calculate, and interpret the right metrics for any classification problem.
You now have a complete and deep understanding of classification metrics. You can derive any metric from the confusion matrix, understand the trade-offs each metric makes, and select appropriate metrics for specific application domains. This knowledge is foundational for the next module on ROC and Precision-Recall curves.