Precision and recall provide valuable separate perspectives on classifier performance, but practitioners often need a single number to compare models, select hyperparameters, or communicate performance to stakeholders. The challenge is combining these metrics in a meaningful way.
The F-score family provides this combination. The F1 score—the most widely used member of this family—is the harmonic mean of precision and recall, offering a balanced view of both metrics. The generalized F-beta score extends this concept to allow explicit control over the precision-recall trade-off, enabling domain-specific weighting.
Understanding why the harmonic mean (rather than the arithmetic mean) is used, how beta parameter selection affects the metric, and when F-scores are appropriate is essential knowledge for any ML practitioner.
By the end of this page, you will understand the mathematical derivation of F-scores, the properties that make the harmonic mean appropriate, how to select beta for your application, F-score behavior in edge cases, and multi-class aggregation strategies.
Why do we need a combined metric at all? Consider these practical scenarios:
Model Selection: You have five candidate models with the following precision-recall profiles:
| Model | Precision | Recall |
|---|---|---|
| A | 0.90 | 0.60 |
| B | 0.80 | 0.80 |
| C | 0.70 | 0.85 |
| D | 0.85 | 0.75 |
| E | 0.95 | 0.50 |
Which model is 'best'? Without a combined metric, there's no principled way to rank these models or select one for deployment.
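As a quick illustration, the sketch below (an added example, not part of the original material) ranks these five hypothetical models by F1 score, the combined metric defined later on this page:

```python
# Hypothetical precision/recall values copied from the table above
models = {
    'A': (0.90, 0.60),
    'B': (0.80, 0.80),
    'C': (0.70, 0.85),
    'D': (0.85, 0.75),
    'E': (0.95, 0.50),
}

# F1 = harmonic mean of precision and recall
f1 = {name: 2 * p * r / (p + r) for name, (p, r) in models.items()}

for name, score in sorted(f1.items(), key=lambda kv: kv[1], reverse=True):
    p, r = models[name]
    print(f"Model {name}: precision={p:.2f}, recall={r:.2f}, F1={score:.3f}")
```

Under F1, model B (0.800) narrowly edges out D (≈0.797), while E, despite having the highest precision, ranks last—exactly the kind of principled ordering a combined metric provides.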
Hyperparameter Optimization: During grid search or Bayesian optimization, you need a single objective function to maximize. Optimizing precision and recall simultaneously (multi-objective optimization) is more complex and often unnecessary.
Threshold Calibration: When selecting an operating threshold, you need a criterion. Maximizing F-score at the optimal threshold is a common and principled approach.
Communication: Stakeholders often want a single 'performance number' rather than multiple metrics that may conflict in their implications.
Any single-number summary loses information. The F-score compresses the 2D precision-recall space into 1D, hiding trade-off choices. Always examine precision and recall separately after using F-score for model selection or optimization.
The F1 score is defined as the harmonic mean of precision and recall:
$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}$$
The second form is derived by substituting the definitions of precision and recall:
$$F_1 = 2 \cdot \frac{\frac{TP}{TP+FP} \cdot \frac{TP}{TP+FN}}{\frac{TP}{TP+FP} + \frac{TP}{TP+FN}} = \frac{2 \cdot TP \cdot TP}{TP(TP+FN) + TP(TP+FP)} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}$$
Key Properties:
- F1 ranges from 0 to 1 and reaches 1 only when precision and recall are both perfect.
- If either precision or recall is 0, F1 is 0.
- F1 is symmetric in precision and recall—swapping the two leaves the score unchanged.
- True negatives do not appear anywhere in the formula, so F1 is insensitive to them.
```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score, confusion_matrix

def compute_f1_from_components(y_true, y_pred):
    """
    Calculate F1 score from precision and recall, and directly from the confusion matrix.
    """
    cm = confusion_matrix(y_true, y_pred)
    TN, FP, FN, TP = cm.ravel()

    precision = TP / (TP + FP) if (TP + FP) > 0 else 0
    recall = TP / (TP + FN) if (TP + FN) > 0 else 0

    # Method 1: Harmonic mean formula
    if precision + recall > 0:
        f1_harmonic = 2 * precision * recall / (precision + recall)
    else:
        f1_harmonic = 0

    # Method 2: Direct formula from confusion matrix
    f1_direct = 2 * TP / (2 * TP + FP + FN) if (2 * TP + FP + FN) > 0 else 0

    # Method 3: sklearn
    f1_sklearn = f1_score(y_true, y_pred)

    return {
        'precision': precision,
        'recall': recall,
        'f1_harmonic': f1_harmonic,
        'f1_direct': f1_direct,
        'f1_sklearn': f1_sklearn,
        'TP': TP, 'FP': FP, 'FN': FN, 'TN': TN,
    }

# Example
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

result = compute_f1_from_components(y_true, y_pred)

print("F1 Score Calculation Demonstration")
print("=" * 50)
print(f"\nConfusion Matrix: TP={result['TP']}, FP={result['FP']}, FN={result['FN']}, TN={result['TN']}")
print(f"Precision: {result['precision']:.4f}")
print(f"Recall: {result['recall']:.4f}")
print(f"\nF1 Score (harmonic mean): {result['f1_harmonic']:.4f}")
print(f"F1 Score (direct formula): {result['f1_direct']:.4f}")
print(f"F1 Score (sklearn): {result['f1_sklearn']:.4f}")
print(f"\nAll three methods give identical results: {np.allclose([result['f1_harmonic']], [result['f1_direct']])}")
```

The choice of harmonic mean over arithmetic or geometric mean is deliberate and important.
Comparison of Means:
For two values $a$ and $b$:
- Arithmetic mean: $\frac{a + b}{2}$
- Geometric mean: $\sqrt{a \cdot b}$
- Harmonic mean: $\frac{2ab}{a + b}$
The Ordering Property:
For any two positive values $a \neq b$:
$$\text{Harmonic} < \text{Geometric} < \text{Arithmetic}$$
Why This Matters for F1:
The harmonic mean is dominated by the smaller value. Consider precision = 0.9, recall = 0.1:
| Mean Type | Formula | Result |
|---|---|---|
| Arithmetic | (0.9 + 0.1) / 2 | 0.50 |
| Geometric | √(0.9 × 0.1) | 0.30 |
| Harmonic | 2×0.9×0.1 / (0.9+0.1) | 0.18 |
The harmonic mean (0.18) correctly reflects that this is a poor classifier—one metric being excellent cannot compensate for the other being terrible.
The harmonic mean gives 'veto power' to the lower value. If either precision or recall approaches zero, F1 approaches zero regardless of how good the other metric is. This prevents celebrating a model that excels at one aspect while completely failing at the other.
```python
import numpy as np
import matplotlib.pyplot as plt

def compare_means():
    """
    Visualize how different mean types behave with varying precision/recall.
    """
    # Fixed recall = 0.5, varying precision
    recall_fixed = 0.5
    precisions = np.linspace(0.01, 1, 100)

    arithmetic = (precisions + recall_fixed) / 2
    geometric = np.sqrt(precisions * recall_fixed)
    harmonic = 2 * precisions * recall_fixed / (precisions + recall_fixed)

    plt.figure(figsize=(12, 5))

    # Left plot: Different means
    plt.subplot(1, 2, 1)
    plt.plot(precisions, arithmetic, 'b-', linewidth=2, label='Arithmetic Mean')
    plt.plot(precisions, geometric, 'g-', linewidth=2, label='Geometric Mean')
    plt.plot(precisions, harmonic, 'r-', linewidth=2, label='Harmonic Mean (F1)')
    plt.axhline(y=recall_fixed, color='gray', linestyle='--', alpha=0.5, label=f'Recall = {recall_fixed}')
    plt.xlabel('Precision', fontsize=12)
    plt.ylabel('Combined Score', fontsize=12)
    plt.title('Different Means with Fixed Recall = 0.5', fontsize=14)
    plt.legend()
    plt.grid(True, alpha=0.3)

    # Right plot: F1 contours
    plt.subplot(1, 2, 2)
    p_grid, r_grid = np.meshgrid(np.linspace(0.01, 1, 100), np.linspace(0.01, 1, 100))
    f1_grid = 2 * p_grid * r_grid / (p_grid + r_grid)
    contour = plt.contourf(p_grid, r_grid, f1_grid, levels=20, cmap='viridis')
    plt.colorbar(contour, label='F1 Score')

    # Add iso-F1 curves
    for f1_target in [0.2, 0.4, 0.6, 0.8]:
        plt.contour(p_grid, r_grid, f1_grid, levels=[f1_target], colors='white', linestyles='--')

    plt.xlabel('Precision', fontsize=12)
    plt.ylabel('Recall', fontsize=12)
    plt.title('F1 Score Contours (Iso-F1 curves in white)', fontsize=14)

    plt.tight_layout()
    plt.savefig('harmonic_mean_properties.png', dpi=150)
    plt.show()

    # Numerical examples
    print("\nNumerical Comparison of Means")
    print("=" * 55)
    print(f"{'Precision':>10} {'Recall':>8} {'Arith':>8} {'Geom':>8} {'Harm(F1)':>10}")
    print("-" * 55)

    test_cases = [(0.9, 0.9), (0.9, 0.5), (0.9, 0.1), (0.5, 0.5), (0.1, 0.1)]
    for p, r in test_cases:
        a = (p + r) / 2
        g = np.sqrt(p * r)
        h = 2 * p * r / (p + r)
        print(f"{p:>10.1f} {r:>8.1f} {a:>8.2f} {g:>8.2f} {h:>10.2f}")

compare_means()
```

The F1 score weights precision and recall equally, but this isn't always desired. The F-beta score generalizes F1 to allow explicit control over the precision-recall trade-off:
$$F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$
Interpretation of Beta:
The beta parameter expresses how many times more important recall is than precision: $\beta < 1$ weights precision more heavily, $\beta = 1$ weights them equally, and $\beta > 1$ weights recall more heavily.
Common F-beta Variants:
| Score | Beta | Interpretation | Use Case |
|---|---|---|---|
| F0.5 | 0.5 | Precision twice as important as recall | Spam filtering, recommendations |
| F1 | 1.0 | Equal importance | General-purpose, balanced |
| F2 | 2.0 | Recall twice as important as precision | Medical screening, security |
Mathematical Intuition:
The F-beta score can be rewritten as:
$$F_\beta = \frac{(1 + \beta^2) \cdot TP}{(1 + \beta^2) \cdot TP + \beta^2 \cdot FN + FP}$$
The $\beta^2$ factor scales the penalty for False Negatives relative to False Positives.
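For instance, substituting $\beta = 2$ into this form (a direct substitution, added here for illustration) gives:

$$F_2 = \frac{5 \cdot TP}{5 \cdot TP + 4 \cdot FN + FP}$$

so each false negative lowers the F2 score as much as four false positives; for F0.5 the weighting is reversed, with each false positive counting four times as much as a false negative.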
```python
import numpy as np
from sklearn.metrics import fbeta_score, precision_score, recall_score
import matplotlib.pyplot as plt

def explore_fbeta(y_true, y_pred):
    """
    Explore how F-beta changes with different beta values.
    """
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)

    print("F-beta Score Exploration")
    print("=" * 60)
    print(f"Precision: {precision:.3f}")
    print(f"Recall: {recall:.3f}")
    print(f"\n{'Beta':>6} {'F-beta':>10} {'Interpretation':<30}")
    print("-" * 60)

    betas = [0.25, 0.5, 1.0, 2.0, 4.0]
    for beta in betas:
        f_beta = fbeta_score(y_true, y_pred, beta=beta)
        if beta < 1:
            interp = f"Precision {1/beta:.1f}x more important"
        elif beta > 1:
            interp = f"Recall {beta:.1f}x more important"
        else:
            interp = "Equal weight (F1)"
        print(f"{beta:>6.2f} {f_beta:>10.4f} {interp:<30}")

    return precision, recall

def visualize_fbeta_curves():
    """
    Visualize iso-F-beta curves for different beta values.
    """
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    betas = [0.5, 1.0, 2.0]
    titles = ['F0.5 (Precision-focused)', 'F1 (Balanced)', 'F2 (Recall-focused)']

    for ax, beta, title in zip(axes, betas, titles):
        p_grid, r_grid = np.meshgrid(np.linspace(0.01, 1, 100), np.linspace(0.01, 1, 100))
        f_grid = (1 + beta**2) * p_grid * r_grid / (beta**2 * p_grid + r_grid)

        contour = ax.contourf(p_grid, r_grid, f_grid, levels=20, cmap='viridis')
        plt.colorbar(contour, ax=ax, label=f'F{beta}')

        # Mark the P + R = 1 trade-off line
        ax.plot([0, 1], [1, 0], 'w--', alpha=0.5, label='P + R = 1')

        ax.set_xlabel('Precision', fontsize=12)
        ax.set_ylabel('Recall', fontsize=12)
        ax.set_title(title, fontsize=14)

    plt.tight_layout()
    plt.savefig('fbeta_contours.png', dpi=150)
    plt.show()

# Example: Model with high precision, lower recall
y_true = [1]*100 + [0]*900
y_pred = [1]*60 + [0]*40 + [0]*880 + [1]*20  # 60/100 recall, 60/80 precision

precision, recall = explore_fbeta(y_true, y_pred)

print(f"\nKey Insight:")
print(f"  With Precision={precision:.2%} and Recall={recall:.2%}:")
print(f"  - F0.5 rewards the higher precision")
print(f"  - F2 would reward higher recall (this model scores lower)")

visualize_fbeta_curves()
```

The beta parameter should be chosen based on domain requirements, not tuned to maximize the score. If recall is twice as important as precision in your application (e.g., missing a disease is twice as bad as a false alarm), use β=2. Choose beta before seeing results, not after.
Understanding the mathematical properties and behavior at boundaries is crucial for correct interpretation.
Range and Bounds:
$$0 \leq F_\beta \leq 1$$
Monotonicity:
$F_\beta$ is strictly increasing in both precision and recall—improving either one while holding the other fixed always increases the score.
Related Metrics:
$F_\beta$ can equivalently be written as a weighted harmonic mean of precision and recall, which follows from taking reciprocals of the definition above:
$$\frac{1}{F_\beta} = \frac{1}{1 + \beta^2} \cdot \frac{1}{\text{Precision}} + \frac{\beta^2}{1 + \beta^2} \cdot \frac{1}{\text{Recall}}$$
Dice Coefficient Connection:
The F1 score is equivalent to the Dice coefficient (or Sørensen-Dice coefficient) used in image segmentation:
$$\text{Dice} = \frac{2|A \cap B|}{|A| + |B|} = F_1$$
where A is the set of predicted positives and B is the set of actual positives.
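The small sketch below (an added check, not from the original text) verifies this equivalence on a toy prediction by treating predicted and actual positives as index sets:

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0])

A = set(np.flatnonzero(y_pred == 1))  # predicted positives
B = set(np.flatnonzero(y_true == 1))  # actual positives

dice = 2 * len(A & B) / (len(A) + len(B))
print(f"Dice = {dice:.4f}")                      # 2*2 / (3+4) = 0.5714
print(f"F1   = {f1_score(y_true, y_pred):.4f}")  # identical value
```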
```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score
import warnings

def explore_f1_edge_cases():
    """
    Demonstrate edge cases and boundary behavior of F1 score.
    """
    print("F1 Score Edge Cases and Boundary Behavior")
    print("=" * 55)

    # Case 1: Perfect classifier
    print("\nCase 1: Perfect classifier")
    y_true = [1, 1, 1, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 0, 0]
    print(f"  F1 = {f1_score(y_true, y_pred):.4f} (perfect)")

    # Case 2: Zero recall (no true positives found)
    print("\nCase 2: Zero recall (all positives missed)")
    y_true = [1, 1, 1, 0, 0, 0]
    y_pred = [0, 0, 0, 0, 0, 0]
    print(f"  F1 = {f1_score(y_true, y_pred):.4f} (zero - recall is 0)")

    # Case 3: Zero precision (all predictions wrong)
    print("\nCase 3: Zero precision (no correct positive predictions)")
    y_true = [0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 0, 0]
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        print(f"  F1 = {f1_score(y_true, y_pred, zero_division=0):.4f} (zero - precision is 0)")

    # Case 4: No predictions and no positives (undefined)
    print("\nCase 4: No positive predictions or actual positives")
    y_true = [0, 0, 0]
    y_pred = [0, 0, 0]
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        print(f"  F1 = {f1_score(y_true, y_pred, zero_division=0):.4f} (undefined → set to 0)")
        print(f"  F1 = {f1_score(y_true, y_pred, zero_division=1):.4f} (undefined → set to 1)")

    # Case 5: Single correct prediction
    print("\nCase 5: Single true positive")
    y_true = [1]*100 + [0]*900
    y_pred = [1]*1 + [0]*99 + [0]*900  # One correct positive prediction
    f1 = f1_score(y_true, y_pred)
    print(f"  Precision = 1/1 = 100%")
    print(f"  Recall = 1/100 = 1%")
    print(f"  F1 = {f1:.4f} (low due to terrible recall)")

    # Case 6: Effect of threshold
    print("\nCase 6: Threshold effects")
    y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
    y_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.5, 0.3, 0.2, 0.1, 0.05]
    print(f"  {'Threshold':>10} {'Precision':>10} {'Recall':>8} {'F1':>8}")
    for thresh in [0.3, 0.5, 0.7, 0.9]:
        y_pred = [1 if s >= thresh else 0 for s in y_scores]
        p = precision_score(y_true, y_pred, zero_division=0)
        r = recall_score(y_true, y_pred, zero_division=0)
        f = f1_score(y_true, y_pred, zero_division=0)
        print(f"  {thresh:>10.1f} {p:>10.2%} {r:>8.0%} {f:>8.4f}")

explore_f1_edge_cases()
```

For multi-class problems, F-scores must be computed per-class and then aggregated. The aggregation strategy significantly affects the final score and its interpretation.
Per-Class F-scores:
For class $c$ treated as the positive class (one-vs-rest):
$$F_1^{(c)} = \frac{2 \cdot TP_c}{2 \cdot TP_c + FP_c + FN_c}$$
Aggregation Strategies:
| Method | Formula | Properties |
|---|---|---|
| Macro | $\frac{1}{K}\sum_{c=1}^K F_1^{(c)}$ | Treats all classes equally; sensitive to rare classes |
| Weighted | $\sum_{c=1}^K \frac{n_c}{n} F_1^{(c)}$ | Weights by class prevalence; reflects overall quality |
| Micro | $\frac{2 \sum_c TP_c}{2\sum_c TP_c + \sum_c FP_c + \sum_c FN_c}$ | Global aggregation; equals accuracy for multi-class |
```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_fscore_support, confusion_matrix

def multiclass_fscore_analysis():
    """
    Demonstrate multi-class F-score calculation and aggregation strategies.
    """
    # Imbalanced 4-class problem
    class_names = ['A', 'B', 'C', 'D']
    y_true = [0]*100 + [1]*50 + [2]*30 + [3]*20  # 100, 50, 30, 20

    # Predictions with varying per-class performance
    y_pred = (
        [0]*85 + [1]*10 + [2]*3 + [3]*2 +   # Class A: 85/100 = 0.85 recall
        [1]*40 + [0]*5 + [2]*3 + [3]*2 +    # Class B: 40/50 = 0.80 recall
        [2]*20 + [0]*5 + [1]*3 + [3]*2 +    # Class C: 20/30 = 0.67 recall
        [3]*10 + [0]*5 + [1]*3 + [2]*2      # Class D: 10/20 = 0.50 recall
    )

    cm = confusion_matrix(y_true, y_pred)

    print("Multi-Class F-score Analysis")
    print("=" * 60)
    print(f"\nClass distribution: A={100}, B={50}, C={30}, D={20}")
    print(f"\nConfusion Matrix:")
    print(f"          Predicted")
    print(f"           A   B   C   D")
    for i, cls in enumerate(class_names):
        print(f"{cls:>8} {cm[i, 0]:3d} {cm[i, 1]:3d} {cm[i, 2]:3d} {cm[i, 3]:3d}")

    # Per-class F1 scores
    precision, recall, f1, support = precision_recall_fscore_support(y_true, y_pred)

    print(f"\nPer-Class Metrics:")
    print(f"{'Class':>8} {'n':>6} {'Precision':>10} {'Recall':>10} {'F1':>10}")
    print("-" * 50)
    for i, cls in enumerate(class_names):
        print(f"{cls:>8} {support[i]:>6} {precision[i]:>10.3f} {recall[i]:>10.3f} {f1[i]:>10.3f}")

    # Aggregated F1 scores
    f1_macro = f1_score(y_true, y_pred, average='macro')
    f1_weighted = f1_score(y_true, y_pred, average='weighted')
    f1_micro = f1_score(y_true, y_pred, average='micro')

    print(f"\nAggregated F1 Scores:")
    print(f"  Macro F1:    {f1_macro:.4f} (simple average of per-class F1)")
    print(f"  Weighted F1: {f1_weighted:.4f} (weighted by class frequency)")
    print(f"  Micro F1:    {f1_micro:.4f} (global TP/FP/FN aggregation)")

    # Manual verification of macro
    manual_macro = np.mean(f1)
    print(f"\nVerification: Manual macro = {manual_macro:.4f}")

    # Manual verification of weighted
    weights = support / support.sum()
    manual_weighted = np.sum(weights * f1)
    print(f"Verification: Manual weighted = {manual_weighted:.4f}")

multiclass_fscore_analysis()
```

Use macro when all classes are equally important regardless of frequency. Use weighted when you want the score to reflect overall prediction quality. Use micro for a global view (note: micro F1 = micro precision = micro recall = accuracy for multi-class single-label problems).
One common use of F-scores is finding the optimal classification threshold. Rather than using the default 0.5 threshold, we can sweep through thresholds and select the one that maximizes F-score.
The Process:
1. Obtain predicted scores or probabilities on a validation set.
2. Compute precision and recall at every candidate threshold (for example, with sklearn's precision_recall_curve).
3. Compute the F-score at each threshold.
4. Select the threshold that maximizes the F-score and note the precision and recall it implies.
Important Considerations:
Optimize the threshold on validation data, not on the test set—a threshold tuned to maximize F-score on the same data used for final evaluation will overstate performance. After selecting a threshold, inspect the resulting precision and recall to confirm the operating point is acceptable for the application.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, f1_score

def optimize_threshold_for_f1(y_true, y_scores):
    """
    Find the optimal classification threshold that maximizes F1 score.
    """
    # Get precision, recall at all thresholds
    precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

    # Calculate F1 at each threshold
    # Note: precision_recall_curve returns n+1 precision/recall values
    # for n threshold values, so we need to handle the length mismatch
    f1_scores = []
    for i in range(len(thresholds)):
        if precision[i] + recall[i] > 0:
            f1 = 2 * precision[i] * recall[i] / (precision[i] + recall[i])
        else:
            f1 = 0
        f1_scores.append(f1)
    f1_scores = np.array(f1_scores)

    # Find optimal threshold
    optimal_idx = np.argmax(f1_scores)
    optimal_threshold = thresholds[optimal_idx]
    optimal_f1 = f1_scores[optimal_idx]

    return {
        'thresholds': thresholds,
        'f1_scores': f1_scores,
        'precision': precision[:-1],
        'recall': recall[:-1],
        'optimal_threshold': optimal_threshold,
        'optimal_f1': optimal_f1,
        'optimal_precision': precision[optimal_idx],
        'optimal_recall': recall[optimal_idx],
    }

# Generate synthetic example
np.random.seed(42)
n_pos, n_neg = 100, 400  # Imbalanced: 20% positive

# Positive class has higher scores
scores_pos = np.clip(np.random.normal(0.65, 0.2, n_pos), 0, 1)
scores_neg = np.clip(np.random.normal(0.35, 0.2, n_neg), 0, 1)

y_true = np.array([1]*n_pos + [0]*n_neg)
y_scores = np.concatenate([scores_pos, scores_neg])

# Find optimal threshold
result = optimize_threshold_for_f1(y_true, y_scores)

print("Threshold Optimization for F1 Score")
print("=" * 50)
print(f"\nDataset: {n_pos} positives, {n_neg} negatives ({n_pos/(n_pos+n_neg):.1%} positive)")

# Compare default vs optimal threshold
default_preds = (y_scores >= 0.5).astype(int)
optimal_preds = (y_scores >= result['optimal_threshold']).astype(int)

f1_default = f1_score(y_true, default_preds)
f1_optimal = result['optimal_f1']

print(f"\nDefault threshold (0.5):")
print(f"  F1 = {f1_default:.4f}")

print(f"\nOptimal threshold ({result['optimal_threshold']:.3f}):")
print(f"  F1 = {f1_optimal:.4f}")
print(f"  Precision = {result['optimal_precision']:.3f}")
print(f"  Recall = {result['optimal_recall']:.3f}")

print(f"\nImprovement: {(f1_optimal - f1_default) / f1_default * 100:.1f}%")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: F1 vs threshold
axes[0].plot(result['thresholds'], result['f1_scores'], 'b-', linewidth=2)
axes[0].axvline(x=0.5, color='gray', linestyle='--', label='Default (0.5)')
axes[0].axvline(x=result['optimal_threshold'], color='red', linestyle='-', label=f'Optimal ({result["optimal_threshold"]:.2f})')
axes[0].scatter([result['optimal_threshold']], [result['optimal_f1']], color='red', s=100, zorder=5)
axes[0].set_xlabel('Decision Threshold', fontsize=12)
axes[0].set_ylabel('F1 Score', fontsize=12)
axes[0].set_title('F1 Score vs Classification Threshold', fontsize=14)
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Right: Precision, Recall, F1 vs threshold
axes[1].plot(result['thresholds'], result['precision'], 'g-', linewidth=2, label='Precision')
axes[1].plot(result['thresholds'], result['recall'], 'r-', linewidth=2, label='Recall')
axes[1].plot(result['thresholds'], result['f1_scores'], 'b-', linewidth=2, label='F1')
axes[1].axvline(x=result['optimal_threshold'], color='blue', linestyle='--', alpha=0.5)
axes[1].set_xlabel('Decision Threshold', fontsize=12)
axes[1].set_ylabel('Score', fontsize=12)
axes[1].set_title('All Metrics vs Threshold', fontsize=14)
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('threshold_optimization.png', dpi=150)
plt.show()
```

While F-scores are widely used, they have important limitations that practitioners should understand: they ignore true negatives entirely, they evaluate the classifier at a single decision threshold, and they compress the precision-recall trade-off into one number that hides which side of the trade-off a model favors.
For a TN-sensitive metric, consider Matthews Correlation Coefficient (MCC). For threshold-independent evaluation, use Area Under the Precision-Recall Curve (AUPRC). For probabilistic outputs, consider log loss or Brier score. The best metric depends on your specific application requirements.
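The following sketch (an added example with a synthetic dataset invented for illustration) shows how these alternatives can be computed side by side with scikit-learn; average_precision_score is used as a standard summary of the precision-recall curve:

```python
import numpy as np
from sklearn.metrics import f1_score, matthews_corrcoef, average_precision_score

rng = np.random.default_rng(0)

# Heavily imbalanced toy problem: 5% positive
y_true = np.array([1] * 50 + [0] * 950)
y_scores = np.concatenate([
    rng.normal(0.7, 0.15, 50),    # positives tend to score higher
    rng.normal(0.3, 0.15, 950),   # negatives tend to score lower
]).clip(0, 1)
y_pred = (y_scores >= 0.5).astype(int)

print(f"F1    = {f1_score(y_true, y_pred):.3f}")                   # threshold-dependent, ignores TN
print(f"MCC   = {matthews_corrcoef(y_true, y_pred):.3f}")           # uses all four confusion-matrix cells
print(f"AUPRC = {average_precision_score(y_true, y_scores):.3f}")   # threshold-independent summary
```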
The F1 score and its F-beta generalizations provide powerful tools for single-number classifier evaluation that respects the precision-recall trade-off.
What's Next:
The final page in this module covers Specificity and Sensitivity—complementary metrics that describe performance from the perspective of each actual class, completing our understanding of confusion-matrix-derived classification metrics.
You now have a deep understanding of F-scores—when to use them, how to compute them, and crucially, their limitations. This knowledge enables you to move beyond naive accuracy optimization toward metrics that reflect true classifier quality.