Precision and recall provide valuable separate perspectives on classifier performance, but practitioners often need a single number to compare models, select hyperparameters, or communicate performance to stakeholders. The challenge is combining these metrics in a meaningful way.
The F-score family provides this combination. The F1 score—the most widely used member of this family—is the harmonic mean of precision and recall, offering a balanced view of both metrics. The generalized F-beta score extends this concept to allow explicit control over the precision-recall trade-off, enabling domain-specific weighting.
Understanding why the harmonic mean (rather than the arithmetic mean) is used, how beta parameter selection affects the metric, and when F-scores are appropriate is essential knowledge for any ML practitioner.
By the end of this page, you will understand the mathematical derivation of F-scores, the properties that make the harmonic mean appropriate, how to select beta for your application, F-score behavior in edge cases, and multi-class aggregation strategies.
Why do we need a combined metric at all? Consider these practical scenarios:
Model Selection: You have five candidate models with the following precision-recall profiles:
| Model | Precision | Recall |
|---|---|---|
| A | 0.90 | 0.60 |
| B | 0.80 | 0.80 |
| C | 0.70 | 0.85 |
| D | 0.85 | 0.75 |
| E | 0.95 | 0.50 |
Which model is 'best'? Without a combined metric, there's no principled way to rank these models or select one for deployment.
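As a quick illustration, the sketch below (an added example, not part of the original material) ranks these five hypothetical models by F1 score, the combined metric defined later on this page:

```python
# Hypothetical precision/recall values copied from the table above
models = {
    'A': (0.90, 0.60),
    'B': (0.80, 0.80),
    'C': (0.70, 0.85),
    'D': (0.85, 0.75),
    'E': (0.95, 0.50),
}

# F1 = harmonic mean of precision and recall
f1 = {name: 2 * p * r / (p + r) for name, (p, r) in models.items()}

for name, score in sorted(f1.items(), key=lambda kv: kv[1], reverse=True):
    p, r = models[name]
    print(f"Model {name}: precision={p:.2f}, recall={r:.2f}, F1={score:.3f}")
```

Under F1, model B (0.800) narrowly edges out D (≈0.797), while E, despite having the highest precision, ranks last—exactly the kind of principled ordering a combined metric provides.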
Hyperparameter Optimization: During grid search or Bayesian optimization, you need a single objective function to maximize. Optimizing precision and recall simultaneously (multi-objective optimization) is more complex and often unnecessary.
Threshold Calibration: When selecting an operating threshold, you need a criterion. Maximizing F-score at the optimal threshold is a common and principled approach.
Communication: Stakeholders often want a single 'performance number' rather than multiple metrics that may conflict in their implications.
Any single-number summary loses information. The F-score compresses the 2D precision-recall space into 1D, hiding trade-off choices. Always examine precision and recall separately after using F-score for model selection or optimization.
The F1 score is defined as the harmonic mean of precision and recall:
$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}$$
The second form is derived by substituting the definitions of precision and recall:
$$F_1 = 2 \cdot \frac{\frac{TP}{TP+FP} \cdot \frac{TP}{TP+FN}}{\frac{TP}{TP+FP} + \frac{TP}{TP+FN}} = \frac{2 \cdot TP \cdot TP}{TP(TP+FN) + TP(TP+FP)} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}$$
Key Properties:
- F1 ranges from 0 to 1 and reaches 1 only when precision and recall are both perfect.
- If either precision or recall is 0, F1 is 0.
- F1 is symmetric in precision and recall—swapping the two leaves the score unchanged.
- True negatives do not appear anywhere in the formula, so F1 is insensitive to them.
```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score, confusion_matrix

def compute_f1_from_components(y_true, y_pred):
    """
    Calculate F1 score from precision and recall, and directly from the confusion matrix.
    """
    cm = confusion_matrix(y_true, y_pred)
    TN, FP, FN, TP = cm.ravel()

    precision = TP / (TP + FP) if (TP + FP) > 0 else 0
    recall = TP / (TP + FN) if (TP + FN) > 0 else 0

    # Method 1: Harmonic mean formula
    if precision + recall > 0:
        f1_harmonic = 2 * precision * recall / (precision + recall)
    else:
        f1_harmonic = 0

    # Method 2: Direct formula from confusion matrix
    f1_direct = 2 * TP / (2 * TP + FP + FN) if (2 * TP + FP + FN) > 0 else 0

    # Method 3: sklearn
    f1_sklearn = f1_score(y_true, y_pred)

    return {
        'precision': precision,
        'recall': recall,
        'f1_harmonic': f1_harmonic,
        'f1_direct': f1_direct,
        'f1_sklearn': f1_sklearn,
        'TP': TP, 'FP': FP, 'FN': FN, 'TN': TN,
    }

# Example
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

result = compute_f1_from_components(y_true, y_pred)

print("F1 Score Calculation Demonstration")
print("=" * 50)
print(f"\nConfusion Matrix: TP={result['TP']}, FP={result['FP']}, FN={result['FN']}, TN={result['TN']}")
print(f"Precision: {result['precision']:.4f}")
print(f"Recall: {result['recall']:.4f}")
print(f"\nF1 Score (harmonic mean): {result['f1_harmonic']:.4f}")
print(f"F1 Score (direct formula): {result['f1_direct']:.4f}")
print(f"F1 Score (sklearn): {result['f1_sklearn']:.4f}")
print(f"\nAll three methods give identical results: {np.allclose([result['f1_harmonic']], [result['f1_direct']])}")
```

The choice of harmonic mean over arithmetic or geometric mean is deliberate and important.
Comparison of Means:
For two values $a$ and $b$:
- Arithmetic mean: $\frac{a + b}{2}$
- Geometric mean: $\sqrt{a \cdot b}$
- Harmonic mean: $\frac{2ab}{a + b}$
The Ordering Property:
For any two positive values $a \neq b$:
$$\text{Harmonic} < \text{Geometric} < \text{Arithmetic}$$
Why This Matters for F1:
The harmonic mean is dominated by the smaller value. Consider precision = 0.9, recall = 0.1:
| Mean Type | Formula | Result |
|---|---|---|
| Arithmetic | (0.9 + 0.1) / 2 | 0.50 |
| Geometric | √(0.9 × 0.1) | 0.30 |
| Harmonic | 2×0.9×0.1 / (0.9+0.1) | 0.18 |
The harmonic mean (0.18) correctly reflects that this is a poor classifier—one metric being excellent cannot compensate for the other being terrible.
The harmonic mean gives 'veto power' to the lower value. If either precision or recall approaches zero, F1 approaches zero regardless of how good the other metric is. This prevents celebrating a model that excels at one aspect while completely failing at the other.
```python
import numpy as np
import matplotlib.pyplot as plt

def compare_means():
    """
    Visualize how different mean types behave with varying precision/recall.
    """
    # Fixed recall = 0.5, varying precision
    recall_fixed = 0.5
    precisions = np.linspace(0.01, 1, 100)

    arithmetic = (precisions + recall_fixed) / 2
    geometric = np.sqrt(precisions * recall_fixed)
    harmonic = 2 * precisions * recall_fixed / (precisions + recall_fixed)

    plt.figure(figsize=(12, 5))

    # Left plot: Different means
    plt.subplot(1, 2, 1)
    plt.plot(precisions, arithmetic, 'b-', linewidth=2, label='Arithmetic Mean')
    plt.plot(precisions, geometric, 'g-', linewidth=2, label='Geometric Mean')
    plt.plot(precisions, harmonic, 'r-', linewidth=2, label='Harmonic Mean (F1)')
    plt.axhline(y=recall_fixed, color='gray', linestyle='--', alpha=0.5, label=f'Recall = {recall_fixed}')
    plt.xlabel('Precision', fontsize=12)
    plt.ylabel('Combined Score', fontsize=12)
    plt.title('Different Means with Fixed Recall = 0.5', fontsize=14)
    plt.legend()
    plt.grid(True, alpha=0.3)

    # Right plot: F1 contours
    plt.subplot(1, 2, 2)
    p_grid, r_grid = np.meshgrid(np.linspace(0.01, 1, 100), np.linspace(0.01, 1, 100))
    f1_grid = 2 * p_grid * r_grid / (p_grid + r_grid)
    contour = plt.contourf(p_grid, r_grid, f1_grid, levels=20, cmap='viridis')
    plt.colorbar(contour, label='F1 Score')

    # Add iso-F1 curves
    for f1_target in [0.2, 0.4, 0.6, 0.8]:
        plt.contour(p_grid, r_grid, f1_grid, levels=[f1_target], colors='white', linestyles='--')

    plt.xlabel('Precision', fontsize=12)
    plt.ylabel('Recall', fontsize=12)
    plt.title('F1 Score Contours (Iso-F1 curves in white)', fontsize=14)

    plt.tight_layout()
    plt.savefig('harmonic_mean_properties.png', dpi=150)
    plt.show()

    # Numerical examples
    print("\nNumerical Comparison of Means")
    print("=" * 55)
    print(f"{'Precision':>10} {'Recall':>8} {'Arith':>8} {'Geom':>8} {'Harm(F1)':>10}")
    print("-" * 55)

    test_cases = [(0.9, 0.9), (0.9, 0.5), (0.9, 0.1), (0.5, 0.5), (0.1, 0.1)]
    for p, r in test_cases:
        a = (p + r) / 2
        g = np.sqrt(p * r)
        h = 2 * p * r / (p + r)
        print(f"{p:>10.1f} {r:>8.1f} {a:>8.2f} {g:>8.2f} {h:>10.2f}")

compare_means()
```

The F1 score weights precision and recall equally, but this isn't always desired. The F-beta score generalizes F1 to allow explicit control over the precision-recall trade-off:
$$F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$
Interpretation of Beta:
The beta parameter expresses how many times more important recall is than precision: $\beta < 1$ weights precision more heavily, $\beta = 1$ weights them equally, and $\beta > 1$ weights recall more heavily.
Common F-beta Variants:
| Score | Beta | Interpretation | Use Case |
|---|---|---|---|
| F0.5 | 0.5 | Precision twice as important as recall | Spam filtering, recommendations |
| F1 | 1.0 | Equal importance | General-purpose, balanced |
| F2 | 2.0 | Recall twice as important as precision | Medical screening, security |
Mathematical Intuition:
The F-beta score can be rewritten as:
$$F_\beta = \frac{(1 + \beta^2) \cdot TP}{(1 + \beta^2) \cdot TP + \beta^2 \cdot FN + FP}$$
The $\beta^2$ factor scales the penalty for False Negatives relative to False Positives.
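For instance, substituting $\beta = 2$ into this form (a direct substitution, added here for illustration) gives:

$$F_2 = \frac{5 \cdot TP}{5 \cdot TP + 4 \cdot FN + FP}$$

so each false negative lowers the F2 score as much as four false positives; for F0.5 the weighting is reversed, with each false positive counting four times as much as a false negative.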
```python
import numpy as np
from sklearn.metrics import fbeta_score, precision_score, recall_score
import matplotlib.pyplot as plt

def explore_fbeta(y_true, y_pred):
    """
    Explore how F-beta changes with different beta values.
    """
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)

    print("F-beta Score Exploration")
    print("=" * 60)
    print(f"Precision: {precision:.3f}")
    print(f"Recall: {recall:.3f}")
    print(f"\n{'Beta':>6} {'F-beta':>10} {'Interpretation':<30}")
    print("-" * 60)

    betas = [0.25, 0.5, 1.0, 2.0, 4.0]
    for beta in betas:
        f_beta = fbeta_score(y_true, y_pred, beta=beta)
        if beta < 1:
            interp = f"Precision {1/beta:.1f}x more important"
        elif beta > 1:
            interp = f"Recall {beta:.1f}x more important"
        else:
            interp = "Equal weight (F1)"
        print(f"{beta:>6.2f} {f_beta:>10.4f} {interp:<30}")

    return precision, recall

def visualize_fbeta_curves():
    """
    Visualize iso-F-beta curves for different beta values.
    """
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    betas = [0.5, 1.0, 2.0]
    titles = ['F0.5 (Precision-focused)', 'F1 (Balanced)', 'F2 (Recall-focused)']

    for ax, beta, title in zip(axes, betas, titles):
        p_grid, r_grid = np.meshgrid(np.linspace(0.01, 1, 100), np.linspace(0.01, 1, 100))
        f_grid = (1 + beta**2) * p_grid * r_grid / (beta**2 * p_grid + r_grid)

        contour = ax.contourf(p_grid, r_grid, f_grid, levels=20, cmap='viridis')
        plt.colorbar(contour, ax=ax, label=f'F{beta}')

        # Mark the P + R = 1 trade-off line
        ax.plot([0, 1], [1, 0], 'w--', alpha=0.5, label='P + R = 1')

        ax.set_xlabel('Precision', fontsize=12)
        ax.set_ylabel('Recall', fontsize=12)
        ax.set_title(title, fontsize=14)

    plt.tight_layout()
    plt.savefig('fbeta_contours.png', dpi=150)
    plt.show()

# Example: Model with high precision, lower recall
y_true = [1]*100 + [0]*900
y_pred = [1]*60 + [0]*40 + [0]*880 + [1]*20  # 60/100 recall, 60/80 precision

precision, recall = explore_fbeta(y_true, y_pred)

print(f"\nKey Insight:")
print(f"  With Precision={precision:.2%} and Recall={recall:.2%}:")
print(f"  - F0.5 rewards the higher precision")
print(f"  - F2 would reward higher recall (this model scores lower)")

visualize_fbeta_curves()
```

The beta parameter should be chosen based on domain requirements, not tuned to maximize the score. If recall is twice as important as precision in your application (e.g., missing a disease is twice as bad as a false alarm), use β=2. Choose beta before seeing results, not after.
Understanding the mathematical properties and behavior at boundaries is crucial for correct interpretation.
Range and Bounds:
$$0 \leq F_\beta \leq 1$$
Monotonicity:
$F_\beta$ is strictly increasing in both precision and recall—improving either one while holding the other fixed always increases the score.
Related Metrics:
$F_\beta$ can equivalently be written as a weighted harmonic mean of precision and recall, which follows from taking reciprocals of the definition above:
$$\frac{1}{F_\beta} = \frac{1}{1 + \beta^2} \cdot \frac{1}{\text{Precision}} + \frac{\beta^2}{1 + \beta^2} \cdot \frac{1}{\text{Recall}}$$
Dice Coefficient Connection:
The F1 score is equivalent to the Dice coefficient (or Sørensen-Dice coefficient) used in image segmentation:
$$\text{Dice} = \frac{2|A \cap B|}{|A| + |B|} = F_1$$
where A is the set of predicted positives and B is the set of actual positives.
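The small sketch below (an added check, not from the original text) verifies this equivalence on a toy prediction by treating predicted and actual positives as index sets:

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0])

A = set(np.flatnonzero(y_pred == 1))  # predicted positives
B = set(np.flatnonzero(y_true == 1))  # actual positives

dice = 2 * len(A & B) / (len(A) + len(B))
print(f"Dice = {dice:.4f}")                      # 2*2 / (3+4) = 0.5714
print(f"F1   = {f1_score(y_true, y_pred):.4f}")  # identical value
```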
```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score
import warnings

def explore_f1_edge_cases():
    """
    Demonstrate edge cases and boundary behavior of F1 score.
    """
    print("F1 Score Edge Cases and Boundary Behavior")
    print("=" * 55)

    # Case 1: Perfect classifier
    print("\nCase 1: Perfect classifier")
    y_true = [1, 1, 1, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 0, 0]
    print(f"  F1 = {f1_score(y_true, y_pred):.4f} (perfect)")

    # Case 2: Zero recall (no true positives found)
    print("\nCase 2: Zero recall (all positives missed)")
    y_true = [1, 1, 1, 0, 0, 0]
    y_pred = [0, 0, 0, 0, 0, 0]
    print(f"  F1 = {f1_score(y_true, y_pred):.4f} (zero - recall is 0)")

    # Case 3: Zero precision (all predictions wrong)
    print("\nCase 3: Zero precision (no correct positive predictions)")
    y_true = [0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 0, 0]
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        print(f"  F1 = {f1_score(y_true, y_pred, zero_division=0):.4f} (zero - precision is 0)")

    # Case 4: No predictions and no positives (undefined)
    print("\nCase 4: No positive predictions or actual positives")
    y_true = [0, 0, 0]
    y_pred = [0, 0, 0]
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        print(f"  F1 = {f1_score(y_true, y_pred, zero_division=0):.4f} (undefined → set to 0)")
        print(f"  F1 = {f1_score(y_true, y_pred, zero_division=1):.4f} (undefined → set to 1)")

    # Case 5: Single correct prediction
    print("\nCase 5: Single true positive")
    y_true = [1]*100 + [0]*900
    y_pred = [1]*1 + [0]*99 + [0]*900  # One correct positive prediction
    f1 = f1_score(y_true, y_pred)
    print(f"  Precision = 1/1 = 100%")
    print(f"  Recall = 1/100 = 1%")
    print(f"  F1 = {f1:.4f} (low due to terrible recall)")

    # Case 6: Effect of threshold
    print("\nCase 6: Threshold effects")
    y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
    y_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.5, 0.3, 0.2, 0.1, 0.05]
    print(f"  {'Threshold':>10} {'Precision':>10} {'Recall':>8} {'F1':>8}")
    for thresh in [0.3, 0.5, 0.7, 0.9]:
        y_pred = [1 if s >= thresh else 0 for s in y_scores]
        p = precision_score(y_true, y_pred, zero_division=0)
        r = recall_score(y_true, y_pred, zero_division=0)
        f = f1_score(y_true, y_pred, zero_division=0)
        print(f"  {thresh:>10.1f} {p:>10.2%} {r:>8.0%} {f:>8.4f}")

explore_f1_edge_cases()
```

For multi-class problems, F-scores must be computed per-class and then aggregated. The aggregation strategy significantly affects the final score and its interpretation.
Per-Class F-scores:
For class $c$ treated as the positive class (one-vs-rest):
$$F_1^{(c)} = \frac{2 \cdot TP_c}{2 \cdot TP_c + FP_c + FN_c}$$
Aggregation Strategies:
| Method | Formula | Properties |
|---|---|---|
| Macro | $\frac{1}{K}\sum_{c=1}^K F_1^{(c)}$ | Treats all classes equally; sensitive to rare classes |
| Weighted | $\sum_{c=1}^K \frac{n_c}{n} F_1^{(c)}$ | Weights by class prevalence; reflects overall quality |
| Micro | $\frac{2 \sum_c TP_c}{2\sum_c TP_c + \sum_c FP_c + \sum_c FN_c}$ | Global aggregation; equals accuracy for multi-class |
```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_fscore_support, confusion_matrix

def multiclass_fscore_analysis():
    """
    Demonstrate multi-class F-score calculation and aggregation strategies.
    """
    # Imbalanced 4-class problem
    class_names = ['A', 'B', 'C', 'D']
    y_true = [0]*100 + [1]*50 + [2]*30 + [3]*20  # 100, 50, 30, 20

    # Predictions with varying per-class performance
    y_pred = (
        [0]*85 + [1]*10 + [2]*3 + [3]*2 +   # Class A: 85/100 = 0.85 recall
        [1]*40 + [0]*5 + [2]*3 + [3]*2 +    # Class B: 40/50 = 0.80 recall
        [2]*20 + [0]*5 + [1]*3 + [3]*2 +    # Class C: 20/30 = 0.67 recall
        [3]*10 + [0]*5 + [1]*3 + [2]*2      # Class D: 10/20 = 0.50 recall
    )

    cm = confusion_matrix(y_true, y_pred)

    print("Multi-Class F-score Analysis")
    print("=" * 60)
    print(f"\nClass distribution: A={100}, B={50}, C={30}, D={20}")
    print(f"\nConfusion Matrix:")
    print(f"          Predicted")
    print(f"           A   B   C   D")
    for i, cls in enumerate(class_names):
        print(f"{cls:>8} {cm[i, 0]:3d} {cm[i, 1]:3d} {cm[i, 2]:3d} {cm[i, 3]:3d}")

    # Per-class F1 scores
    precision, recall, f1, support = precision_recall_fscore_support(y_true, y_pred)

    print(f"\nPer-Class Metrics:")
    print(f"{'Class':>8} {'n':>6} {'Precision':>10} {'Recall':>10} {'F1':>10}")
    print("-" * 50)
    for i, cls in enumerate(class_names):
        print(f"{cls:>8} {support[i]:>6} {precision[i]:>10.3f} {recall[i]:>10.3f} {f1[i]:>10.3f}")

    # Aggregated F1 scores
    f1_macro = f1_score(y_true, y_pred, average='macro')
    f1_weighted = f1_score(y_true, y_pred, average='weighted')
    f1_micro = f1_score(y_true, y_pred, average='micro')

    print(f"\nAggregated F1 Scores:")
    print(f"  Macro F1:    {f1_macro:.4f} (simple average of per-class F1)")
    print(f"  Weighted F1: {f1_weighted:.4f} (weighted by class frequency)")
    print(f"  Micro F1:    {f1_micro:.4f} (global TP/FP/FN aggregation)")

    # Manual verification of macro
    manual_macro = np.mean(f1)
    print(f"\nVerification: Manual macro = {manual_macro:.4f}")

    # Manual verification of weighted
    weights = support / support.sum()
    manual_weighted = np.sum(weights * f1)
    print(f"Verification: Manual weighted = {manual_weighted:.4f}")

multiclass_fscore_analysis()
```

Use macro when all classes are equally important regardless of frequency. Use weighted when you want the score to reflect overall prediction quality. Use micro for a global view (note: micro F1 = micro precision = micro recall = accuracy for multi-class single-label problems).
One common use of F-scores is finding the optimal classification threshold. Rather than using the default 0.5 threshold, we can sweep through thresholds and select the one that maximizes F-score.
The Process:
1. Obtain predicted scores or probabilities on a validation set.
2. Compute precision and recall at every candidate threshold (for example, with sklearn's precision_recall_curve).
3. Compute the F-score at each threshold.
4. Select the threshold that maximizes the F-score and note the precision and recall it implies.
Important Considerations:
Optimize the threshold on validation data, not on the test set—a threshold tuned to maximize F-score on the same data used for final evaluation will overstate performance. After selecting a threshold, inspect the resulting precision and recall to confirm the operating point is acceptable for the application.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, f1_score

def optimize_threshold_for_f1(y_true, y_scores):
    """
    Find the optimal classification threshold that maximizes F1 score.
    """
    # Get precision, recall at all thresholds
    precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

    # Calculate F1 at each threshold
    # Note: precision_recall_curve returns n+1 precision/recall values
    # for n threshold values, so we need to handle the length mismatch
    f1_scores = []
    for i in range(len(thresholds)):
        if precision[i] + recall[i] > 0:
            f1 = 2 * precision[i] * recall[i] / (precision[i] + recall[i])
        else:
            f1 = 0
        f1_scores.append(f1)
    f1_scores = np.array(f1_scores)

    # Find optimal threshold
    optimal_idx = np.argmax(f1_scores)
    optimal_threshold = thresholds[optimal_idx]
    optimal_f1 = f1_scores[optimal_idx]

    return {
        'thresholds': thresholds,
        'f1_scores': f1_scores,
        'precision': precision[:-1],
        'recall': recall[:-1],
        'optimal_threshold': optimal_threshold,
        'optimal_f1': optimal_f1,
        'optimal_precision': precision[optimal_idx],
        'optimal_recall': recall[optimal_idx],
    }

# Generate synthetic example
np.random.seed(42)
n_pos, n_neg = 100, 400  # Imbalanced: 20% positive

# Positive class has higher scores
scores_pos = np.clip(np.random.normal(0.65, 0.2, n_pos), 0, 1)
scores_neg = np.clip(np.random.normal(0.35, 0.2, n_neg), 0, 1)

y_true = np.array([1]*n_pos + [0]*n_neg)
y_scores = np.concatenate([scores_pos, scores_neg])

# Find optimal threshold
result = optimize_threshold_for_f1(y_true, y_scores)

print("Threshold Optimization for F1 Score")
print("=" * 50)
print(f"\nDataset: {n_pos} positives, {n_neg} negatives ({n_pos/(n_pos+n_neg):.1%} positive)")

# Compare default vs optimal threshold
default_preds = (y_scores >= 0.5).astype(int)
optimal_preds = (y_scores >= result['optimal_threshold']).astype(int)

f1_default = f1_score(y_true, default_preds)
f1_optimal = result['optimal_f1']

print(f"\nDefault threshold (0.5):")
print(f"  F1 = {f1_default:.4f}")

print(f"\nOptimal threshold ({result['optimal_threshold']:.3f}):")
print(f"  F1 = {f1_optimal:.4f}")
print(f"  Precision = {result['optimal_precision']:.3f}")
print(f"  Recall = {result['optimal_recall']:.3f}")

print(f"\nImprovement: {(f1_optimal - f1_default) / f1_default * 100:.1f}%")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: F1 vs threshold
axes[0].plot(result['thresholds'], result['f1_scores'], 'b-', linewidth=2)
axes[0].axvline(x=0.5, color='gray', linestyle='--', label='Default (0.5)')
axes[0].axvline(x=result['optimal_threshold'], color='red', linestyle='-', label=f'Optimal ({result["optimal_threshold"]:.2f})')
axes[0].scatter([result['optimal_threshold']], [result['optimal_f1']], color='red', s=100, zorder=5)
axes[0].set_xlabel('Decision Threshold', fontsize=12)
axes[0].set_ylabel('F1 Score', fontsize=12)
axes[0].set_title('F1 Score vs Classification Threshold', fontsize=14)
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Right: Precision, Recall, F1 vs threshold
axes[1].plot(result['thresholds'], result['precision'], 'g-', linewidth=2, label='Precision')
axes[1].plot(result['thresholds'], result['recall'], 'r-', linewidth=2, label='Recall')
axes[1].plot(result['thresholds'], result['f1_scores'], 'b-', linewidth=2, label='F1')
axes[1].axvline(x=result['optimal_threshold'], color='blue', linestyle='--', alpha=0.5)
axes[1].set_xlabel('Decision Threshold', fontsize=12)
axes[1].set_ylabel('Score', fontsize=12)
axes[1].set_title('All Metrics vs Threshold', fontsize=14)
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('threshold_optimization.png', dpi=150)
plt.show()
```

While F-scores are widely used, they have important limitations that practitioners should understand: they ignore true negatives entirely, they evaluate the classifier at a single decision threshold, and they compress the precision-recall trade-off into one number that hides which side of the trade-off a model favors.
For a TN-sensitive metric, consider Matthews Correlation Coefficient (MCC). For threshold-independent evaluation, use Area Under the Precision-Recall Curve (AUPRC). For probabilistic outputs, consider log loss or Brier score. The best metric depends on your specific application requirements.
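The following sketch (an added example with a synthetic dataset invented for illustration) shows how these alternatives can be computed side by side with scikit-learn; average_precision_score is used as a standard summary of the precision-recall curve:

```python
import numpy as np
from sklearn.metrics import f1_score, matthews_corrcoef, average_precision_score

rng = np.random.default_rng(0)

# Heavily imbalanced toy problem: 5% positive
y_true = np.array([1] * 50 + [0] * 950)
y_scores = np.concatenate([
    rng.normal(0.7, 0.15, 50),    # positives tend to score higher
    rng.normal(0.3, 0.15, 950),   # negatives tend to score lower
]).clip(0, 1)
y_pred = (y_scores >= 0.5).astype(int)

print(f"F1    = {f1_score(y_true, y_pred):.3f}")                   # threshold-dependent, ignores TN
print(f"MCC   = {matthews_corrcoef(y_true, y_pred):.3f}")           # uses all four confusion-matrix cells
print(f"AUPRC = {average_precision_score(y_true, y_scores):.3f}")   # threshold-independent summary
```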
The F1 score and its F-beta generalizations provide powerful tools for single-number classifier evaluation that respects the precision-recall trade-off.
What's Next:
The final page in this module covers Specificity and Sensitivity—complementary metrics that describe performance from the perspective of each actual class, completing our understanding of confusion-matrix-derived classification metrics.
You now have a deep understanding of F-scores—when to use them, how to compute them, and crucially, their limitations. This knowledge enables you to move beyond naive accuracy optimization toward metrics that reflect true classifier quality.