Before a single metric can be calculated—before precision, recall, F1-score, or any other measure of classifier performance can be determined—there exists a fundamental structure that makes all of these calculations possible: the confusion matrix.
The confusion matrix is not merely a visualization tool or a pedagogical convenience. It is the complete accounting ledger of a classifier's predictions. Every prediction a model makes falls into exactly one cell of this matrix, and from this exhaustive categorization, all classification metrics can be derived. Understanding the confusion matrix deeply is therefore prerequisite to understanding any metric built upon it.
By the end of this page, you will understand the confusion matrix from first principles—its structure, its four fundamental cells, its generalizations to multi-class settings, and its role as the foundation for all classification metrics. You will be able to construct, interpret, and extract insights from confusion matrices in any classification context.
To understand the confusion matrix, we must first establish the fundamental nature of the classification task. In classification, a model receives an input $x \in \mathcal{X}$ and produces a predicted class label $\hat{y} \in \{1, 2, \ldots, K\}$, where $K$ is the number of classes. The true label $y \in \{1, 2, \ldots, K\}$ is known during evaluation.
The Evaluation Question:
Given a dataset of $n$ examples $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ and corresponding predictions $\{\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n\}$, how do we systematically characterize the classifier's performance?
The simplest approach—counting correct predictions—provides only a single number (accuracy) that often obscures critical information. The confusion matrix instead provides a complete decomposition of all predictions, enabling nuanced analysis of where and how a classifier succeeds or fails.
The name 'confusion matrix' reflects that it reveals the confusions made by a classifier—specifically, which classes are mistaken for which other classes. A perfect classifier produces a diagonal matrix (no confusion), while an imperfect classifier shows off-diagonal entries indicating systematic errors.
We begin with binary classification ($K = 2$), the most common and foundational case. By convention, the two classes are designated as the positive class (the class of interest, such as 'disease present' or 'spam') and the negative class (everything else).
Every prediction falls into exactly one of four categories based on the relationship between the predicted label $\hat{y}$ and the true label $y$:
| Outcome | Symbol | Predicted | Actual | Description |
|---|---|---|---|---|
| True Positive | TP | Positive | Positive | Correctly identified positive case |
| True Negative | TN | Negative | Negative | Correctly identified negative case |
| False Positive | FP | Positive | Negative | Incorrectly labeled negative as positive (Type I Error) |
| False Negative | FN | Negative | Positive | Incorrectly labeled positive as negative (Type II Error) |
These four quantities are mutually exclusive and collectively exhaustive. Every single prediction in your evaluation dataset belongs to exactly one of these categories:
$$n = TP + TN + FP + FN$$
where $n$ is the total number of predictions.
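To see that these four counts really do partition every prediction, it helps to compute them directly from elementwise comparisons before reaching for any library. The following minimal NumPy sketch uses hypothetical `y_true` and `y_pred` arrays and confirms that the four cells sum to $n$:

```python
import numpy as np

# Hypothetical labels and predictions (1 = positive, 0 = negative)
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])

# Each prediction falls into exactly one of the four cells
TP = np.sum((y_pred == 1) & (y_true == 1))  # predicted positive, actually positive
TN = np.sum((y_pred == 0) & (y_true == 0))  # predicted negative, actually negative
FP = np.sum((y_pred == 1) & (y_true == 0))  # predicted positive, actually negative
FN = np.sum((y_pred == 0) & (y_true == 1))  # predicted negative, actually positive

print(f"TP={TP}, TN={TN}, FP={FP}, FN={FN}")  # TP=3, TN=3, FP=1, FN=1
assert TP + TN + FP + FN == len(y_true)       # the four cells cover all n predictions
```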
The Confusion Matrix Structure:
These four outcomes are arranged in a 2×2 matrix with a specific convention:
$$\text{Confusion Matrix} = \begin{bmatrix} TN & FP \\ FN & TP \end{bmatrix}$$
Alternatively, and equivalently:
| | Predicted Negative | Predicted Positive |
|---|---|---|
| Actual Negative | TN | FP |
| Actual Positive | FN | TP |
Different textbooks and libraries use different conventions for row/column ordering (actual vs. predicted on rows/columns) and class ordering (positive class first vs. second). Always verify the convention used in your specific context. Scikit-learn, for example, places actual labels on rows and predicted labels on columns, with classes in sorted order.
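One practical safeguard, sketched below, is to pass the `labels` argument to scikit-learn's `confusion_matrix` so the class ordering is explicit rather than implied by sorted label values (the toy labels here are illustrative):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0]

# Default: classes in sorted order -> row/column 0 is label 0 (negative class)
cm_default = confusion_matrix(y_true, y_pred)                    # [[TN, FP], [FN, TP]]

# Explicit ordering: put the positive class (label 1) first
cm_pos_first = confusion_matrix(y_true, y_pred, labels=[1, 0])   # [[TP, FN], [FP, TN]]

print(cm_default)    # [[2 1], [1 2]]
print(cm_pos_first)  # [[2 1], [1 2]] but with rows/columns reordered as documented above
```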
The distinction between False Positives and False Negatives is not merely technical—it maps to fundamentally different kinds of errors with vastly different consequences depending on the application domain.
Type I Error (False Positive):
A False Positive occurs when the model incorrectly predicts the positive class for an instance that actually belongs to the negative class. The model has 'cried wolf'—it has flagged something as important or concerning when it was not.
Statistical interpretation: In hypothesis testing terms, this is rejecting the null hypothesis when it is actually true. The model claims to have detected an effect or condition that does not exist.
| Domain | What FP Means | Consequence |
|---|---|---|
| Medical Diagnosis | Healthy patient diagnosed with disease | Unnecessary treatment, psychological stress, wasted resources |
| Spam Detection | Legitimate email marked as spam | Important communication lost, user frustration |
| Fraud Detection | Legitimate transaction blocked | Customer inconvenience, lost sales, support costs |
| Security Screening | Innocent person flagged as threat | Delays, privacy violation, reputational harm |
| Manufacturing QC | Good product rejected | Wasted materials, reduced throughput, increased costs |
Type II Error (False Negative):
A False Negative occurs when the model incorrectly predicts the negative class for an instance that actually belongs to the positive class. The model has 'missed' something important—it has failed to detect a condition or event that was actually present.
Statistical interpretation: In hypothesis testing terms, this is failing to reject the null hypothesis when it is actually false. The model has failed to detect an effect or condition that genuinely exists.
| Domain | What FN Means | Consequence |
|---|---|---|
| Medical Diagnosis | Diseased patient cleared as healthy | Delayed treatment, disease progression, potential death |
| Spam Detection | Spam email allowed through | User annoyance, potential phishing exposure |
| Fraud Detection | Fraudulent transaction approved | Financial loss, regulatory penalties |
| Security Screening | Actual threat not detected | Security breach, potential casualties |
| Manufacturing QC | Defective product approved | Customer harm, recalls, liability |
In almost every real-world application, the costs of False Positives and False Negatives are asymmetric. Medical screening prioritizes avoiding False Negatives (missed diseases) even at the cost of more False Positives. Email filters might prioritize avoiding False Positives (lost important mail) over catching every spam message. Understanding this asymmetry is critical for choosing appropriate metrics and decision thresholds.
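When costs are asymmetric, the confusion matrix can be combined with a domain-specific cost assignment to compare classifiers by expected cost rather than by raw error counts. Below is a small sketch with purely hypothetical cost values for a medical-screening-style setting; the `expected_cost` helper and the two toy classifiers are illustrative, not a standard API:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical per-error costs: a missed disease (FN) is far costlier than a false alarm (FP)
COST_FP = 1.0   # e.g., one unnecessary follow-up test
COST_FN = 50.0  # e.g., a missed diagnosis
# Correct predictions incur no cost in this sketch.

def expected_cost(y_true, y_pred):
    TN, FP, FN, TP = confusion_matrix(y_true, y_pred).ravel()
    return (COST_FP * FP + COST_FN * FN) / (TN + FP + FN + TP)  # average cost per prediction

# Two hypothetical classifiers, each making exactly one error, but of different kinds
y_true   = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_a = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # 1 FN, 0 FP
y_pred_b = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # 0 FN, 1 FP

print(f"Classifier A expected cost: {expected_cost(y_true, y_pred_a):.2f}")  # 5.00
print(f"Classifier B expected cost: {expected_cost(y_true, y_pred_b):.2f}")  # 0.10
```

With these costs, the classifier that trades its false negative for a false positive is fifty times cheaper per error, even though both have identical accuracy.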
The confusion matrix encodes not just the four cell values but also meaningful marginal totals—the row sums and column sums—that represent important quantities:
Actual Class Totals (Row Sums):
$P = TP + FN$ (total actual positives) and $N = TN + FP$ (total actual negatives). These are fixed by the ground truth—they depend only on the dataset, not on the classifier.
Predicted Class Totals (Column Sums):
$PP = TP + FP$ (total predicted positives) and $PN = TN + FN$ (total predicted negatives). These depend on the classifier's behavior and threshold settings.
```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Example: Binary classification results
y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])  # 5 positives, 5 negatives
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])  # Model predictions

# Compute confusion matrix
# Note: sklearn convention - rows are actual, columns are predicted
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix (sklearn convention):")
print(cm)
# Output:
# [[3 2]   # Row 0: Actual Negative - [TN=3, FP=2]
#  [2 3]]  # Row 1: Actual Positive - [FN=2, TP=3]

# Extract the four fundamental quantities
TN, FP, FN, TP = cm.ravel()
print(f"\nTrue Positives (TP): {TP}")   # Correctly predicted positives
print(f"True Negatives (TN): {TN}")    # Correctly predicted negatives
print(f"False Positives (FP): {FP}")   # Type I errors
print(f"False Negatives (FN): {FN}")   # Type II errors

# Compute marginal totals
P = TP + FN   # Total actual positives
N = TN + FP   # Total actual negatives
PP = TP + FP  # Total predicted positives
PN = TN + FN  # Total predicted negatives

print(f"\nActual Positives (P): {P}")
print(f"Actual Negatives (N): {N}")
print(f"Predicted Positives (PP): {PP}")
print(f"Predicted Negatives (PN): {PN}")
print(f"Total samples (n): {P + N}")
```

Fundamental Rate Definitions:
From these quantities, we can define the fundamental rates that form the basis of all classification metrics:
| Rate | Formula | Also Known As | Meaning |
|---|---|---|---|
| True Positive Rate | $TPR = \frac{TP}{P} = \frac{TP}{TP + FN}$ | Sensitivity, Recall, Hit Rate | Proportion of actual positives correctly identified |
| True Negative Rate | $TNR = \frac{TN}{N} = \frac{TN}{TN + FP}$ | Specificity, Selectivity | Proportion of actual negatives correctly identified |
| False Positive Rate | $FPR = \frac{FP}{N} = \frac{FP}{FP + TN}$ | Fall-out, Type I Error Rate | Proportion of actual negatives incorrectly flagged |
| False Negative Rate | $FNR = \frac{FN}{P} = \frac{FN}{FN + TP}$ | Miss Rate, Type II Error Rate | Proportion of actual positives missed |
Note the complementary relationships: $TPR + FNR = 1$ and $TNR + FPR = 1$. Each pair partitions the same actual class, so knowing one rate determines the other.
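The first identity follows directly from the definitions (the second is symmetric, with $TN$ and $FP$ in place of $TP$ and $FN$):

$$TPR + FNR = \frac{TP}{TP + FN} + \frac{FN}{TP + FN} = \frac{TP + FN}{TP + FN} = 1$$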
When $K > 2$ (multi-class classification), the confusion matrix generalizes to a $K \times K$ structure. The entry $C_{ij}$ represents the number of instances with true class $i$ that were predicted as class $j$.
$$C = \begin{bmatrix} C_{11} & C_{12} & \cdots & C_{1K} \\ C_{21} & C_{22} & \cdots & C_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ C_{K1} & C_{K2} & \cdots & C_{KK} \end{bmatrix}$$
Key Properties:
The diagonal entries $C_{ii}$ count correct predictions for class $i$; off-diagonal entries $C_{ij}$ (with $i \neq j$) count instances of class $i$ misclassified as class $j$; row $i$ sums to the number of actual class-$i$ instances; column $j$ sums to the number of class-$j$ predictions; and all entries sum to $n$.
```python
import numpy as np
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Multi-class example: 4-class classification
classes = ['Cat', 'Dog', 'Bird', 'Fish']
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 0, 1, 2, 3, 0, 1, 2, 3]
y_pred = [0, 0, 1, 1, 1, 0, 2, 3, 2, 3, 3, 2, 0, 1, 2, 3, 1, 1, 2, 3]

# Compute multi-class confusion matrix
cm = confusion_matrix(y_true, y_pred)
print("Multi-class Confusion Matrix:")
print(cm)

# Analyze the confusion matrix
print("\nDetailed Analysis:")
for i, class_name in enumerate(classes):
    correct = cm[i, i]
    total_actual = cm[i, :].sum()
    total_predicted = cm[:, i].sum()
    print(f"\n{class_name}:")
    print(f"  Correct predictions: {correct}/{total_actual} ({100*correct/total_actual:.1f}%)")
    print(f"  Times predicted: {total_predicted}")
    # Show confusions
    for j, other_class in enumerate(classes):
        if i != j and cm[i, j] > 0:
            print(f"  Confused with {other_class}: {cm[i, j]} times")

# Calculate per-class metrics (treating each class as binary)
print("\nPer-class Binary Metrics:")
for i, class_name in enumerate(classes):
    # For class i: TP, FP, FN, TN
    TP = cm[i, i]
    FP = cm[:, i].sum() - TP      # Column sum minus diagonal
    FN = cm[i, :].sum() - TP      # Row sum minus diagonal
    TN = cm.sum() - TP - FP - FN  # Everything else

    precision = TP / (TP + FP) if (TP + FP) > 0 else 0
    recall = TP / (TP + FN) if (TP + FN) > 0 else 0
    print(f"{class_name}: TP={TP}, FP={FP}, FN={FN}, TN={TN}")
    print(f"  Precision: {precision:.3f}, Recall: {recall:.3f}")
```

When interpreting a K×K confusion matrix, look for patterns in the off-diagonal entries. Large values indicate systematic confusions—classes that the model frequently mistakes for each other. This often reveals semantic similarities (e.g., 'Dog' confused with 'Cat' more often than with 'Car') or data collection issues (e.g., mislabeled training examples).
Raw counts in a confusion matrix can be difficult to interpret, especially with class imbalance. Normalization transforms counts into proportions, enabling easier comparison and pattern recognition.
Three Normalization Strategies:
Row normalization (by true class): Divide each row by its sum. Each row sums to 1. Entry $C'_{ij} = C_{ij} / \sum_{k} C_{ik}$ represents the proportion of class $i$ instances predicted as class $j$. This shows recall per class on the diagonal.
Column normalization (by predicted class): Divide each column by its sum. Each column sums to 1. Entry $C'_{ij} = C_{ij} / \sum_{k} C_{kj}$ represents the proportion of class $j$ predictions that are actually class $i$. This shows precision per class on the diagonal.
Total normalization: Divide all entries by the total count. The entire matrix sums to 1. Entry $C'_{ij} = C_{ij} / n$ represents the proportion of all predictions that are true class $i$ predicted as class $j$.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Example with class imbalance
y_true = [0]*100 + [1]*50 + [2]*25  # Imbalanced: 100, 50, 25 samples
y_pred = [0]*85 + [1]*10 + [2]*5 + \
         [1]*40 + [0]*5 + [2]*5 + \
         [2]*20 + [0]*3 + [1]*2

cm = confusion_matrix(y_true, y_pred)

# Create figure with three normalizations
fig, axes = plt.subplots(1, 4, figsize=(20, 4))

# Raw counts
ConfusionMatrixDisplay(cm, display_labels=['Class 0', 'Class 1', 'Class 2']).plot(ax=axes[0])
axes[0].set_title('Raw Counts')

# Row-normalized (shows recall)
cm_row = cm.astype(float) / cm.sum(axis=1, keepdims=True)
ConfusionMatrixDisplay(cm_row, display_labels=['Class 0', 'Class 1', 'Class 2']).plot(ax=axes[1])
axes[1].set_title('Row Normalized (Recall)')

# Column-normalized (shows precision on diagonal)
cm_col = cm.astype(float) / cm.sum(axis=0, keepdims=True)
ConfusionMatrixDisplay(cm_col, display_labels=['Class 0', 'Class 1', 'Class 2']).plot(ax=axes[2])
axes[2].set_title('Column Normalized (Precision)')

# Overall-normalized
cm_total = cm.astype(float) / cm.sum()
ConfusionMatrixDisplay(cm_total, display_labels=['Class 0', 'Class 1', 'Class 2']).plot(ax=axes[3])
axes[3].set_title('Total Normalized')

plt.tight_layout()
plt.savefig('confusion_matrix_normalizations.png', dpi=150)
plt.show()

# Interpretation
print("Interpretation of Row-Normalized (Recall):")
for i, cls in enumerate(['Class 0', 'Class 1', 'Class 2']):
    print(f"  {cls} recall: {cm_row[i,i]:.1%} of actual {cls} correctly predicted")

print("\nInterpretation of Column-Normalized (Precision):")
for i, cls in enumerate(['Class 0', 'Class 1', 'Class 2']):
    print(f"  {cls} precision: {cm_col[i,i]:.1%} of {cls} predictions are correct")
```

Use color gradients (heatmaps) to make patterns visible at a glance. Annotate cells with both values and percentages when space permits. Order classes semantically when possible (e.g., severity levels 'Low→Medium→High') so that 'close' confusions appear near the diagonal. For large K, consider grouping similar classes or focusing on the most significant confusions.
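As a lighter-weight alternative to the manual division in the example above, recent versions of scikit-learn (0.22+) expose the same three normalizations through the `normalize` argument of `confusion_matrix`; a brief sketch using the same hypothetical data:

```python
from sklearn.metrics import confusion_matrix

y_true = [0]*100 + [1]*50 + [2]*25
y_pred = [0]*85 + [1]*10 + [2]*5 + \
         [1]*40 + [0]*5 + [2]*5 + \
         [2]*20 + [0]*3 + [1]*2

cm_row   = confusion_matrix(y_true, y_pred, normalize='true')  # rows sum to 1 (recall view)
cm_col   = confusion_matrix(y_true, y_pred, normalize='pred')  # columns sum to 1 (precision view)
cm_total = confusion_matrix(y_true, y_pred, normalize='all')   # whole matrix sums to 1

print(cm_row.diagonal())  # per-class recall
print(cm_col.diagonal())  # per-class precision
```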
The confusion matrix is the single source of truth from which all classification metrics can be derived. Understanding how each metric is built from the four cells is essential for choosing appropriate metrics and for recognizing what each one does and does not capture.
Here is the complete derivation of common metrics from the binary confusion matrix:
| Metric | Formula | Intuition |
|---|---|---|
| Accuracy | $(TP + TN) / n$ | Proportion of all predictions that are correct |
| Error Rate | $(FP + FN) / n$ | Proportion of all predictions that are wrong |
| Precision | $TP / (TP + FP)$ | Of predicted positives, how many are correct? |
| Recall (Sensitivity) | $TP / (TP + FN)$ | Of actual positives, how many did we find? |
| Specificity | $TN / (TN + FP)$ | Of actual negatives, how many did we correctly identify? |
| F1 Score | $2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$ | Harmonic mean of precision and recall |
| Balanced Accuracy | $(TPR + TNR) / 2$ | Average of sensitivity and specificity |
| Matthews Correlation Coefficient | $\frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$ | Correlation between predicted and actual |
```python
import numpy as np
from sklearn.metrics import confusion_matrix

def derive_all_metrics(y_true, y_pred):
    """
    Derive all common classification metrics from the confusion matrix.

    This demonstrates that the confusion matrix is the complete source
    of classification performance information.
    """
    cm = confusion_matrix(y_true, y_pred)
    TN, FP, FN, TP = cm.ravel()

    # Marginal totals
    P = TP + FN   # Actual positives
    N = TN + FP   # Actual negatives
    PP = TP + FP  # Predicted positives
    PN = TN + FN  # Predicted negatives
    n = P + N     # Total samples

    # Basic rates
    tpr = TP / P if P > 0 else 0  # True Positive Rate (Recall/Sensitivity)
    tnr = TN / N if N > 0 else 0  # True Negative Rate (Specificity)
    fpr = FP / N if N > 0 else 0  # False Positive Rate
    fnr = FN / P if P > 0 else 0  # False Negative Rate

    # Common metrics
    accuracy = (TP + TN) / n
    error_rate = (FP + FN) / n
    precision = TP / PP if PP > 0 else 0
    recall = tpr       # Same as TPR
    specificity = tnr  # Same as TNR

    # F1 Score (harmonic mean of precision and recall)
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

    # Balanced Accuracy
    balanced_acc = (tpr + tnr) / 2

    # Matthews Correlation Coefficient
    denom = np.sqrt((TP+FP) * (TP+FN) * (TN+FP) * (TN+FN))
    mcc = (TP*TN - FP*FN) / denom if denom > 0 else 0

    # Informedness (Youden's J) and Markedness
    informedness = tpr + tnr - 1  # Youden's J statistic
    markedness = precision + (TN / PN if PN > 0 else 0) - 1

    return {
        'Confusion Matrix': cm,
        'TP': TP, 'TN': TN, 'FP': FP, 'FN': FN,
        'Accuracy': accuracy,
        'Error Rate': error_rate,
        'Precision (PPV)': precision,
        'Recall (Sensitivity, TPR)': recall,
        'Specificity (TNR)': specificity,
        'F1 Score': f1,
        'Balanced Accuracy': balanced_acc,
        'MCC': mcc,
        'Informedness (Youden J)': informedness,
        'FPR': fpr,
        'FNR': fnr,
    }

# Example usage
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

metrics = derive_all_metrics(y_true, y_pred)
print("All Metrics Derived from Confusion Matrix:")
print("-" * 45)
for name, value in metrics.items():
    if isinstance(value, np.ndarray):
        print(f"{name}:\n{value}")
    else:
        print(f"{name}: {value:.4f}" if isinstance(value, float) else f"{name}: {value}")
```

Working with confusion matrices requires attention to several common sources of error and misinterpretation, most notably the row/column and class-ordering conventions used by your library and the way raw counts can obscure per-class performance under class imbalance.
We have established the confusion matrix as the foundational structure for all classification evaluation: its four fundamental cells (TP, TN, FP, FN), the asymmetric consequences of Type I and Type II errors, the marginal totals and rates built from them, the $K \times K$ multi-class generalization, normalization strategies for interpretation, and the derivation of every common metric from its entries.
What's Next:
With the confusion matrix as our foundation, we're ready to examine specific metrics in detail. The next page explores accuracy—the most intuitive metric—and critically examines its limitations, particularly in the presence of class imbalance.
You now understand the confusion matrix as the foundational structure for classification evaluation. This knowledge is prerequisite to understanding every metric we will examine in subsequent pages—each is simply a different way of summarizing the information contained in this fundamental matrix.