Before a single metric can be calculated—before precision, recall, F1-score, or any other measure of classifier performance can be determined—there exists a fundamental structure that makes all of these calculations possible: the confusion matrix.
The confusion matrix is not merely a visualization tool or a pedagogical convenience. It is the complete accounting ledger of a classifier's predictions. Every prediction a model makes falls into exactly one cell of this matrix, and from this exhaustive categorization, all classification metrics can be derived. Understanding the confusion matrix deeply is therefore prerequisite to understanding any metric built upon it.
By the end of this page, you will understand the confusion matrix from first principles—its structure, its four fundamental cells, its generalizations to multi-class settings, and its role as the foundation for all classification metrics. You will be able to construct, interpret, and extract insights from confusion matrices in any classification context.
To understand the confusion matrix, we must first establish the fundamental nature of the classification task. In classification, a model receives an input $x \in \mathcal{X}$ and produces a predicted class label $\hat{y} \in \{1, 2, \ldots, K\}$, where $K$ is the number of classes. The true label $y \in \{1, 2, \ldots, K\}$ is known during evaluation.
The Evaluation Question:
Given a dataset of $n$ examples $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ and corresponding predictions $\{\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n\}$, how do we systematically characterize the classifier's performance?
The simplest approach—counting correct predictions—provides only a single number (accuracy) that often obscures critical information. The confusion matrix instead provides a complete decomposition of all predictions, enabling nuanced analysis of where and how a classifier succeeds or fails.
The name 'confusion matrix' reflects that it reveals the confusions made by a classifier—specifically, which classes are mistaken for which other classes. A perfect classifier produces a diagonal matrix (no confusion), while an imperfect classifier shows off-diagonal entries indicating systematic errors.
We begin with binary classification ($K = 2$), the most common and foundational case. By convention, the two classes are designated as the positive class (the class of interest, such as 'disease present' or 'spam') and the negative class (everything else).
Every prediction falls into exactly one of four categories based on the relationship between the predicted label $\hat{y}$ and the true label $y$:
| Outcome | Symbol | Predicted | Actual | Description |
|---|---|---|---|---|
| True Positive | TP | Positive | Positive | Correctly identified positive case |
| True Negative | TN | Negative | Negative | Correctly identified negative case |
| False Positive | FP | Positive | Negative | Incorrectly labeled negative as positive (Type I Error) |
| False Negative | FN | Negative | Positive | Incorrectly labeled positive as negative (Type II Error) |
These four quantities are mutually exclusive and collectively exhaustive. Every single prediction in your evaluation dataset belongs to exactly one of these categories:
$$n = TP + TN + FP + FN$$
where $n$ is the total number of predictions.
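To see that these four counts really do partition every prediction, it helps to compute them directly from elementwise comparisons before reaching for any library. The following minimal NumPy sketch uses hypothetical `y_true` and `y_pred` arrays and confirms that the four cells sum to $n$:

```python
import numpy as np

# Hypothetical labels and predictions (1 = positive, 0 = negative)
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])

# Each prediction falls into exactly one of the four cells
TP = np.sum((y_pred == 1) & (y_true == 1))  # predicted positive, actually positive
TN = np.sum((y_pred == 0) & (y_true == 0))  # predicted negative, actually negative
FP = np.sum((y_pred == 1) & (y_true == 0))  # predicted positive, actually negative
FN = np.sum((y_pred == 0) & (y_true == 1))  # predicted negative, actually positive

print(f"TP={TP}, TN={TN}, FP={FP}, FN={FN}")  # TP=3, TN=3, FP=1, FN=1
assert TP + TN + FP + FN == len(y_true)       # the four cells cover all n predictions
```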
The Confusion Matrix Structure:
These four outcomes are arranged in a 2×2 matrix with a specific convention:
$$\text{Confusion Matrix} = \begin{bmatrix} TN & FP \\ FN & TP \end{bmatrix}$$
Alternatively, and equivalently:
| | Predicted Negative | Predicted Positive |
|---|---|---|
| Actual Negative | TN | FP |
| Actual Positive | FN | TP |
Different textbooks and libraries use different conventions for row/column ordering (actual vs. predicted on rows/columns) and class ordering (positive class first vs. second). Always verify the convention used in your specific context. Scikit-learn, for example, places actual labels on rows and predicted labels on columns, with classes in sorted order.
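One practical safeguard, sketched below, is to pass the `labels` argument to scikit-learn's `confusion_matrix` so the class ordering is explicit rather than implied by sorted label values (the toy labels here are illustrative):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0]

# Default: classes in sorted order -> row/column 0 is label 0 (negative class)
cm_default = confusion_matrix(y_true, y_pred)                    # [[TN, FP], [FN, TP]]

# Explicit ordering: put the positive class (label 1) first
cm_pos_first = confusion_matrix(y_true, y_pred, labels=[1, 0])   # [[TP, FN], [FP, TN]]

print(cm_default)    # [[2 1], [1 2]]
print(cm_pos_first)  # [[2 1], [1 2]] but with rows/columns reordered as documented above
```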
The distinction between False Positives and False Negatives is not merely technical—it maps to fundamentally different kinds of errors with vastly different consequences depending on the application domain.
Type I Error (False Positive):
A False Positive occurs when the model incorrectly predicts the positive class for an instance that actually belongs to the negative class. The model has 'cried wolf'—it has flagged something as important or concerning when it was not.
Statistical interpretation: In hypothesis testing terms, this is rejecting the null hypothesis when it is actually true. The model claims to have detected an effect or condition that does not exist.
| Domain | What FP Means | Consequence |
|---|---|---|
| Medical Diagnosis | Healthy patient diagnosed with disease | Unnecessary treatment, psychological stress, wasted resources |
| Spam Detection | Legitimate email marked as spam | Important communication lost, user frustration |
| Fraud Detection | Legitimate transaction blocked | Customer inconvenience, lost sales, support costs |
| Security Screening | Innocent person flagged as threat | Delays, privacy violation, reputational harm |
| Manufacturing QC | Good product rejected | Wasted materials, reduced throughput, increased costs |
Type II Error (False Negative):
A False Negative occurs when the model incorrectly predicts the negative class for an instance that actually belongs to the positive class. The model has 'missed' something important—it has failed to detect a condition or event that was actually present.
Statistical interpretation: In hypothesis testing terms, this is failing to reject the null hypothesis when it is actually false. The model has failed to detect an effect or condition that genuinely exists.
| Domain | What FN Means | Consequence |
|---|---|---|
| Medical Diagnosis | Diseased patient cleared as healthy | Delayed treatment, disease progression, potential death |
| Spam Detection | Spam email allowed through | User annoyance, potential phishing exposure |
| Fraud Detection | Fraudulent transaction approved | Financial loss, regulatory penalties |
| Security Screening | Actual threat not detected | Security breach, potential casualties |
| Manufacturing QC | Defective product approved | Customer harm, recalls, liability |
In almost every real-world application, the costs of False Positives and False Negatives are asymmetric. Medical screening prioritizes avoiding False Negatives (missed diseases) even at the cost of more False Positives. Email filters might prioritize avoiding False Positives (lost important mail) over catching every spam message. Understanding this asymmetry is critical for choosing appropriate metrics and decision thresholds.
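When costs are asymmetric, the confusion matrix can be combined with a domain-specific cost assignment to compare classifiers by expected cost rather than by raw error counts. Below is a small sketch with purely hypothetical cost values for a medical-screening-style setting; the `expected_cost` helper and the two toy classifiers are illustrative, not a standard API:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical per-error costs: a missed disease (FN) is far costlier than a false alarm (FP)
COST_FP = 1.0   # e.g., one unnecessary follow-up test
COST_FN = 50.0  # e.g., a missed diagnosis
# Correct predictions incur no cost in this sketch.

def expected_cost(y_true, y_pred):
    TN, FP, FN, TP = confusion_matrix(y_true, y_pred).ravel()
    return (COST_FP * FP + COST_FN * FN) / (TN + FP + FN + TP)  # average cost per prediction

# Two hypothetical classifiers, each making exactly one error, but of different kinds
y_true   = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_a = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # 1 FN, 0 FP
y_pred_b = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # 0 FN, 1 FP

print(f"Classifier A expected cost: {expected_cost(y_true, y_pred_a):.2f}")  # 5.00
print(f"Classifier B expected cost: {expected_cost(y_true, y_pred_b):.2f}")  # 0.10
```

With these costs, the classifier that trades its false negative for a false positive is fifty times cheaper per error, even though both have identical accuracy.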
The confusion matrix encodes not just the four cell values but also meaningful marginal totals—the row sums and column sums—that represent important quantities:
Actual Class Totals (Row Sums):
$P = TP + FN$ (total actual positives) and $N = TN + FP$ (total actual negatives). These are fixed by the ground truth—they depend only on the dataset, not on the classifier.
Predicted Class Totals (Column Sums):
$PP = TP + FP$ (total predicted positives) and $PN = TN + FN$ (total predicted negatives). These depend on the classifier's behavior and threshold settings.
```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Example: Binary classification results
y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])  # 5 positives, 5 negatives
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])  # Model predictions

# Compute confusion matrix
# Note: sklearn convention - rows are actual, columns are predicted
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix (sklearn convention):")
print(cm)
# Output:
# [[3 2]   # Row 0: Actual Negative - [TN=3, FP=2]
#  [2 3]]  # Row 1: Actual Positive - [FN=2, TP=3]

# Extract the four fundamental quantities
TN, FP, FN, TP = cm.ravel()
print(f"\nTrue Positives (TP): {TP}")   # Correctly predicted positives
print(f"True Negatives (TN): {TN}")    # Correctly predicted negatives
print(f"False Positives (FP): {FP}")   # Type I errors
print(f"False Negatives (FN): {FN}")   # Type II errors

# Compute marginal totals
P = TP + FN   # Total actual positives
N = TN + FP   # Total actual negatives
PP = TP + FP  # Total predicted positives
PN = TN + FN  # Total predicted negatives

print(f"\nActual Positives (P): {P}")
print(f"Actual Negatives (N): {N}")
print(f"Predicted Positives (PP): {PP}")
print(f"Predicted Negatives (PN): {PN}")
print(f"Total samples (n): {P + N}")
```

Fundamental Rate Definitions:
From these quantities, we can define the fundamental rates that form the basis of all classification metrics:
| Rate | Formula | Also Known As | Meaning |
|---|---|---|---|
| True Positive Rate | $TPR = \frac{TP}{P} = \frac{TP}{TP + FN}$ | Sensitivity, Recall, Hit Rate | Proportion of actual positives correctly identified |
| True Negative Rate | $TNR = \frac{TN}{N} = \frac{TN}{TN + FP}$ | Specificity, Selectivity | Proportion of actual negatives correctly identified |
| False Positive Rate | $FPR = \frac{FP}{N} = \frac{FP}{FP + TN}$ | Fall-out, Type I Error Rate | Proportion of actual negatives incorrectly flagged |
| False Negative Rate | $FNR = \frac{FN}{P} = \frac{FN}{FN + TP}$ | Miss Rate, Type II Error Rate | Proportion of actual positives missed |
Note the complementary relationships: $TPR + FNR = 1$ and $TNR + FPR = 1$. Each pair partitions the same actual class, so knowing one rate determines the other.
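The first identity follows directly from the definitions (the second is symmetric, with $TN$ and $FP$ in place of $TP$ and $FN$):

$$TPR + FNR = \frac{TP}{TP + FN} + \frac{FN}{TP + FN} = \frac{TP + FN}{TP + FN} = 1$$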
When $K > 2$ (multi-class classification), the confusion matrix generalizes to a $K \times K$ structure. The entry $C_{ij}$ represents the number of instances with true class $i$ that were predicted as class $j$.
$$C = \begin{bmatrix} C_{11} & C_{12} & \cdots & C_{1K} \\ C_{21} & C_{22} & \cdots & C_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ C_{K1} & C_{K2} & \cdots & C_{KK} \end{bmatrix}$$
Key Properties:
The diagonal entries $C_{ii}$ count correct predictions for class $i$; off-diagonal entries $C_{ij}$ (with $i \neq j$) count instances of class $i$ misclassified as class $j$; row $i$ sums to the number of actual class-$i$ instances; column $j$ sums to the number of class-$j$ predictions; and all entries sum to $n$.
```python
import numpy as np
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Multi-class example: 4-class classification
classes = ['Cat', 'Dog', 'Bird', 'Fish']
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 0, 1, 2, 3, 0, 1, 2, 3]
y_pred = [0, 0, 1, 1, 1, 0, 2, 3, 2, 3, 3, 2, 0, 1, 2, 3, 1, 1, 2, 3]

# Compute multi-class confusion matrix
cm = confusion_matrix(y_true, y_pred)
print("Multi-class Confusion Matrix:")
print(cm)

# Analyze the confusion matrix
print("\nDetailed Analysis:")
for i, class_name in enumerate(classes):
    correct = cm[i, i]
    total_actual = cm[i, :].sum()
    total_predicted = cm[:, i].sum()
    print(f"\n{class_name}:")
    print(f"  Correct predictions: {correct}/{total_actual} ({100*correct/total_actual:.1f}%)")
    print(f"  Times predicted: {total_predicted}")
    # Show confusions
    for j, other_class in enumerate(classes):
        if i != j and cm[i, j] > 0:
            print(f"  Confused with {other_class}: {cm[i, j]} times")

# Calculate per-class metrics (treating each class as binary)
print("\nPer-class Binary Metrics:")
for i, class_name in enumerate(classes):
    # For class i: TP, FP, FN, TN
    TP = cm[i, i]
    FP = cm[:, i].sum() - TP      # Column sum minus diagonal
    FN = cm[i, :].sum() - TP      # Row sum minus diagonal
    TN = cm.sum() - TP - FP - FN  # Everything else

    precision = TP / (TP + FP) if (TP + FP) > 0 else 0
    recall = TP / (TP + FN) if (TP + FN) > 0 else 0
    print(f"{class_name}: TP={TP}, FP={FP}, FN={FN}, TN={TN}")
    print(f"  Precision: {precision:.3f}, Recall: {recall:.3f}")
```

When interpreting a K×K confusion matrix, look for patterns in the off-diagonal entries. Large values indicate systematic confusions—classes that the model frequently mistakes for each other. This often reveals semantic similarities (e.g., 'Dog' confused with 'Cat' more often than with 'Car') or data collection issues (e.g., mislabeled training examples).
Raw counts in a confusion matrix can be difficult to interpret, especially with class imbalance. Normalization transforms counts into proportions, enabling easier comparison and pattern recognition.
Three Normalization Strategies:
Row normalization (by true class): Divide each row by its sum. Each row sums to 1. Entry $C'_{ij} = C_{ij} / \sum_{k} C_{ik}$ represents the proportion of class $i$ instances predicted as class $j$. This shows recall per class on the diagonal.
Column normalization (by predicted class): Divide each column by its sum. Each column sums to 1. Entry $C'_{ij} = C_{ij} / \sum_{k} C_{kj}$ represents the proportion of class $j$ predictions that are actually class $i$. This shows precision per class on the diagonal.
Total normalization: Divide all entries by the total count. The entire matrix sums to 1. Entry $C'_{ij} = C_{ij} / n$ represents the proportion of all predictions that are true class $i$ predicted as class $j$.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Example with class imbalance
y_true = [0]*100 + [1]*50 + [2]*25  # Imbalanced: 100, 50, 25 samples
y_pred = [0]*85 + [1]*10 + [2]*5 + \
         [1]*40 + [0]*5 + [2]*5 + \
         [2]*20 + [0]*3 + [1]*2

cm = confusion_matrix(y_true, y_pred)

# Create figure with three normalizations
fig, axes = plt.subplots(1, 4, figsize=(20, 4))

# Raw counts
ConfusionMatrixDisplay(cm, display_labels=['Class 0', 'Class 1', 'Class 2']).plot(ax=axes[0])
axes[0].set_title('Raw Counts')

# Row-normalized (shows recall)
cm_row = cm.astype(float) / cm.sum(axis=1, keepdims=True)
ConfusionMatrixDisplay(cm_row, display_labels=['Class 0', 'Class 1', 'Class 2']).plot(ax=axes[1])
axes[1].set_title('Row Normalized (Recall)')

# Column-normalized (shows precision on diagonal)
cm_col = cm.astype(float) / cm.sum(axis=0, keepdims=True)
ConfusionMatrixDisplay(cm_col, display_labels=['Class 0', 'Class 1', 'Class 2']).plot(ax=axes[2])
axes[2].set_title('Column Normalized (Precision)')

# Overall-normalized
cm_total = cm.astype(float) / cm.sum()
ConfusionMatrixDisplay(cm_total, display_labels=['Class 0', 'Class 1', 'Class 2']).plot(ax=axes[3])
axes[3].set_title('Total Normalized')

plt.tight_layout()
plt.savefig('confusion_matrix_normalizations.png', dpi=150)
plt.show()

# Interpretation
print("Interpretation of Row-Normalized (Recall):")
for i, cls in enumerate(['Class 0', 'Class 1', 'Class 2']):
    print(f"  {cls} recall: {cm_row[i,i]:.1%} of actual {cls} correctly predicted")

print("\nInterpretation of Column-Normalized (Precision):")
for i, cls in enumerate(['Class 0', 'Class 1', 'Class 2']):
    print(f"  {cls} precision: {cm_col[i,i]:.1%} of {cls} predictions are correct")
```

Use color gradients (heatmaps) to make patterns visible at a glance. Annotate cells with both values and percentages when space permits. Order classes semantically when possible (e.g., severity levels 'Low→Medium→High') so that 'close' confusions appear near the diagonal. For large K, consider grouping similar classes or focusing on the most significant confusions.
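As a lighter-weight alternative to the manual division in the example above, recent versions of scikit-learn (0.22+) expose the same three normalizations through the `normalize` argument of `confusion_matrix`; a brief sketch using the same hypothetical data:

```python
from sklearn.metrics import confusion_matrix

y_true = [0]*100 + [1]*50 + [2]*25
y_pred = [0]*85 + [1]*10 + [2]*5 + \
         [1]*40 + [0]*5 + [2]*5 + \
         [2]*20 + [0]*3 + [1]*2

cm_row   = confusion_matrix(y_true, y_pred, normalize='true')  # rows sum to 1 (recall view)
cm_col   = confusion_matrix(y_true, y_pred, normalize='pred')  # columns sum to 1 (precision view)
cm_total = confusion_matrix(y_true, y_pred, normalize='all')   # whole matrix sums to 1

print(cm_row.diagonal())  # per-class recall
print(cm_col.diagonal())  # per-class precision
```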
The confusion matrix is the single source of truth from which all classification metrics can be derived. Understanding how each metric is built from the four cells is essential for choosing appropriate metrics and for recognizing what each one does and does not capture.
Here is the complete derivation of common metrics from the binary confusion matrix:
| Metric | Formula | Intuition |
|---|---|---|
| Accuracy | $(TP + TN) / n$ | Proportion of all predictions that are correct |
| Error Rate | $(FP + FN) / n$ | Proportion of all predictions that are wrong |
| Precision | $TP / (TP + FP)$ | Of predicted positives, how many are correct? |
| Recall (Sensitivity) | $TP / (TP + FN)$ | Of actual positives, how many did we find? |
| Specificity | $TN / (TN + FP)$ | Of actual negatives, how many did we correctly identify? |
| F1 Score | $2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$ | Harmonic mean of precision and recall |
| Balanced Accuracy | $(TPR + TNR) / 2$ | Average of sensitivity and specificity |
| Matthews Correlation Coefficient | $\frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$ | Correlation between predicted and actual |
```python
import numpy as np
from sklearn.metrics import confusion_matrix

def derive_all_metrics(y_true, y_pred):
    """
    Derive all common classification metrics from the confusion matrix.

    This demonstrates that the confusion matrix is the complete source
    of classification performance information.
    """
    cm = confusion_matrix(y_true, y_pred)
    TN, FP, FN, TP = cm.ravel()

    # Marginal totals
    P = TP + FN   # Actual positives
    N = TN + FP   # Actual negatives
    PP = TP + FP  # Predicted positives
    PN = TN + FN  # Predicted negatives
    n = P + N     # Total samples

    # Basic rates
    tpr = TP / P if P > 0 else 0  # True Positive Rate (Recall/Sensitivity)
    tnr = TN / N if N > 0 else 0  # True Negative Rate (Specificity)
    fpr = FP / N if N > 0 else 0  # False Positive Rate
    fnr = FN / P if P > 0 else 0  # False Negative Rate

    # Common metrics
    accuracy = (TP + TN) / n
    error_rate = (FP + FN) / n
    precision = TP / PP if PP > 0 else 0
    recall = tpr       # Same as TPR
    specificity = tnr  # Same as TNR

    # F1 Score (harmonic mean of precision and recall)
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

    # Balanced Accuracy
    balanced_acc = (tpr + tnr) / 2

    # Matthews Correlation Coefficient
    denom = np.sqrt((TP+FP) * (TP+FN) * (TN+FP) * (TN+FN))
    mcc = (TP*TN - FP*FN) / denom if denom > 0 else 0

    # Informedness (Youden's J) and Markedness
    informedness = tpr + tnr - 1  # Youden's J statistic
    markedness = precision + (TN / PN if PN > 0 else 0) - 1

    return {
        'Confusion Matrix': cm,
        'TP': TP, 'TN': TN, 'FP': FP, 'FN': FN,
        'Accuracy': accuracy,
        'Error Rate': error_rate,
        'Precision (PPV)': precision,
        'Recall (Sensitivity, TPR)': recall,
        'Specificity (TNR)': specificity,
        'F1 Score': f1,
        'Balanced Accuracy': balanced_acc,
        'MCC': mcc,
        'Informedness (Youden J)': informedness,
        'FPR': fpr,
        'FNR': fnr,
    }

# Example usage
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

metrics = derive_all_metrics(y_true, y_pred)
print("All Metrics Derived from Confusion Matrix:")
print("-" * 45)
for name, value in metrics.items():
    if isinstance(value, np.ndarray):
        print(f"{name}:\n{value}")
    else:
        print(f"{name}: {value:.4f}" if isinstance(value, float) else f"{name}: {value}")
```

Working with confusion matrices requires attention to several common sources of error and misinterpretation, most notably the row/column and class-ordering conventions used by your library and the way raw counts can obscure per-class performance under class imbalance.
We have established the confusion matrix as the foundational structure for all classification evaluation: its four fundamental cells (TP, TN, FP, FN), the asymmetric consequences of Type I and Type II errors, the marginal totals and rates built from them, the $K \times K$ multi-class generalization, normalization strategies for interpretation, and the derivation of every common metric from its entries.
What's Next:
With the confusion matrix as our foundation, we're ready to examine specific metrics in detail. The next page explores accuracy—the most intuitive metric—and critically examines its limitations, particularly in the presence of class imbalance.
You now understand the confusion matrix as the foundational structure for classification evaluation. This knowledge is prerequisite to understanding every metric we will examine in subsequent pages—each is simply a different way of summarizing the information contained in this fundamental matrix.