If you ask a non-technical stakeholder how to measure whether a classifier is performing well, their answer will almost certainly involve some form of 'percentage correct.' This intuition leads directly to accuracy, the most natural and universally understood classification metric.
Yet accuracy, despite its simplicity and intuitive appeal, is perhaps the most dangerously misleading metric in the machine learning practitioner's toolkit. Its apparent straightforwardness conceals fundamental limitations that have led countless projects astray, produced models that fail spectacularly in production, and fostered a false sense of confidence in systems that don't actually work.
This page will thoroughly examine accuracy—its definition, its proper use cases, and its critical limitations—so that you can wield it appropriately and recognize when it will mislead you.
By the end of this page, you will understand accuracy's mathematical definition, recognize the scenarios where it is appropriate, identify the conditions under which it becomes unreliable, and learn to detect accuracy paradoxes before they impact your work.
Accuracy is defined as the proportion of predictions that are correct:
$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} = \frac{TP + TN}{TP + TN + FP + FN}$$
In terms of the confusion matrix:
For multi-class classification with $K$ classes:
$$\text{Accuracy} = \frac{\sum_{i=1}^{K} C_{ii}}{\sum_{i=1}^{K} \sum_{j=1}^{K} C_{ij}} = \frac{\text{Trace}(C)}{\sum C}$$
where $C_{ij}$ is the confusion matrix entry for true class $i$, predicted class $j$.
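As a quick sanity check, the trace formula matches `sklearn`'s `accuracy_score` on a small three-class example (the labels below are toy data, purely illustrative):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy 3-class labels (illustrative only)
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2, 0]
y_pred = [0, 1, 1, 1, 2, 2, 2, 0, 2, 0]

C = confusion_matrix(y_true, y_pred)
acc_trace = np.trace(C) / C.sum()   # Trace(C) / sum of all entries
print(acc_trace)                    # 7 of 10 predictions correct -> 0.7
assert np.isclose(acc_trace, accuracy_score(y_true, y_pred))
```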
The error rate (or misclassification rate) is the complement of accuracy: $\text{Error Rate} = 1 - \text{Accuracy} = \frac{FP + FN}{n}$. Any statement about accuracy implies the corresponding statement about error rate and vice versa.
```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Example predictions
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Method 1: Direct calculation
accuracy_direct = np.mean(np.array(y_true) == np.array(y_pred))
print(f"Accuracy (direct): {accuracy_direct:.4f}")

# Method 2: From confusion matrix
cm = confusion_matrix(y_true, y_pred)
TN, FP, FN, TP = cm.ravel()
accuracy_from_cm = (TP + TN) / (TP + TN + FP + FN)
print(f"Accuracy (from CM): {accuracy_from_cm:.4f}")

# Method 3: Using sklearn
accuracy_sklearn = accuracy_score(y_true, y_pred)
print(f"Accuracy (sklearn): {accuracy_sklearn:.4f}")

# All three methods give identical results
print("Confusion Matrix:")
print(cm)
print(f"Breakdown: {TP + TN} correct / {TP + TN + FP + FN} total = {accuracy_sklearn:.1%}")
```

Accuracy's widespread use isn't accidental—it possesses several properties that make it cognitively appealing:
1. Interpretability: An accuracy of 95% is immediately understandable: 'The model is correct 95% of the time.' No technical explanation is needed. Stakeholders, executives, and non-technical team members can immediately grasp what this means.
2. Bounded range: Accuracy falls in $[0, 1]$ (or 0-100%), providing a clear sense of scale. Values near 1 are good; values near 0 are bad. The endpoints are meaningful and interpretable.
3. Global view: Accuracy considers all predictions equally, providing a single summary of overall performance without favoring any particular class or outcome.
4. Comparison simplicity: Comparing models by accuracy is trivial: the model with higher accuracy is better (under the assumption that all errors are equally costly).
5. Historical precedent: Accuracy has been used for decades across statistics, machine learning, and related fields, making it a lingua franca for performance reporting.
Accuracy is appropriate when: (1) classes are balanced (roughly equal proportions), (2) all errors are equally costly (no asymmetric error costs), and (3) you genuinely care about overall correctness rather than performance on any specific class. These conditions are rarer than most practitioners realize.
The accuracy paradox is the phenomenon where a model with higher accuracy can be objectively worse than a model with lower accuracy. This occurs when class distributions are imbalanced.
The Setup:
Consider a fraud detection dataset with 10,000 transactions: 9,900 legitimate and 100 fraudulent.
The Trivial Baseline:
A 'model' that simply predicts 'legitimate' for every transaction achieves:
$$\text{Accuracy}_{trivial} = \frac{9900}{10000} = 99\%$$
This model catches zero fraudulent transactions and provides no business value whatsoever, yet it reports an impressive 99% accuracy.
The Paradox:
A real fraud detection model that catches 80% of frauds while having a 2% false alarm rate:
$$\text{Accuracy}_{real} = \frac{80 + 9702}{10000} = 97.82\%$$
The real model has lower accuracy (97.82% vs 99%) but is infinitely more valuable for its intended purpose.
When classes are imbalanced, accuracy is dominated by the majority class. A classifier can achieve high accuracy by simply predicting the majority class, learning nothing about the minority class. This makes accuracy actively harmful as a metric in many real-world scenarios.
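The two accuracies above can be reproduced in a few lines (assuming, as in the setup, 9,900 legitimate and 100 fraudulent transactions):

```python
n_total, n_fraud = 10_000, 100
n_legit = n_total - n_fraud        # 9,900 legitimate transactions

# Trivial baseline: predict 'legitimate' for every transaction
trivial_acc = n_legit / n_total    # 0.99

# Real model: catches 80% of frauds, 2% false-alarm rate on legitimate
tp = int(0.80 * n_fraud)           # 80 frauds caught
tn = int(0.98 * n_legit)           # 9,702 legitimate correctly passed
real_acc = (tp + tn) / n_total     # 0.9782

print(f"Trivial: {trivial_acc:.2%}, Real: {real_acc:.2%}")
assert real_acc < trivial_acc      # the paradox: the useful model scores lower
```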
Let's formalize why accuracy fails under class imbalance. Let $\pi = P/n$ be the proportion of positive examples in the dataset, where $P$ is the number of positives and $n$ the total sample size; $(1-\pi)$ is then the proportion of negatives.
Baseline Accuracy:
A trivial classifier that always predicts the majority class achieves:
$$\text{Accuracy}_{baseline} = \max(\pi, 1-\pi)$$
This means a 90/10 class split hands any constant predictor 90% accuracy, and a 99/1 split hands it 99%—all without learning anything from the features.
The more imbalanced the dataset, the higher the 'free' baseline accuracy.
Why This Matters:
Two components contribute to accuracy:
$$\text{Accuracy} = \pi \cdot TPR + (1-\pi) \cdot TNR$$
where TPR is the True Positive Rate and TNR is the True Negative Rate.
When $\pi \ll 0.5$ (highly imbalanced toward negatives), the $(1-\pi) \cdot TNR$ term dominates: accuracy is approximately equal to TNR, and the model's performance on the positive class (TPR) barely moves the number.
```python
import numpy as np
import matplotlib.pyplot as plt

def analyze_accuracy_paradox():
    """
    Demonstrate how class imbalance breaks accuracy as a useful metric.
    """
    # Range of class imbalance ratios
    positive_proportions = [0.5, 0.1, 0.05, 0.01, 0.001]

    # Fixed TPR (recall) for a "good" model
    tpr_good = 0.80  # Catches 80% of positives
    tnr_good = 0.98  # Correctly identifies 98% of negatives

    print("Impact of Class Imbalance on Accuracy")
    print("=" * 60)
    print(f"Model performance: TPR = {tpr_good:.0%}, TNR = {tnr_good:.0%}")
    print("-" * 60)

    for pi in positive_proportions:
        # Baseline: Always predict majority class
        baseline_acc = max(pi, 1 - pi)

        # Our model's accuracy
        model_acc = pi * tpr_good + (1 - pi) * tnr_good

        # How much better than baseline?
        relative_improvement = (model_acc - baseline_acc) / baseline_acc * 100

        print(f"Positive class proportion: {pi:.1%}")
        print(f"  Baseline (always negative): {baseline_acc:.3%}")
        print(f"  Our model:                  {model_acc:.3%}")
        print(f"  Improvement over baseline:  {relative_improvement:+.2f}%")
        if model_acc < baseline_acc:
            print("  ⚠️ Model has LOWER accuracy than trivial baseline!")

    # Visualization
    pi_range = np.linspace(0.001, 0.5, 100)
    baseline_accs = np.maximum(pi_range, 1 - pi_range)
    model_accs = pi_range * tpr_good + (1 - pi_range) * tnr_good

    plt.figure(figsize=(10, 6))
    plt.plot(pi_range, baseline_accs * 100, 'r--', linewidth=2,
             label='Baseline (majority class)')
    plt.plot(pi_range, model_accs * 100, 'b-', linewidth=2,
             label='Model (TPR=80%, TNR=98%)')
    plt.fill_between(pi_range, baseline_accs * 100, model_accs * 100,
                     where=model_accs > baseline_accs, alpha=0.3, color='blue',
                     label='Model beats baseline')
    plt.fill_between(pi_range, baseline_accs * 100, model_accs * 100,
                     where=model_accs < baseline_accs, alpha=0.3, color='red',
                     label='Baseline beats model')
    plt.xlabel('Proportion of Positive Class (π)', fontsize=12)
    plt.ylabel('Accuracy (%)', fontsize=12)
    plt.title('The Accuracy Paradox: When Good Models Have Lower Accuracy', fontsize=14)
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.xlim(0, 0.5)
    plt.ylim(75, 100)
    plt.savefig('accuracy_paradox.png', dpi=150, bbox_inches='tight')
    plt.show()

analyze_accuracy_paradox()
```

Beyond class imbalance, accuracy fails when error costs are asymmetric—which is almost always the case in real-world applications.
The Implicit Assumption:
Accuracy treats all errors identically: a false positive and a false negative each reduce accuracy by exactly $1/n$, so the metric implicitly assumes the two kinds of mistake are equally costly. This assumption is rarely valid in practice.
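To make the implicit assumption concrete: swapping a false negative for a false positive leaves accuracy completely unchanged (toy labels, illustrative only):

```python
from sklearn.metrics import accuracy_score

y_true  = [1, 0, 1, 0, 1, 0]
pred_fn = [0, 0, 1, 0, 1, 0]   # one false negative, everything else correct
pred_fp = [1, 1, 1, 0, 1, 0]   # one false positive, everything else correct

# Accuracy cannot tell these two error profiles apart
print(accuracy_score(y_true, pred_fn), accuracy_score(y_true, pred_fp))
assert accuracy_score(y_true, pred_fn) == accuracy_score(y_true, pred_fp)
```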
Example: Medical Screening
Consider a cancer screening test. A false negative means a cancer goes undetected and treatment is delayed; a false positive means patient anxiety and an unnecessary follow-up biopsy.
While both errors have costs, a False Negative is dramatically more serious. A metric that weights them equally cannot guide model selection appropriately.
| Domain | High-Cost Error | Why Cost is High | Acceptable Trade-off |
|---|---|---|---|
| Cancer Screening | False Negative | Missed diagnosis → delayed treatment → death | Accept more False Positives |
| Spam Email | False Positive | Important email lost → missed opportunity | Accept more spam through |
| Autonomous Driving | False Negative (obstacle) | Collision → injury/death | Accept unnecessary stops |
| Criminal Sentencing | False Positive | Innocent person imprisoned | Accept more guilty go free |
| Credit Scoring | Context-dependent | FN = bad loan; FP = lost customer | Depends on business model |
When errors have unequal costs, accuracy optimization produces suboptimal decisions. A model that minimizes accuracy-based error may maximize total cost. Cost-sensitive metrics or explicit cost-benefit analysis should replace accuracy in these scenarios.
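A minimal cost-matrix sketch makes the point numerically. The cost figures below (a false negative assumed 50× as costly as a false positive) and both confusion matrices are hypothetical, chosen so that the higher-accuracy model incurs the higher total cost:

```python
import numpy as np

# Hypothetical cost matrix: rows = true class, cols = predicted class.
# cost[1][0] = 50 means a false negative costs 50x a false positive.
cost = np.array([[0, 1],
                 [50, 0]])

def total_cost(cm):
    """Sum of (error count * per-error cost) over the confusion matrix."""
    return (cm * cost).sum()

def accuracy(cm):
    return np.trace(cm) / cm.sum()

# Model A: higher accuracy, but misses more positives (more FNs)
cm_a = np.array([[9800, 100],
                 [60, 40]])
# Model B: lower accuracy, but catches far more positives
cm_b = np.array([[9500, 400],
                 [10, 90]])

print(f"A: accuracy={accuracy(cm_a):.1%}, cost={total_cost(cm_a)}")  # 98.4%, 3100
print(f"B: accuracy={accuracy(cm_b):.1%}, cost={total_cost(cm_b)}")  # 95.9%, 900
assert accuracy(cm_a) > accuracy(cm_b) and total_cost(cm_a) > total_cost(cm_b)
```

Under these assumed costs, optimizing accuracy selects model A while the cost-sensitive view clearly prefers model B.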
Several metrics address accuracy's limitations while retaining some of its simplicity:
1. Balanced Accuracy:
Balanced accuracy averages the recall across classes, giving equal weight to each class regardless of size:
$$\text{Balanced Accuracy} = \frac{1}{K} \sum_{i=1}^{K} \text{Recall}_i = \frac{TPR + TNR}{2} \text{ (for binary)}$$
For binary classification: $\text{Balanced Accuracy} = \frac{1}{2}\left(\frac{TP}{TP+FN} + \frac{TN}{TN+FP}\right)$
This metric equals standard accuracy when classes are perfectly balanced, but drops to 50% for a trivial classifier that always predicts one class—exposing exactly the failure that standard accuracy hides.
2. Weighted Accuracy:
Assigns explicit weights $w_i$ to each class:
$$\text{Weighted Accuracy} = \frac{\sum_{i=1}^{K} w_i \cdot C_{ii}}{\sum_{i=1}^{K} w_i \cdot \sum_{j=1}^{K} C_{ij}}$$
This allows incorporating domain knowledge about class importance.
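A direct translation of the weighted-accuracy formula (the helper name is mine). Two useful special cases: with all weights equal to 1 it reduces to standard accuracy, and with $w_i = 1/n_i$ (inverse class counts) it reduces to balanced accuracy:

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

def weighted_accuracy(y_true, y_pred, weights):
    """Weighted accuracy: sum(w_i * C_ii) / sum(w_i * row_sum_i)."""
    C = confusion_matrix(y_true, y_pred)
    w = np.asarray(weights, dtype=float)
    return (w * np.diag(C)).sum() / (w * C.sum(axis=1)).sum()

# Toy imbalanced labels: 8 negatives, 2 positives
y_true = [0]*8 + [1]*2
y_pred = [0]*7 + [1] + [1] + [0]   # 1 FP, 1 TP, 1 FN

# Equal weights recover standard accuracy
assert np.isclose(weighted_accuracy(y_true, y_pred, [1, 1]),
                  accuracy_score(y_true, y_pred))

# Inverse-frequency weights recover balanced accuracy
counts = np.bincount(y_true)
assert np.isclose(weighted_accuracy(y_true, y_pred, 1 / counts),
                  balanced_accuracy_score(y_true, y_pred))
```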
3. Geometric Mean of Class Accuracies:
$$G\text{-Mean} = \sqrt{TPR \cdot TNR} = \sqrt{\text{Sensitivity} \cdot \text{Specificity}}$$
This metric is zero if either TPR or TNR is zero, penalizing models that completely fail on any class.
```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, confusion_matrix

def compute_accuracy_variants(y_true, y_pred):
    """
    Compute standard accuracy and its improved variants.
    """
    cm = confusion_matrix(y_true, y_pred)
    TN, FP, FN, TP = cm.ravel()

    # Standard accuracy
    accuracy = (TP + TN) / (TP + TN + FP + FN)

    # Balanced accuracy
    tpr = TP / (TP + FN) if (TP + FN) > 0 else 0
    tnr = TN / (TN + FP) if (TN + FP) > 0 else 0
    balanced_acc = (tpr + tnr) / 2

    # Geometric mean
    g_mean = np.sqrt(tpr * tnr)

    return {
        'Standard Accuracy': accuracy,
        'Balanced Accuracy': balanced_acc,
        'G-Mean': g_mean,
        'TPR (Sensitivity)': tpr,
        'TNR (Specificity)': tnr,
    }

# Example: Imbalanced dataset
np.random.seed(42)
n_neg, n_pos = 950, 50  # 95% negative, 5% positive

# Scenario 1: model that mostly predicts negative
y_true_1 = [0] * n_neg + [1] * n_pos
y_pred_1 = [0] * n_neg + [1] * 10 + [0] * 40  # Catches only 10/50 positives

print("Scenario 1: Conservative Model (predicts few positives)")
print("-" * 55)
for name, value in compute_accuracy_variants(y_true_1, y_pred_1).items():
    print(f"  {name}: {value:.4f}")

# Scenario 2: model with better minority class performance
y_pred_2 = [0] * 920 + [1] * 30 + [1] * 40 + [0] * 10  # Catches 40/50 positives

print("Scenario 2: Aggressive Model (better minority detection)")
print("-" * 55)
for name, value in compute_accuracy_variants(y_true_1, y_pred_2).items():
    print(f"  {name}: {value:.4f}")

print("Key Observation:")
print("  Standard accuracy prefers the conservative model.")
print("  Balanced accuracy and G-Mean prefer the aggressive model.")
```

Despite its limitations, accuracy remains the right metric in specific circumstances. Understanding when to use it—rather than blanket avoidance—is the mark of a thoughtful practitioner.
Conditions for Appropriate Use: balanced classes, roughly symmetric error costs, and a genuine interest in overall correctness across all classes (the three conditions summarized earlier).
Always compare your model's accuracy against the baseline accuracy (majority class proportion). If your model's accuracy is close to the baseline, the model may not have learned anything useful about the minority class. This simple check catches many cases where accuracy is misleading.
Decision Framework:
```
Should I use accuracy?
├── Are classes balanced (within 40-60%)?
│   ├── No → Use balanced accuracy, F1, or class-specific metrics
│   └── Yes → Continue to next check
├── Are error costs roughly symmetric?
│   ├── No → Use cost-sensitive metrics or threshold optimization
│   └── Yes → Continue to next check
└── Do I care equally about all classes?
    ├── No → Use class-weighted metrics or macro/micro averages
    └── Yes → Accuracy is appropriate
```
For completeness, let's examine accuracy's statistical properties:
1. Accuracy as a Random Variable:
When evaluated on a finite test set of size $n$, accuracy is a point estimate of the true probability of correct classification $p$. If each prediction is independent (often approximately true), the number of correct predictions follows a Binomial distribution:
$$\text{Correct predictions} \sim \text{Binomial}(n, p)$$
2. Confidence Intervals:
A $(1-\alpha)$ confidence interval for accuracy can be constructed using:
$$\hat{p} \pm z_{1-\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
where $\hat{p}$ is the observed accuracy. For small $n$ or extreme $\hat{p}$, use Wilson score intervals or exact binomial intervals.
3. Sample Size Requirements:
To estimate accuracy within margin of error $\epsilon$ with confidence $(1-\alpha)$:
$$n \geq \frac{z_{1-\alpha/2}^2 \cdot p(1-p)}{\epsilon^2}$$
For the worst case ($p = 0.5$) with 95% confidence and $\epsilon = 0.01$:
$$n \geq \frac{1.96^2 \cdot 0.25}{0.01^2} = 9604$$
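The sample-size formula translates directly into code (using the exact normal quantile from `scipy` rather than the rounded 1.96; the function name is illustrative):

```python
import math
from scipy import stats

def required_sample_size(epsilon, confidence=0.95, p=0.5):
    """Smallest n estimating accuracy within ±epsilon at the given confidence."""
    z = stats.norm.ppf((1 + confidence) / 2)
    return math.ceil(z**2 * p * (1 - p) / epsilon**2)

# Worst case (p = 0.5), 95% confidence, ±1% margin
print(required_sample_size(0.01))   # matches the 9,604 computed above
# A looser ±5% margin needs far fewer examples
print(required_sample_size(0.05))
```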
```python
import numpy as np
from scipy import stats

def accuracy_confidence_interval(y_true, y_pred, confidence=0.95):
    """
    Compute confidence interval for accuracy using Wilson score interval.

    More reliable than normal approximation for extreme proportions or small n.
    """
    n = len(y_true)
    correct = np.sum(np.array(y_true) == np.array(y_pred))
    accuracy = correct / n

    # Wilson score interval
    z = stats.norm.ppf((1 + confidence) / 2)
    denominator = 1 + z**2 / n
    center = (accuracy + z**2 / (2*n)) / denominator
    margin = z * np.sqrt((accuracy * (1 - accuracy) + z**2 / (4*n)) / n) / denominator

    lower = max(0, center - margin)
    upper = min(1, center + margin)

    return {
        'accuracy': accuracy,
        'n': n,
        'confidence': confidence,
        'ci_lower': lower,
        'ci_upper': upper,
        'margin_of_error': (upper - lower) / 2
    }

# Example: Two models with different sample sizes
print("Impact of Sample Size on Confidence Intervals")
print("=" * 50)

# Small sample
y_true_small = [1]*8 + [0]*2
y_pred_small = [1]*7 + [0]*1 + [0]*1 + [0]*1
result_small = accuracy_confidence_interval(y_true_small, y_pred_small)
print(f"Small sample (n={result_small['n']}):")
print(f"  Accuracy: {result_small['accuracy']:.1%}")
print(f"  95% CI: [{result_small['ci_lower']:.1%}, {result_small['ci_upper']:.1%}]")
print(f"  Margin of error: ±{result_small['margin_of_error']:.1%}")

# Large sample
y_true_large = [1]*800 + [0]*200
y_pred_large = [1]*700 + [0]*100 + [0]*100 + [1]*100
result_large = accuracy_confidence_interval(y_true_large, y_pred_large)
print(f"Large sample (n={result_large['n']}):")
print(f"  Accuracy: {result_large['accuracy']:.1%}")
print(f"  95% CI: [{result_large['ci_lower']:.1%}, {result_large['ci_upper']:.1%}]")
print(f"  Margin of error: ±{result_large['margin_of_error']:.1%}")

print("Conclusion: Larger samples give tighter confidence intervals.")
```

Accuracy is the most intuitive and widely reported classification metric—and also the most commonly misused. Understanding its proper scope is essential for any ML practitioner.
What's Next:
Recognizing accuracy's limitations motivates the search for better metrics. The next page introduces precision and recall—two complementary metrics that separately address different types of classifier errors and provide a more nuanced view of performance.
You now understand accuracy's definition, intuitive appeal, and critical limitations. You can identify when accuracy is appropriate and when it will mislead you—an essential skill for any machine learning practitioner working on real-world classification problems.