The ROC curve provides a complete picture of classifier behavior across all thresholds—but that completeness comes with complexity. Comparing two ROC curves visually works for simple cases, but what if curves cross? What if we need to rank 50 models programmatically? What if we need a single number for optimization or reporting?
This is where the Area Under the ROC Curve (AUC) becomes indispensable. AUC collapses an entire curve into a single scalar value between 0 and 1, enabling objective comparison of models whose curves cross, programmatic ranking of many candidates, and a single number for optimization and reporting.
But reducing a curve to a number involves tradeoffs. When is AUC the right summary? When does it mislead? This page provides the deep understanding needed to answer these questions confidently.
By the end of this page, you will understand AUC's mathematical definition, its probabilistic interpretation as a ranking metric, efficient computation algorithms, statistical properties and confidence intervals, and critical guidance on when AUC is appropriate versus when alternative metrics are needed.
The Area Under the Curve (AUC) is exactly what the name suggests: the area between the ROC curve and the x-axis. But this simple geometric definition has profound implications.
Given an ROC curve as a function TPR(FPR), the AUC is:
$$\text{AUC} = \int_0^1 \text{TPR}(\text{FPR}) \, d(\text{FPR})$$
Alternatively, parameterizing by threshold τ:
$$\text{AUC} = -\int_{\tau=-\infty}^{\tau=+\infty} \text{TPR}(\tau) \cdot \frac{d\text{FPR}(\tau)}{d\tau} \, d\tau$$
In practice, we have discrete points on the ROC curve. The trapezoidal rule integrates the area under the step function:
```python
import numpy as np

def auc_trapezoidal(fpr, tpr):
    """
    Compute AUC using the trapezoidal rule.

    Parameters:
    -----------
    fpr : array-like
        False positive rates (sorted in ascending order)
    tpr : array-like
        Corresponding true positive rates

    Returns:
    --------
    float : Area under the ROC curve
    """
    fpr = np.array(fpr)
    tpr = np.array(tpr)

    # Ensure proper ordering
    order = np.argsort(fpr)
    fpr = fpr[order]
    tpr = tpr[order]

    # Trapezoidal integration
    # Area = sum of trapezoid areas between consecutive points
    # Trapezoid area = (base) × (average of heights)
    #                = (fpr[i+1] - fpr[i]) × (tpr[i+1] + tpr[i]) / 2
    auc = 0.0
    for i in range(len(fpr) - 1):
        width = fpr[i + 1] - fpr[i]
        height = (tpr[i] + tpr[i + 1]) / 2
        auc += width * height

    return auc

# Equivalent one-liner with numpy:
# auc = np.trapz(tpr, fpr)

# Example usage
fpr_example = [0.0, 0.1, 0.2, 0.5, 1.0]
tpr_example = [0.0, 0.4, 0.6, 0.9, 1.0]
print(f"AUC = {auc_trapezoidal(fpr_example, tpr_example):.4f}")
```

| AUC Value | Interpretation |
|---|---|
| 1.0 | Perfect classifier — all positives ranked above all negatives |
| 0.9 - 1.0 | Excellent — strong separation between classes |
| 0.8 - 0.9 | Good — useful discrimination |
| 0.7 - 0.8 | Fair — moderate separation |
| 0.5 - 0.7 | Poor — weak discrimination, approaching random |
| 0.5 | Random — no better than chance |
| 0.0 - 0.5 | Worse than random — invert predictions to improve |
| 0.0 | Perfectly inverted — flip predictions for perfect classifier |
If you swap every positive and negative label, the ROC curve reflects across the diagonal, and AUC becomes 1 - AUC. A classifier with AUC = 0.3 becomes AUC = 0.7 after label swapping. This symmetry explains why AUC < 0.5 indicates useful signal (just inverted).
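A quick way to see this symmetry is to score the same predictions against flipped labels; a small sketch using scikit-learn's `roc_auc_score` (the synthetic data here is purely illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=500)                  # synthetic binary labels
scores = labels * 0.5 + rng.normal(0, 0.5, size=500)   # noisy scores correlated with the labels

auc = roc_auc_score(labels, scores)
auc_flipped = roc_auc_score(1 - labels, scores)        # swap positive and negative labels

print(f"AUC           = {auc:.3f}")
print(f"AUC (flipped) = {auc_flipped:.3f}")
print(f"Sum           = {auc + auc_flipped:.3f}")      # ≈ 1.0
```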
The most profound insight about AUC is its probabilistic interpretation. This isn't just a mathematical curiosity—it fundamentally shapes how we should think about AUC as a metric.
AUC equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative example:
$$\text{AUC} = P(\text{score}(x^+) > \text{score}(x^-))$$
where $x^+$ is a random positive instance and $x^-$ is a random negative instance.
Handling ties: $$\text{AUC} = P(\text{score}(x^+) > \text{score}(x^-)) + \frac{1}{2}P(\text{score}(x^+) = \text{score}(x^-))$$
This is exactly the Wilcoxon rank-sum statistic (Mann-Whitney U), connecting ROC analysis to non-parametric statistics.
AUC measures RANKING QUALITY, not classification accuracy.
If you randomly pick one positive and one negative example, AUC is the probability the classifier ranks the positive higher. This makes AUC ideal for comparing models on their ability to ORDER examples by class likelihood, independent of threshold choice.
The connection between the geometric AUC and the probabilistic interpretation can be proven by considering all positive-negative pairs:
1. For each positive example $x_i^+$ with score $s_i^+$, count how many negative examples $x_j^-$ with score $s_j^-$ satisfy $s_i^+ > s_j^-$.
2. Sum these counts across all positives.
3. Divide by the total number of pairs (P × N).
This count-based calculation exactly equals the trapezoidal AUC.
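To check the equivalence numerically, the sketch below (assuming a recent SciPy, where `mannwhitneyu` returns the U statistic for the first sample) divides the Mann-Whitney U by P × N and compares it with the trapezoidal AUC from scikit-learn:

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
pos = rng.normal(1.0, 1.0, size=80)    # illustrative scores for positives
neg = rng.normal(0.0, 1.0, size=120)   # illustrative scores for negatives

scores = np.concatenate([pos, neg])
labels = np.array([1] * len(pos) + [0] * len(neg))

# U counts positive-negative pairs where the positive outranks the negative (ties count 0.5)
U = mannwhitneyu(pos, neg, alternative="two-sided").statistic
auc_from_u = U / (len(pos) * len(neg))

print(f"AUC via Mann-Whitney U: {auc_from_u:.4f}")
print(f"AUC via roc_auc_score:  {roc_auc_score(labels, scores):.4f}")  # identical
```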
Think of AUC as answering this question:
"If I show the model two examples—one that is truly positive and one that is truly negative—how often will it correctly say the positive one has a higher risk/probability/score?"
This interpretation is intuitive for stakeholders: "Our fraud model correctly ranks fraud above non-fraud 94% of the time."
```python
import numpy as np

def auc_mann_whitney(scores, labels):
    """
    Compute AUC using the Mann-Whitney U statistic interpretation.

    This directly computes the probability that a random positive
    is scored higher than a random negative.

    Time Complexity: O(P × N) - can be slow for large datasets
    """
    scores = np.array(scores)
    labels = np.array(labels)

    pos_scores = scores[labels == 1]
    neg_scores = scores[labels == 0]

    P = len(pos_scores)
    N = len(neg_scores)

    if P == 0 or N == 0:
        return 0.5  # Undefined, return random baseline

    # Count pairs where positive > negative
    correct = 0
    ties = 0
    for pos_score in pos_scores:
        for neg_score in neg_scores:
            if pos_score > neg_score:
                correct += 1
            elif pos_score == neg_score:
                ties += 1

    # AUC = (correct + 0.5 * ties) / (P * N)
    return (correct + 0.5 * ties) / (P * N)

# Efficient O(n log n) version using sorting
def auc_efficient(scores, labels):
    """
    Compute AUC efficiently using sorting and ranking.

    Time Complexity: O(n log n)
    """
    from scipy import stats

    scores = np.array(scores)
    labels = np.array(labels)

    # Use scipy's rankdata for efficient ranking
    ranks = stats.rankdata(scores)
    pos_ranks = ranks[labels == 1]

    P = len(pos_ranks)
    N = len(labels) - P

    if P == 0 or N == 0:
        return 0.5

    # Mann-Whitney U statistic formula:
    # sum of the positives' ranks minus the minimum possible rank sum
    rank_sum = np.sum(pos_ranks)
    min_rank_sum = P * (P + 1) / 2
    U = rank_sum - min_rank_sum

    auc = U / (P * N)
    return auc
```

In some domains—particularly insurance and credit scoring—you'll encounter the Gini coefficient (or Gini index) as an evaluation metric. The Gini coefficient is directly related to AUC.
The Gini coefficient measures how far the ROC curve is from the diagonal (random baseline):
$$\text{Gini} = 2 \times \text{AUC} - 1$$
Equivalently:
$$\text{AUC} = \frac{\text{Gini} + 1}{2}$$
| AUC | Gini | Interpretation |
|---|---|---|
| 1.0 | 1.0 | Perfect separation |
| 0.9 | 0.8 | Excellent model |
| 0.8 | 0.6 | Good model |
| 0.7 | 0.4 | Fair model |
| 0.5 | 0.0 | Random (no value) |
| 0.0 | -1.0 | Perfectly inverted (flip predictions) |
Gini's advantage: It's normalized with respect to random performance. A Gini of 0.4 means the model captures 40% of the possible lift above random, regardless of the problem's difficulty. This can be more intuitive for comparing models across different datasets.
AUC's advantage: It has the direct probabilistic interpretation (pairwise ranking probability) and is more standard in machine learning literature.
Both carry identical information—use whichever your domain prefers.
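As a convenience, the conversion can be wrapped in a small helper (a sketch using scikit-learn's `roc_auc_score`):

```python
from sklearn.metrics import roc_auc_score

def gini_coefficient(y_true, y_score):
    """Evaluation-metric Gini, computed as 2 * AUC - 1."""
    return 2 * roc_auc_score(y_true, y_score) - 1

# e.g. a model with AUC = 0.80 has Gini = 2 * 0.80 - 1 = 0.60
```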
Be careful: the 'Gini coefficient' in model evaluation is different from the 'Gini impurity' used in decision tree splitting (which measures node impurity). Same name, different concepts. Context determines meaning.
AUC is a sample statistic—it varies depending on which examples appear in the test set. Understanding its statistical properties enables rigorous model comparison.
The Mann-Whitney representation shows that AUC is a U-statistic—a statistic computed from averages over pairs of observations. U-statistics have well-understood asymptotic properties: they are unbiased estimators of their population counterpart and asymptotically normal, which justifies normal-approximation confidence intervals and tests.
The DeLong method (1988) provides an efficient way to estimate AUC variance, enabling confidence interval construction and hypothesis testing:
```python
import numpy as np
from scipy import stats

def delong_variance(scores, labels):
    """
    Compute DeLong variance estimate for AUC.

    Based on: DeLong, DeLong, Clarke-Pearson (1988)
    "Comparing the Areas under Two or More Correlated
    Receiver Operating Characteristic Curves"
    """
    scores = np.array(scores)
    labels = np.array(labels)

    pos_scores = scores[labels == 1]
    neg_scores = scores[labels == 0]

    P = len(pos_scores)
    N = len(neg_scores)

    # Placement values: for each positive, what fraction of
    # negatives are scored lower?
    V10 = np.zeros(P)  # Placements for positives
    for i, ps in enumerate(pos_scores):
        V10[i] = np.mean(ps > neg_scores) + 0.5 * np.mean(ps == neg_scores)

    # For each negative, what fraction of positives score higher?
    V01 = np.zeros(N)
    for j, ns in enumerate(neg_scores):
        V01[j] = np.mean(pos_scores > ns) + 0.5 * np.mean(pos_scores == ns)

    # AUC
    auc = np.mean(V10)

    # Variance components
    S10_var = np.var(V10, ddof=1) if P > 1 else 0
    S01_var = np.var(V01, ddof=1) if N > 1 else 0

    # DeLong variance of AUC
    var_auc = S10_var / P + S01_var / N

    return auc, var_auc

def auc_confidence_interval(scores, labels, alpha=0.05):
    """
    Compute AUC with confidence interval.

    Returns: (auc, lower_bound, upper_bound)
    """
    auc, var_auc = delong_variance(scores, labels)
    se = np.sqrt(var_auc)

    z = stats.norm.ppf(1 - alpha / 2)
    lower = max(0, auc - z * se)
    upper = min(1, auc + z * se)

    return auc, lower, upper

# Example usage
np.random.seed(42)
n = 200
scores = np.concatenate([
    np.random.normal(0.6, 0.2, 50),    # Positives
    np.random.normal(0.4, 0.2, 150)    # Negatives
])
labels = np.array([1]*50 + [0]*150)

auc, lower, upper = auc_confidence_interval(scores, labels)
print(f"AUC = {auc:.3f}")
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
```

AUC variance depends heavily on sample size, particularly on the number of positive examples (often the minority class):
| Positive Examples | Negative Examples | Approximate SE for AUC ~0.80 |
|---|---|---|
| 50 | 200 | ~0.04 |
| 100 | 400 | ~0.03 |
| 500 | 2000 | ~0.013 |
| 1000 | 4000 | ~0.009 |
Key insight: The minority class dominates variance. With only 50 positives, even AUC = 0.85 could have a 95% CI of [0.77, 0.93]. With 1000 positives, the same AUC might have CI [0.83, 0.87].
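The rough standard errors in the table can be reproduced with the Hanley-McNeil approximation, an older closed-form alternative to DeLong that needs only the AUC and the class counts; a minimal sketch:

```python
import numpy as np

def hanley_mcneil_se(auc, n_pos, n_neg):
    """Approximate standard error of AUC (Hanley & McNeil, 1982)."""
    q1 = auc / (2 - auc)          # P(two random positives both outrank one negative)
    q2 = 2 * auc**2 / (1 + auc)   # P(one positive outranks two random negatives)
    var = (auc * (1 - auc)
           + (n_pos - 1) * (q1 - auc**2)
           + (n_neg - 1) * (q2 - auc**2)) / (n_pos * n_neg)
    return np.sqrt(var)

for n_pos, n_neg in [(50, 200), (100, 400), (500, 2000), (1000, 4000)]:
    print(f"P={n_pos:5d}, N={n_neg:5d}  ->  SE ≈ {hanley_mcneil_se(0.80, n_pos, n_neg):.3f}")
```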
Never report a bare AUC value without context. For small test sets, always include confidence intervals or bootstrap estimates. An AUC of 0.78 with CI [0.65, 0.91] is very different from 0.78 with CI [0.76, 0.80].
When comparing two models, observing that Model A has AUC = 0.82 and Model B has AUC = 0.79 isn't sufficient to claim A is better. We need statistical testing.
When two models are evaluated on the same test set, their AUCs are correlated—they share the same examples. Ignoring this correlation gives the wrong variance for the difference in AUCs: with the usual positive correlation, a naive independent-samples estimate is too large, which masks real differences. A valid test must account for the shared examples.
The DeLong test (1988) properly accounts for this correlation:
```python
import numpy as np
from scipy import stats

def delong_test(scores1, scores2, labels):
    """
    Compare two classifiers' AUCs using DeLong's test.

    H0: AUC1 = AUC2 (no difference)
    H1: AUC1 ≠ AUC2 (two-sided test)

    Parameters:
    -----------
    scores1 : array-like
        Scores from first classifier
    scores2 : array-like
        Scores from second classifier (same examples)
    labels : array-like
        True labels

    Returns:
    --------
    z_stat : float
        Z-statistic for the difference
    p_value : float
        Two-sided p-value
    auc1, auc2 : float
        AUC of each classifier
    """
    scores1 = np.array(scores1)
    scores2 = np.array(scores2)
    labels = np.array(labels)

    pos_mask = labels == 1
    neg_mask = labels == 0

    pos_scores1 = scores1[pos_mask]
    pos_scores2 = scores2[pos_mask]
    neg_scores1 = scores1[neg_mask]
    neg_scores2 = scores2[neg_mask]

    P = len(pos_scores1)
    N = len(neg_scores1)

    # Placement values for each classifier
    def placements(pos_s, neg_s):
        V10 = np.array([np.mean(ps > neg_s) + 0.5 * np.mean(ps == neg_s)
                        for ps in pos_s])
        V01 = np.array([np.mean(pos_s > ns) + 0.5 * np.mean(pos_s == ns)
                        for ns in neg_s])
        return V10, V01

    V10_1, V01_1 = placements(pos_scores1, neg_scores1)
    V10_2, V01_2 = placements(pos_scores2, neg_scores2)

    auc1 = np.mean(V10_1)
    auc2 = np.mean(V10_2)

    # Covariance structure
    S10 = np.cov(V10_1, V10_2)
    S01 = np.cov(V01_1, V01_2)

    # Variance of (AUC1 - AUC2)
    S = S10 / P + S01 / N
    var_diff = S[0, 0] + S[1, 1] - 2 * S[0, 1]

    if var_diff <= 0:
        return 0.0, 1.0, auc1, auc2  # Identical, no difference

    z = (auc1 - auc2) / np.sqrt(var_diff)
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))

    return z, p_value, auc1, auc2

# Example usage
np.random.seed(42)
n = 300
true_signal = np.random.randn(n)
labels = (true_signal > 0).astype(int)

# Two models with different noise levels
scores1 = true_signal + np.random.randn(n) * 0.5  # Better model
scores2 = true_signal + np.random.randn(n) * 0.8  # Worse model

z, p, auc1, auc2 = delong_test(scores1, scores2, labels)
print(f"Model 1 AUC: {auc1:.3f}")
print(f"Model 2 AUC: {auc2:.3f}")
print(f"Difference: {auc1 - auc2:.3f}")
print(f"Z-statistic: {z:.3f}")
print(f"P-value: {p:.4f}")
print(f"Significant at α=0.05? {p < 0.05}")
```

When the DeLong test's normality assumptions may not hold (small samples), bootstrap resampling provides a non-parametric alternative:
```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.utils import resample

def bootstrap_auc_comparison(scores1, scores2, labels,
                             n_bootstrap=1000, confidence_level=0.95):
    """
    Compare AUCs using stratified bootstrap resampling.

    Returns confidence interval for (AUC1 - AUC2).
    """
    scores1 = np.asarray(scores1)
    scores2 = np.asarray(scores2)
    labels = np.asarray(labels)

    differences = []

    for _ in range(n_bootstrap):
        # Stratified resample of indices (maintain class balance)
        idx = resample(np.arange(len(labels)), stratify=labels)

        s1_boot = scores1[idx]
        s2_boot = scores2[idx]
        y_boot = labels[idx]

        auc1_boot = roc_auc_score(y_boot, s1_boot)
        auc2_boot = roc_auc_score(y_boot, s2_boot)

        differences.append(auc1_boot - auc2_boot)

    differences = np.array(differences)

    alpha = 1 - confidence_level
    lower = np.percentile(differences, 100 * alpha / 2)
    upper = np.percentile(differences, 100 * (1 - alpha / 2))

    # If the CI doesn't contain 0, the difference is significant
    significant = (lower > 0) or (upper < 0)

    return {
        'mean_diff': np.mean(differences),
        'std_diff': np.std(differences),
        'ci_lower': lower,
        'ci_upper': upper,
        'significant': significant
    }
```

With large test sets, even tiny AUC differences become statistically significant. But is a difference of 0.81 vs 0.80 practically meaningful? Consider both: statistical significance tells you the difference is real; practical significance tells you it matters.
AUC is widely used, but it's not universally appropriate. Understanding when AUC shines helps you apply it correctly.
A key strength of AUC: it's invariant to class imbalance. Unlike accuracy (which is dominated by the majority class), AUC treats positives and negatives symmetrically, because TPR is computed only from the positives and FPR only from the negatives; neither rate depends on the ratio of positives to negatives in the test set.
A dataset with 99% negatives produces the same AUC as one with 50% negatives if the classifier ranks examples identically.
AUC only cares about ranking, not absolute scores. Apply any strictly monotonic transformation to the scores (a logarithm, a sigmoid, a rescaling by a positive constant):
The AUC remains unchanged because the ranking of examples doesn't change.
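Both invariances are easy to verify empirically; a minimal sketch (the particular transformation and class ratio are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
pos = rng.normal(1.0, 1.0, size=100)     # positive-class scores
neg = rng.normal(0.0, 1.0, size=5000)    # many more negatives (imbalanced)

scores = np.concatenate([pos, neg])
labels = np.concatenate([np.ones(100), np.zeros(5000)])

auc_raw = roc_auc_score(labels, scores)

# Monotone transformation: a sigmoid changes the scores' scale but not their order
auc_sigmoid = roc_auc_score(labels, 1 / (1 + np.exp(-scores)))

# Subsample negatives to change prevalence from ~2% to 50%
keep = np.concatenate([np.arange(100), 100 + rng.choice(5000, 100, replace=False)])
auc_balanced = roc_auc_score(labels[keep], scores[keep])

print(f"AUC (imbalanced, raw scores): {auc_raw:.3f}")
print(f"AUC (sigmoid-transformed):    {auc_sigmoid:.3f}")   # exactly equal
print(f"AUC (balanced subsample):     {auc_balanced:.3f}")  # close, up to sampling noise
```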
This invariance is both a strength and a weakness: a strength, because AUC is unaffected by miscalibrated or arbitrarily scaled scores; a weakness, because a high AUC says nothing about whether the scores are calibrated probabilities you can threshold or act on directly.
AUC's very properties that make it useful can also make it misleading in specific contexts. Understanding these limitations is essential for proper model evaluation.
AUC averages performance across all thresholds. But you'll only ever deploy at ONE threshold. If Model A dominates at thresholds you'll use (say, FPR < 0.05) while Model B dominates elsewhere, Model B might have higher AUC but be worse for your application. AUC can choose the wrong model.
Consider fraud detection with 0.1% fraud (1 in 1,000 transactions): even an FPR of 0.01 produces roughly ten false alarms for every actual fraud, so only the very low-FPR portion of the curve is operationally usable. Yet AUC weights TPR gains uniformly along the FPR axis, treating improvements in the range [0.01, 0.10] the same, per unit of FPR, as improvements in [0.001, 0.01]. Operationally, the latter matters far more.
Solution: Use partial AUC (pAUC) that focuses on the relevant FPR region, or switch to Precision-Recall curves (covered in the next page).
| FPR Range | Model A TPR | Model B TPR | Better Model |
|---|---|---|---|
| 0.00 - 0.05 | 0.45 | 0.55 | B (high precision region) |
| 0.05 - 0.15 | 0.70 | 0.65 | A |
| 0.15 - 0.30 | 0.85 | 0.80 | A |
| 0.30 - 1.00 | 0.95 | 1.00 | B (high recall region) |
| AUC | 0.82 | 0.80 | A wins AUC contest |
If you need to operate at very low FPR (high precision required), Model B is better despite losing AUC. Always look at the ROC curves, not just AUC.
When you only care about a specific region of the ROC curve, partial AUC (pAUC) provides a more focused metric.
Partial AUC integrates only over a restricted FPR range [α, β]:
$$\text{pAUC}(\alpha, \beta) = \int_{\alpha}^{\beta} \text{TPR}(\text{FPR}) \, d(\text{FPR})$$
Common use cases: medical screening and diagnostic tests, where false positives trigger costly or invasive follow-up (restrict to a low-FPR interval such as [0, 0.05]); fraud and intrusion detection, where analyst capacity caps the tolerable alert rate; and biometric verification, where false accepts must remain rare.
Raw pAUC depends on the interval width. To compare across intervals, standardize:
$$\text{Standardized pAUC} = \frac{\text{pAUC}(\alpha, \beta)}{\beta - \alpha}$$
This represents the average TPR over the FPR range, lying in [0, 1].
McClish (1989) proposed a normalization that maps pAUC to [0, 1] where 0.5 represents random performance:
$$\text{McClish} = \frac{1}{2}\left(1 + \frac{\text{pAUC} - \text{min}}{\text{max} - \text{min}}\right)$$
where $\text{min} = \frac{\beta^2 - \alpha^2}{2}$ is the area under the chance diagonal over $[\alpha, \beta]$ (for $\alpha = 0$, simply $\beta^2/2$) and $\text{max} = \beta - \alpha$ is the area for a perfect classifier in that region.
```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def partial_auc(fpr, tpr, fpr_max=0.1, standardized=True):
    """
    Compute partial AUC for FPR in [0, fpr_max].

    Parameters:
    -----------
    fpr : array-like
        False positive rates from roc_curve
    tpr : array-like
        True positive rates from roc_curve
    fpr_max : float
        Maximum FPR to include (default 0.1)
    standardized : bool
        If True, return McClish-normalized pAUC (0.5 = random, 1.0 = perfect)

    Returns:
    --------
    float : Partial AUC
    """
    fpr = np.array(fpr)
    tpr = np.array(tpr)

    # Keep points within the FPR range of interest
    mask = fpr <= fpr_max
    if not np.any(mask):
        return 0.5 if standardized else 0.0

    fpr_clipped = fpr[mask]
    tpr_clipped = tpr[mask]

    # Add the boundary point at fpr_max via linear interpolation if needed
    if fpr_clipped[-1] < fpr_max:
        next_idx = np.searchsorted(fpr, fpr_max)
        if next_idx < len(fpr):
            frac = (fpr_max - fpr[next_idx - 1]) / (fpr[next_idx] - fpr[next_idx - 1])
            tpr_at_max = tpr[next_idx - 1] + frac * (tpr[next_idx] - tpr[next_idx - 1])
            fpr_clipped = np.append(fpr_clipped, fpr_max)
            tpr_clipped = np.append(tpr_clipped, tpr_at_max)

    # Trapezoidal integration over [0, fpr_max]
    pauc = np.trapz(tpr_clipped, fpr_clipped)

    if standardized:
        # McClish normalization
        min_val = 0.5 * fpr_max * fpr_max   # Random baseline area
        max_val = fpr_max                   # Perfect classifier area
        return 0.5 * (1 + (pauc - min_val) / (max_val - min_val))

    return pauc

# Example: Compare at low FPR
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

lr = LogisticRegression().fit(X_train, y_train)
gb = GradientBoostingClassifier().fit(X_train, y_train)

lr_probs = lr.predict_proba(X_test)[:, 1]
gb_probs = gb.predict_proba(X_test)[:, 1]

lr_fpr, lr_tpr, _ = roc_curve(y_test, lr_probs)
gb_fpr, gb_tpr, _ = roc_curve(y_test, gb_probs)

print("Full AUC:")
print(f"  Logistic Regression: {roc_auc_score(y_test, lr_probs):.3f}")
print(f"  Gradient Boosting:   {roc_auc_score(y_test, gb_probs):.3f}")

print("\nPartial AUC (FPR ≤ 0.1, standardized):")
print(f"  Logistic Regression: {partial_auc(lr_fpr, lr_tpr, 0.1):.3f}")
print(f"  Gradient Boosting:   {partial_auc(gb_fpr, gb_tpr, 0.1):.3f}")
```

scikit-learn's roc_auc_score supports partial AUC via the max_fpr parameter: `roc_auc_score(y_true, y_score, max_fpr=0.1)`. This returns the McClish-normalized partial AUC.
We've developed a comprehensive understanding of AUC—from mathematical foundations to practical guidance on appropriate use. Let's consolidate the key insights:
- AUC is the area under the ROC curve and, equivalently, the probability that a randomly chosen positive is scored above a randomly chosen negative (the Mann-Whitney U connection).
- It is invariant to class imbalance and to strictly monotonic transformations of the scores; it measures ranking quality, not calibration or accuracy at a threshold.
- AUC is a sample statistic: report confidence intervals, and use the DeLong test or bootstrap resampling when comparing models evaluated on the same test set.
- When only part of the ROC curve matters operationally, summarize with partial AUC or inspect the curves directly rather than relying on the single number.
What's next:
The next page covers Precision-Recall Curves—an alternative to ROC analysis that's often more informative for imbalanced problems. We'll explore their construction, the relationship between PR and ROC curves, and guidance on when to use each.
You now possess a rigorous understanding of AUC: its mathematical definition, probabilistic interpretation, statistical properties, and appropriate use cases. You can confidently use AUC for model comparison while understanding its limitations.