The ROC curve provides a complete picture of classifier behavior across all thresholds—but that completeness comes with complexity. Comparing two ROC curves visually works for simple cases, but what if curves cross? What if we need to rank 50 models programmatically? What if we need a single number for optimization or reporting?
This is where the Area Under the ROC Curve (AUC) becomes indispensable. AUC collapses an entire curve into a single scalar value between 0 and 1, enabling objective comparison of models whose curves cross, programmatic ranking of many candidates, and a single number for optimization and reporting.
But reducing a curve to a number involves tradeoffs. When is AUC the right summary? When does it mislead? This page provides the deep understanding needed to answer these questions confidently.
By the end of this page, you will understand AUC's mathematical definition, its probabilistic interpretation as a ranking metric, efficient computation algorithms, statistical properties and confidence intervals, and critical guidance on when AUC is appropriate versus when alternative metrics are needed.
The Area Under the Curve (AUC) is exactly what the name suggests: the area between the ROC curve and the x-axis. But this simple geometric definition has profound implications.
Given an ROC curve as a function TPR(FPR), the AUC is:
$$\text{AUC} = \int_0^1 \text{TPR}(\text{FPR}) \, d(\text{FPR})$$
Alternatively, parameterizing by threshold τ:
$$\text{AUC} = -\int_{\tau=-\infty}^{\tau=+\infty} \text{TPR}(\tau) \cdot \frac{d\text{FPR}(\tau)}{d\tau} \, d\tau$$
In practice, we have discrete points on the ROC curve. The trapezoidal rule integrates the area under the step function:
```python
import numpy as np

def auc_trapezoidal(fpr, tpr):
    """
    Compute AUC using the trapezoidal rule.

    Parameters:
    -----------
    fpr : array-like
        False positive rates (sorted in ascending order)
    tpr : array-like
        Corresponding true positive rates

    Returns:
    --------
    float : Area under the ROC curve
    """
    fpr = np.array(fpr)
    tpr = np.array(tpr)

    # Ensure proper ordering
    order = np.argsort(fpr)
    fpr = fpr[order]
    tpr = tpr[order]

    # Trapezoidal integration
    # Area = sum of trapezoid areas between consecutive points
    # Trapezoid area = (base) × (average of heights)
    #                = (fpr[i+1] - fpr[i]) × (tpr[i+1] + tpr[i]) / 2
    auc = 0.0
    for i in range(len(fpr) - 1):
        width = fpr[i + 1] - fpr[i]
        height = (tpr[i] + tpr[i + 1]) / 2
        auc += width * height

    return auc

# Equivalent one-liner with numpy:
# auc = np.trapz(tpr, fpr)

# Example usage
fpr_example = [0.0, 0.1, 0.2, 0.5, 1.0]
tpr_example = [0.0, 0.4, 0.6, 0.9, 1.0]
print(f"AUC = {auc_trapezoidal(fpr_example, tpr_example):.4f}")
```

| AUC Value | Interpretation |
|---|---|
| 1.0 | Perfect classifier — all positives ranked above all negatives |
| 0.9 - 1.0 | Excellent — strong separation between classes |
| 0.8 - 0.9 | Good — useful discrimination |
| 0.7 - 0.8 | Fair — moderate separation |
| 0.5 - 0.7 | Poor — weak discrimination, approaching random |
| 0.5 | Random — no better than chance |
| 0.0 - 0.5 | Worse than random — invert predictions to improve |
| 0.0 | Perfectly inverted — flip predictions for perfect classifier |
If you swap every positive and negative label, the ROC curve reflects across the diagonal, and AUC becomes 1 - AUC. A classifier with AUC = 0.3 becomes AUC = 0.7 after label swapping. This symmetry explains why AUC < 0.5 indicates useful signal (just inverted).
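A quick way to see this symmetry is to score the same predictions against flipped labels; a small sketch using scikit-learn's `roc_auc_score` (the synthetic data here is purely illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=500)                  # synthetic binary labels
scores = labels * 0.5 + rng.normal(0, 0.5, size=500)   # noisy scores correlated with the labels

auc = roc_auc_score(labels, scores)
auc_flipped = roc_auc_score(1 - labels, scores)        # swap positive and negative labels

print(f"AUC           = {auc:.3f}")
print(f"AUC (flipped) = {auc_flipped:.3f}")
print(f"Sum           = {auc + auc_flipped:.3f}")      # ≈ 1.0
```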
The most profound insight about AUC is its probabilistic interpretation. This isn't just a mathematical curiosity—it fundamentally shapes how we should think about AUC as a metric.
AUC equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative example:
$$\text{AUC} = P(\text{score}(x^+) > \text{score}(x^-))$$
where $x^+$ is a random positive instance and $x^-$ is a random negative instance.
Handling ties: $$\text{AUC} = P(\text{score}(x^+) > \text{score}(x^-)) + \frac{1}{2}P(\text{score}(x^+) = \text{score}(x^-))$$
This is exactly the Wilcoxon rank-sum statistic (Mann-Whitney U), connecting ROC analysis to non-parametric statistics.
AUC measures RANKING QUALITY, not classification accuracy.
If you randomly pick one positive and one negative example, AUC is the probability the classifier ranks the positive higher. This makes AUC ideal for comparing models on their ability to ORDER examples by class likelihood, independent of threshold choice.
The connection between the geometric AUC and the probabilistic interpretation can be proven by considering all positive-negative pairs:
1. For each positive example $x_i^+$ with score $s_i^+$, count how many negative examples $x_j^-$ with score $s_j^-$ satisfy $s_i^+ > s_j^-$.
2. Sum these counts across all positives.
3. Divide by the total number of pairs (P × N).
This count-based calculation exactly equals the trapezoidal AUC.
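To check the equivalence numerically, the sketch below (assuming a recent SciPy, where `mannwhitneyu` returns the U statistic for the first sample) divides the Mann-Whitney U by P × N and compares it with the trapezoidal AUC from scikit-learn:

```python
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
pos = rng.normal(1.0, 1.0, size=80)    # illustrative scores for positives
neg = rng.normal(0.0, 1.0, size=120)   # illustrative scores for negatives

scores = np.concatenate([pos, neg])
labels = np.array([1] * len(pos) + [0] * len(neg))

# U counts positive-negative pairs where the positive outranks the negative (ties count 0.5)
U = mannwhitneyu(pos, neg, alternative="two-sided").statistic
auc_from_u = U / (len(pos) * len(neg))

print(f"AUC via Mann-Whitney U: {auc_from_u:.4f}")
print(f"AUC via roc_auc_score:  {roc_auc_score(labels, scores):.4f}")  # identical
```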
Think of AUC as answering this question:
"If I show the model two examples—one that is truly positive and one that is truly negative—how often will it correctly say the positive one has a higher risk/probability/score?"
This interpretation is intuitive for stakeholders: "Our fraud model correctly ranks fraud above non-fraud 94% of the time."
```python
import numpy as np

def auc_mann_whitney(scores, labels):
    """
    Compute AUC using the Mann-Whitney U statistic interpretation.

    This directly computes the probability that a random positive
    is scored higher than a random negative.

    Time Complexity: O(P × N) - can be slow for large datasets
    """
    scores = np.array(scores)
    labels = np.array(labels)

    pos_scores = scores[labels == 1]
    neg_scores = scores[labels == 0]

    P = len(pos_scores)
    N = len(neg_scores)

    if P == 0 or N == 0:
        return 0.5  # Undefined, return random baseline

    # Count pairs where positive > negative
    correct = 0
    ties = 0
    for pos_score in pos_scores:
        for neg_score in neg_scores:
            if pos_score > neg_score:
                correct += 1
            elif pos_score == neg_score:
                ties += 1

    # AUC = (correct + 0.5 * ties) / (P * N)
    return (correct + 0.5 * ties) / (P * N)

# Efficient O(n log n) version using sorting
def auc_efficient(scores, labels):
    """
    Compute AUC efficiently using sorting and ranking.

    Time Complexity: O(n log n)
    """
    from scipy import stats

    scores = np.array(scores)
    labels = np.array(labels)

    # Use scipy's rankdata for efficient ranking
    ranks = stats.rankdata(scores)
    pos_ranks = ranks[labels == 1]

    P = len(pos_ranks)
    N = len(labels) - P

    if P == 0 or N == 0:
        return 0.5

    # Mann-Whitney U statistic formula:
    # sum of the positives' ranks minus the minimum possible rank sum
    rank_sum = np.sum(pos_ranks)
    min_rank_sum = P * (P + 1) / 2
    U = rank_sum - min_rank_sum

    auc = U / (P * N)
    return auc
```

In some domains—particularly insurance and credit scoring—you'll encounter the Gini coefficient (or Gini index) as an evaluation metric. The Gini coefficient is directly related to AUC.
The Gini coefficient measures how far the ROC curve is from the diagonal (random baseline):
$$\text{Gini} = 2 \times \text{AUC} - 1$$
Equivalently:
$$\text{AUC} = \frac{\text{Gini} + 1}{2}$$
| AUC | Gini | Interpretation |
|---|---|---|
| 1.0 | 1.0 | Perfect separation |
| 0.9 | 0.8 | Excellent model |
| 0.8 | 0.6 | Good model |
| 0.7 | 0.4 | Fair model |
| 0.5 | 0.0 | Random (no value) |
| 0.0 | -1.0 | Perfectly inverted (flip predictions) |
Gini's advantage: It's normalized with respect to random performance. A Gini of 0.4 means the model captures 40% of the possible lift above random, regardless of the problem's difficulty. This can be more intuitive for comparing models across different datasets.
AUC's advantage: It has the direct probabilistic interpretation (pairwise ranking probability) and is more standard in machine learning literature.
Both carry identical information—use whichever your domain prefers.
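As a convenience, the conversion can be wrapped in a small helper (a sketch using scikit-learn's `roc_auc_score`):

```python
from sklearn.metrics import roc_auc_score

def gini_coefficient(y_true, y_score):
    """Evaluation-metric Gini, computed as 2 * AUC - 1."""
    return 2 * roc_auc_score(y_true, y_score) - 1

# e.g. a model with AUC = 0.80 has Gini = 2 * 0.80 - 1 = 0.60
```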
Be careful: the 'Gini coefficient' in model evaluation is different from the 'Gini impurity' used in decision tree splitting (which measures node impurity). Same name, different concepts. Context determines meaning.
AUC is a sample statistic—it varies depending on which examples appear in the test set. Understanding its statistical properties enables rigorous model comparison.
The Mann-Whitney representation shows that AUC is a U-statistic—a statistic computed from averages over pairs of observations. U-statistics have well-understood asymptotic properties: they are unbiased estimators of their population counterpart and asymptotically normal, which justifies normal-approximation confidence intervals and tests.
The DeLong method (1988) provides an efficient way to estimate AUC variance, enabling confidence interval construction and hypothesis testing:
```python
import numpy as np
from scipy import stats

def delong_variance(scores, labels):
    """
    Compute DeLong variance estimate for AUC.

    Based on: DeLong, DeLong, Clarke-Pearson (1988)
    "Comparing the Areas under Two or More Correlated
    Receiver Operating Characteristic Curves"
    """
    scores = np.array(scores)
    labels = np.array(labels)

    pos_scores = scores[labels == 1]
    neg_scores = scores[labels == 0]

    P = len(pos_scores)
    N = len(neg_scores)

    # Placement values: for each positive, what fraction of
    # negatives are scored lower?
    V10 = np.zeros(P)  # Placements for positives
    for i, ps in enumerate(pos_scores):
        V10[i] = np.mean(ps > neg_scores) + 0.5 * np.mean(ps == neg_scores)

    # For each negative, what fraction of positives score higher?
    V01 = np.zeros(N)
    for j, ns in enumerate(neg_scores):
        V01[j] = np.mean(pos_scores > ns) + 0.5 * np.mean(pos_scores == ns)

    # AUC
    auc = np.mean(V10)

    # Variance components
    S10_var = np.var(V10, ddof=1) if P > 1 else 0
    S01_var = np.var(V01, ddof=1) if N > 1 else 0

    # DeLong variance of AUC
    var_auc = S10_var / P + S01_var / N

    return auc, var_auc

def auc_confidence_interval(scores, labels, alpha=0.05):
    """
    Compute AUC with confidence interval.

    Returns: (auc, lower_bound, upper_bound)
    """
    auc, var_auc = delong_variance(scores, labels)
    se = np.sqrt(var_auc)

    z = stats.norm.ppf(1 - alpha / 2)
    lower = max(0, auc - z * se)
    upper = min(1, auc + z * se)

    return auc, lower, upper

# Example usage
np.random.seed(42)
n = 200
scores = np.concatenate([
    np.random.normal(0.6, 0.2, 50),    # Positives
    np.random.normal(0.4, 0.2, 150)    # Negatives
])
labels = np.array([1]*50 + [0]*150)

auc, lower, upper = auc_confidence_interval(scores, labels)
print(f"AUC = {auc:.3f}")
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
```

AUC variance depends heavily on sample size, particularly on the number of positive examples (often the minority class):
| Positive Examples | Negative Examples | Approximate SE for AUC ~0.80 |
|---|---|---|
| 50 | 200 | ~0.04 |
| 100 | 400 | ~0.03 |
| 500 | 2000 | ~0.013 |
| 1000 | 4000 | ~0.009 |
Key insight: The minority class dominates variance. With only 50 positives, even AUC = 0.85 could have a 95% CI of [0.77, 0.93]. With 1000 positives, the same AUC might have CI [0.83, 0.87].
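The rough standard errors in the table can be reproduced with the Hanley-McNeil approximation, an older closed-form alternative to DeLong that needs only the AUC and the class counts; a minimal sketch:

```python
import numpy as np

def hanley_mcneil_se(auc, n_pos, n_neg):
    """Approximate standard error of AUC (Hanley & McNeil, 1982)."""
    q1 = auc / (2 - auc)          # P(two random positives both outrank one negative)
    q2 = 2 * auc**2 / (1 + auc)   # P(one positive outranks two random negatives)
    var = (auc * (1 - auc)
           + (n_pos - 1) * (q1 - auc**2)
           + (n_neg - 1) * (q2 - auc**2)) / (n_pos * n_neg)
    return np.sqrt(var)

for n_pos, n_neg in [(50, 200), (100, 400), (500, 2000), (1000, 4000)]:
    print(f"P={n_pos:5d}, N={n_neg:5d}  ->  SE ≈ {hanley_mcneil_se(0.80, n_pos, n_neg):.3f}")
```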
Never report a bare AUC value without context. For small test sets, always include confidence intervals or bootstrap estimates. An AUC of 0.78 with CI [0.65, 0.91] is very different from 0.78 with CI [0.76, 0.80].
When comparing two models, observing that Model A has AUC = 0.82 and Model B has AUC = 0.79 isn't sufficient to claim A is better. We need statistical testing.
When two models are evaluated on the same test set, their AUCs are correlated—they share the same examples. Ignoring this correlation gives the wrong variance for the difference in AUCs: with the usual positive correlation, a naive independent-samples estimate is too large, which masks real differences. A valid test must account for the shared examples.
The DeLong test (1988) properly accounts for this correlation:
```python
import numpy as np
from scipy import stats

def delong_test(scores1, scores2, labels):
    """
    Compare two classifiers' AUCs using DeLong's test.

    H0: AUC1 = AUC2 (no difference)
    H1: AUC1 ≠ AUC2 (two-sided test)

    Parameters:
    -----------
    scores1 : array-like
        Scores from first classifier
    scores2 : array-like
        Scores from second classifier (same examples)
    labels : array-like
        True labels

    Returns:
    --------
    z_stat : float
        Z-statistic for the difference
    p_value : float
        Two-sided p-value
    auc1, auc2 : float
        AUC of each classifier
    """
    scores1 = np.array(scores1)
    scores2 = np.array(scores2)
    labels = np.array(labels)

    pos_mask = labels == 1
    neg_mask = labels == 0

    pos_scores1 = scores1[pos_mask]
    pos_scores2 = scores2[pos_mask]
    neg_scores1 = scores1[neg_mask]
    neg_scores2 = scores2[neg_mask]

    P = len(pos_scores1)
    N = len(neg_scores1)

    # Placement values for each classifier
    def placements(pos_s, neg_s):
        V10 = np.array([np.mean(ps > neg_s) + 0.5 * np.mean(ps == neg_s)
                        for ps in pos_s])
        V01 = np.array([np.mean(pos_s > ns) + 0.5 * np.mean(pos_s == ns)
                        for ns in neg_s])
        return V10, V01

    V10_1, V01_1 = placements(pos_scores1, neg_scores1)
    V10_2, V01_2 = placements(pos_scores2, neg_scores2)

    auc1 = np.mean(V10_1)
    auc2 = np.mean(V10_2)

    # Covariance structure
    S10 = np.cov(V10_1, V10_2)
    S01 = np.cov(V01_1, V01_2)

    # Variance of (AUC1 - AUC2)
    S = S10 / P + S01 / N
    var_diff = S[0, 0] + S[1, 1] - 2 * S[0, 1]

    if var_diff <= 0:
        return 0.0, 1.0, auc1, auc2  # Identical, no difference

    z = (auc1 - auc2) / np.sqrt(var_diff)
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))

    return z, p_value, auc1, auc2

# Example usage
np.random.seed(42)
n = 300
true_signal = np.random.randn(n)
labels = (true_signal > 0).astype(int)

# Two models with different noise levels
scores1 = true_signal + np.random.randn(n) * 0.5  # Better model
scores2 = true_signal + np.random.randn(n) * 0.8  # Worse model

z, p, auc1, auc2 = delong_test(scores1, scores2, labels)
print(f"Model 1 AUC: {auc1:.3f}")
print(f"Model 2 AUC: {auc2:.3f}")
print(f"Difference: {auc1 - auc2:.3f}")
print(f"Z-statistic: {z:.3f}")
print(f"P-value: {p:.4f}")
print(f"Significant at α=0.05? {p < 0.05}")
```

When the DeLong test's normality assumptions may not hold (small samples), bootstrap resampling provides a non-parametric alternative:
```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.utils import resample

def bootstrap_auc_comparison(scores1, scores2, labels,
                             n_bootstrap=1000, confidence_level=0.95):
    """
    Compare AUCs using stratified bootstrap resampling.

    Returns confidence interval for (AUC1 - AUC2).
    """
    scores1 = np.asarray(scores1)
    scores2 = np.asarray(scores2)
    labels = np.asarray(labels)

    differences = []

    for _ in range(n_bootstrap):
        # Stratified resample of indices (maintain class balance)
        idx = resample(np.arange(len(labels)), stratify=labels)

        s1_boot = scores1[idx]
        s2_boot = scores2[idx]
        y_boot = labels[idx]

        auc1_boot = roc_auc_score(y_boot, s1_boot)
        auc2_boot = roc_auc_score(y_boot, s2_boot)

        differences.append(auc1_boot - auc2_boot)

    differences = np.array(differences)

    alpha = 1 - confidence_level
    lower = np.percentile(differences, 100 * alpha / 2)
    upper = np.percentile(differences, 100 * (1 - alpha / 2))

    # If the CI doesn't contain 0, the difference is significant
    significant = (lower > 0) or (upper < 0)

    return {
        'mean_diff': np.mean(differences),
        'std_diff': np.std(differences),
        'ci_lower': lower,
        'ci_upper': upper,
        'significant': significant
    }
```

With large test sets, even tiny AUC differences become statistically significant. But is a difference of 0.81 vs 0.80 practically meaningful? Consider both: statistical significance tells you the difference is real; practical significance tells you it matters.
AUC is widely used, but it's not universally appropriate. Understanding when AUC shines helps you apply it correctly.
A key strength of AUC: it's invariant to class imbalance. Unlike accuracy (which is dominated by the majority class), AUC treats positives and negatives symmetrically, because TPR is computed only from the positives and FPR only from the negatives; neither rate depends on the ratio of positives to negatives in the test set.
A dataset with 99% negatives produces the same AUC as one with 50% negatives if the classifier ranks examples identically.
AUC only cares about ranking, not absolute scores. Apply any strictly monotonic transformation to the scores (a logarithm, a sigmoid, a rescaling by a positive constant):
The AUC remains unchanged because the ranking of examples doesn't change.
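Both invariances are easy to verify empirically; a minimal sketch (the particular transformation and class ratio are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
pos = rng.normal(1.0, 1.0, size=100)     # positive-class scores
neg = rng.normal(0.0, 1.0, size=5000)    # many more negatives (imbalanced)

scores = np.concatenate([pos, neg])
labels = np.concatenate([np.ones(100), np.zeros(5000)])

auc_raw = roc_auc_score(labels, scores)

# Monotone transformation: a sigmoid changes the scores' scale but not their order
auc_sigmoid = roc_auc_score(labels, 1 / (1 + np.exp(-scores)))

# Subsample negatives to change prevalence from ~2% to 50%
keep = np.concatenate([np.arange(100), 100 + rng.choice(5000, 100, replace=False)])
auc_balanced = roc_auc_score(labels[keep], scores[keep])

print(f"AUC (imbalanced, raw scores): {auc_raw:.3f}")
print(f"AUC (sigmoid-transformed):    {auc_sigmoid:.3f}")   # exactly equal
print(f"AUC (balanced subsample):     {auc_balanced:.3f}")  # close, up to sampling noise
```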
This invariance is both a strength and a weakness: a strength, because AUC is unaffected by miscalibrated or arbitrarily scaled scores; a weakness, because a high AUC says nothing about whether the scores are calibrated probabilities you can threshold or act on directly.
AUC's very properties that make it useful can also make it misleading in specific contexts. Understanding these limitations is essential for proper model evaluation.
AUC averages performance across all thresholds. But you'll only ever deploy at ONE threshold. If Model A dominates at thresholds you'll use (say, FPR < 0.05) while Model B dominates elsewhere, Model B might have higher AUC but be worse for your application. AUC can choose the wrong model.
Consider fraud detection with 0.1% fraud (1 in 1,000 transactions): even an FPR of 0.01 produces roughly ten false alarms for every actual fraud, so only the very low-FPR portion of the curve is operationally usable. Yet AUC weights TPR gains uniformly along the FPR axis, treating improvements in the range [0.01, 0.10] the same, per unit of FPR, as improvements in [0.001, 0.01]. Operationally, the latter matters far more.
Solution: Use partial AUC (pAUC) that focuses on the relevant FPR region, or switch to Precision-Recall curves (covered in the next page).
| FPR Range | Model A TPR | Model B TPR | Better Model |
|---|---|---|---|
| 0.00 - 0.05 | 0.45 | 0.55 | B (high precision region) |
| 0.05 - 0.15 | 0.70 | 0.65 | A |
| 0.15 - 0.30 | 0.85 | 0.80 | A |
| 0.30 - 1.00 | 0.95 | 1.00 | B (high recall region) |
| AUC | 0.82 | 0.80 | A wins AUC contest |
If you need to operate at very low FPR (high precision required), Model B is better despite losing AUC. Always look at the ROC curves, not just AUC.
When you only care about a specific region of the ROC curve, partial AUC (pAUC) provides a more focused metric.
Partial AUC integrates only over a restricted FPR range [α, β]:
$$\text{pAUC}(\alpha, \beta) = \int_{\alpha}^{\beta} \text{TPR}(\text{FPR}) \, d(\text{FPR})$$
Common use cases: medical screening and diagnostic tests, where false positives trigger costly or invasive follow-up (restrict to a low-FPR interval such as [0, 0.05]); fraud and intrusion detection, where analyst capacity caps the tolerable alert rate; and biometric verification, where false accepts must remain rare.
Raw pAUC depends on the interval width. To compare across intervals, standardize:
$$\text{Standardized pAUC} = \frac{\text{pAUC}(\alpha, \beta)}{\beta - \alpha}$$
This represents the average TPR over the FPR range, lying in [0, 1].
McClish (1989) proposed a normalization that maps pAUC to [0, 1] where 0.5 represents random performance:
$$\text{McClish} = \frac{1}{2}\left(1 + \frac{\text{pAUC} - \text{min}}{\text{max} - \text{min}}\right)$$
where $\text{min} = \frac{\beta^2 - \alpha^2}{2}$ is the area under the chance diagonal over $[\alpha, \beta]$ (for $\alpha = 0$, simply $\beta^2/2$) and $\text{max} = \beta - \alpha$ is the area for a perfect classifier in that region.
```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def partial_auc(fpr, tpr, fpr_max=0.1, standardized=True):
    """
    Compute partial AUC for FPR in [0, fpr_max].

    Parameters:
    -----------
    fpr : array-like
        False positive rates from roc_curve
    tpr : array-like
        True positive rates from roc_curve
    fpr_max : float
        Maximum FPR to include (default 0.1)
    standardized : bool
        If True, return McClish-normalized pAUC (0.5 = random, 1.0 = perfect)

    Returns:
    --------
    float : Partial AUC
    """
    fpr = np.array(fpr)
    tpr = np.array(tpr)

    # Keep points within the FPR range of interest
    mask = fpr <= fpr_max
    if not np.any(mask):
        return 0.5 if standardized else 0.0

    fpr_clipped = fpr[mask]
    tpr_clipped = tpr[mask]

    # Add the boundary point at fpr_max via linear interpolation if needed
    if fpr_clipped[-1] < fpr_max:
        next_idx = np.searchsorted(fpr, fpr_max)
        if next_idx < len(fpr):
            frac = (fpr_max - fpr[next_idx - 1]) / (fpr[next_idx] - fpr[next_idx - 1])
            tpr_at_max = tpr[next_idx - 1] + frac * (tpr[next_idx] - tpr[next_idx - 1])
            fpr_clipped = np.append(fpr_clipped, fpr_max)
            tpr_clipped = np.append(tpr_clipped, tpr_at_max)

    # Trapezoidal integration over [0, fpr_max]
    pauc = np.trapz(tpr_clipped, fpr_clipped)

    if standardized:
        # McClish normalization
        min_val = 0.5 * fpr_max * fpr_max   # Random baseline area
        max_val = fpr_max                   # Perfect classifier area
        return 0.5 * (1 + (pauc - min_val) / (max_val - min_val))

    return pauc

# Example: Compare at low FPR
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

lr = LogisticRegression().fit(X_train, y_train)
gb = GradientBoostingClassifier().fit(X_train, y_train)

lr_probs = lr.predict_proba(X_test)[:, 1]
gb_probs = gb.predict_proba(X_test)[:, 1]

lr_fpr, lr_tpr, _ = roc_curve(y_test, lr_probs)
gb_fpr, gb_tpr, _ = roc_curve(y_test, gb_probs)

print("Full AUC:")
print(f"  Logistic Regression: {roc_auc_score(y_test, lr_probs):.3f}")
print(f"  Gradient Boosting:   {roc_auc_score(y_test, gb_probs):.3f}")

print("\nPartial AUC (FPR ≤ 0.1, standardized):")
print(f"  Logistic Regression: {partial_auc(lr_fpr, lr_tpr, 0.1):.3f}")
print(f"  Gradient Boosting:   {partial_auc(gb_fpr, gb_tpr, 0.1):.3f}")
```

scikit-learn's roc_auc_score supports partial AUC via the max_fpr parameter: `roc_auc_score(y_true, y_score, max_fpr=0.1)`. This returns the McClish-normalized partial AUC.
We've developed a comprehensive understanding of AUC—from mathematical foundations to practical guidance on appropriate use. Let's consolidate the key insights:
- AUC is the area under the ROC curve and, equivalently, the probability that a randomly chosen positive is scored above a randomly chosen negative (the Mann-Whitney U connection).
- It is invariant to class imbalance and to strictly monotonic transformations of the scores; it measures ranking quality, not calibration or accuracy at a threshold.
- AUC is a sample statistic: report confidence intervals, and use the DeLong test or bootstrap resampling when comparing models evaluated on the same test set.
- When only part of the ROC curve matters operationally, summarize with partial AUC or inspect the curves directly rather than relying on the single number.
What's next:
The next page covers Precision-Recall Curves—an alternative to ROC analysis that's often more informative for imbalanced problems. We'll explore their construction, the relationship between PR and ROC curves, and guidance on when to use each.
You now possess a rigorous understanding of AUC: its mathematical definition, probabilistic interpretation, statistical properties, and appropriate use cases. You can confidently use AUC for model comparison while understanding its limitations.