The Area Under the ROC Curve (AUC) is a beloved metric—threshold-independent, interpretable as a ranking probability, and robust to class imbalance. But AUC has a subtle flaw: it treats all parts of the ROC curve equally, even regions that are operationally irrelevant.
Consider a medical screening test: false positives send healthy patients to anxiety-inducing, costly follow-up procedures, so only the low-FPR portion of the ROC curve matters in practice.
Or consider fraud detection: analysts can manually investigate only a small fraction of flagged transactions, so performance at high FPR is operationally irrelevant.
Partial AUC (pAUC) addresses this by computing the area under only the relevant portion of the ROC curve—typically the region where FPR is below some threshold. This focuses evaluation on the regime where the model will actually be deployed.
By the end of this page, you will understand pAUC from first principles—its definition, computation, normalization variants, relationship to full AUC, and appropriate use cases. You will be able to compute, normalize, and interpret pAUC for practical model evaluation in constrained operating regimes.
Before defining pAUC, let's establish the ROC curve foundation.
ROC Curve Definition:
For a binary classifier with continuous scores, the ROC curve plots the true positive rate, $\text{TPR} = \frac{TP}{TP + FN}$, on the y-axis against the false positive rate, $\text{FPR} = \frac{FP}{FP + TN}$, on the x-axis, as the classification threshold varies from $+\infty$ (predict all negative) to $-\infty$ (predict all positive).
Key Properties:
- The curve runs from (0, 0) (predict all negative) to (1, 1) (predict all positive).
- TPR is non-decreasing as FPR increases (i.e., as the threshold is lowered).
- A random classifier traces the diagonal TPR = FPR; better-than-chance models lie above it.
Full AUC:
$$\text{AUC} = \int_0^1 \text{TPR}(\text{FPR}) \, d(\text{FPR})$$
Equivalently, AUC equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative example:
$$\text{AUC} = P(s^+ > s^-)$$
where $s^+$ and $s^-$ are scores for random positive and negative examples.
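To make this ranking interpretation concrete, the short sketch below checks it numerically. The synthetic scores are illustrative assumptions, not data from this page, and ties are counted as one half:

```python
# Minimal sketch: AUC equals the empirical probability that a random positive
# outranks a random negative (ties counted as 1/2). Synthetic scores only.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
scores_pos = rng.normal(0.7, 0.2, 100)    # scores for positive examples
scores_neg = rng.normal(0.3, 0.2, 900)    # scores for negative examples

y_true = np.concatenate([np.ones(100), np.zeros(900)])
y_scores = np.concatenate([scores_pos, scores_neg])

diff = scores_pos[:, None] - scores_neg[None, :]          # all (pos, neg) pairs
pairwise = (diff > 0).mean() + 0.5 * (diff == 0).mean()   # estimate of P(s+ > s-)

print(f"Pairwise P(s+ > s-): {pairwise:.4f}")
print(f"roc_auc_score:       {roc_auc_score(y_true, y_scores):.4f}")
```

The two numbers agree, which is exactly the pairwise-ranking reading of AUC.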
```python
import numpy as np
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

def generate_example_roc():
    """Generate example ROC curve data for illustration."""
    np.random.seed(42)
    n_pos = 100
    n_neg = 900

    # Positive scores: higher (good model)
    scores_pos = np.random.normal(0.7, 0.2, n_pos)
    # Negative scores: lower
    scores_neg = np.random.normal(0.3, 0.2, n_neg)

    y_true = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])
    y_scores = np.concatenate([scores_pos, scores_neg])

    fpr, tpr, thresholds = roc_curve(y_true, y_scores)
    roc_auc = auc(fpr, tpr)

    return fpr, tpr, roc_auc

fpr, tpr, roc_auc = generate_example_roc()

print("ROC Curve Analysis")
print("=" * 50)
print(f"Full AUC: {roc_auc:.4f}")

# Show key points on the curve
print("\nKey points on ROC curve:")
print(f"{'FPR':<10} {'TPR':<10} {'Description'}")
print("-" * 40)

key_fprs = [0.01, 0.05, 0.10, 0.20, 0.50]
for target_fpr in key_fprs:
    # Find closest actual FPR
    idx = np.argmin(np.abs(fpr - target_fpr))
    print(f"{fpr[idx]:<10.4f} {tpr[idx]:<10.4f} FPR ≈ {target_fpr:.0%}")

# Illustrate the problem with full AUC
print("\nThe Problem with Full AUC:")
print("-" * 50)
print("If we only care about FPR < 10%, full AUC includes:")
print("  - Relevant regime   (FPR < 0.10): ~10% of the FPR range")
print("  - Irrelevant regime (FPR > 0.10): ~90% of the FPR range")
print("Full AUC is dominated by the operationally irrelevant region!")
```

Partial AUC (pAUC) computes the area under the ROC curve over a restricted FPR range.
Definition:
Given an FPR range $[\alpha, \beta]$ where $0 \leq \alpha < \beta \leq 1$:
$$\text{pAUC}(\alpha, \beta) = \int_{\alpha}^{\beta} \text{TPR}(\text{FPR}) \, d(\text{FPR})$$
Typically, we're interested in low FPR regimes, so $\alpha = 0$ and we specify only $\beta$:
$$\text{pAUC}(0, \beta) = \int_{0}^{\beta} \text{TPR}(\text{FPR}) \, d(\text{FPR})$$
Common choices for $\beta$ (the domain table later on this page gives typical values per application):
- $\beta = 0.01$ to $0.05$ for screening, spam filtering, and security settings where false positives are very costly
- $\beta = 0.10$ to $0.20$ where a moderate false positive rate is operationally acceptable
Range of pAUC:
For the interval $[0, \beta]$, pAUC lies between 0 (TPR = 0 throughout) and $\beta$ (TPR = 1 throughout); a random classifier scores $\beta^2/2$, the area under the diagonal within the interval.
This unnormalized range makes direct interpretation difficult, motivating normalization.
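As a quick worked example with illustrative numbers: for $\beta = 0.1$,

$$\text{pAUC}_{\max} = \beta = 0.1, \qquad \text{pAUC}_{\text{random}} = \frac{\beta^2}{2} = 0.005,$$

so a raw pAUC of 0.07 sits near the top of its possible range even though the number looks small in isolation.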
```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def partial_auc(y_true, y_scores, max_fpr=0.1):
    """
    Compute partial AUC up to a specified FPR threshold.

    Args:
        y_true: Binary labels
        y_scores: Predicted scores (higher = more likely positive)
        max_fpr: Maximum FPR to consider (beta)

    Returns:
        Dict with raw pAUC and reference values (bounds, random baseline)
    """
    fpr, tpr, _ = roc_curve(y_true, y_scores)

    # Find the portion of the curve where FPR <= max_fpr
    stop_idx = np.searchsorted(fpr, max_fpr, side='right')

    # Interpolate to exactly max_fpr if needed
    if stop_idx < len(fpr) and fpr[stop_idx] > max_fpr:
        # Linear interpolation
        if stop_idx > 0:
            t = (max_fpr - fpr[stop_idx - 1]) / (fpr[stop_idx] - fpr[stop_idx - 1])
            tpr_at_max = tpr[stop_idx - 1] + t * (tpr[stop_idx] - tpr[stop_idx - 1])
        else:
            tpr_at_max = tpr[0]
        fpr_partial = np.append(fpr[:stop_idx], max_fpr)
        tpr_partial = np.append(tpr[:stop_idx], tpr_at_max)
    else:
        fpr_partial = fpr[:stop_idx]
        tpr_partial = tpr[:stop_idx]

    # Compute raw pAUC
    if len(fpr_partial) < 2:
        raw_pauc = 0.0
    else:
        raw_pauc = auc(fpr_partial, tpr_partial)

    # Theoretical bounds
    min_pauc = 0                      # TPR = 0 everywhere
    max_pauc = max_fpr                # TPR = 1 everywhere
    random_pauc = max_fpr ** 2 / 2    # Area under diagonal

    return {
        'raw_pauc': raw_pauc,
        'max_fpr': max_fpr,
        'min_possible': min_pauc,
        'max_possible': max_pauc,
        'random_baseline': random_pauc,
        'fpr_partial': fpr_partial,
        'tpr_partial': tpr_partial
    }

# Example
np.random.seed(42)
n_pos, n_neg = 100, 900

# Good model
scores_pos = np.random.normal(0.7, 0.2, n_pos)
scores_neg = np.random.normal(0.3, 0.2, n_neg)
y_true = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])
y_scores = np.concatenate([scores_pos, scores_neg])

print("Partial AUC Analysis")
print("=" * 60)

for max_fpr in [0.05, 0.10, 0.20, 0.50, 1.0]:
    result = partial_auc(y_true, y_scores, max_fpr)
    print(f"\npAUC(0, {max_fpr:.2f}):")
    print(f"  Raw pAUC:        {result['raw_pauc']:.6f}")
    print(f"  Random baseline: {result['random_baseline']:.6f}")
    print(f"  Max possible:    {result['max_possible']:.6f}")
    print(f"  Ratio vs random: {result['raw_pauc'] / result['random_baseline']:.2f}x")
```

Raw pAUC values depend heavily on the chosen β. pAUC(0, 0.05) ≈ 0.04 might be excellent, while pAUC(0, 0.5) = 0.4 might be mediocre. Always report β and consider normalization for comparability.
Raw pAUC is difficult to interpret because its range depends on $\beta$. Several normalization schemes address this.
1. McClish Normalization (Standardized pAUC):
The most common normalization, introduced by McClish (1989), rescales pAUC so that a non-discriminating (chance-level) classifier scores 0.5 and a perfect classifier scores 1.0:
$$\text{pAUC}_{\text{McClish}} = \frac{1}{2} \left( 1 + \frac{\text{pAUC} - \text{pAUC}_{\text{min}}}{\text{pAUC}_{\text{max}} - \text{pAUC}_{\text{min}}} \right)$$
For the range $[0, \beta]$, the bounds are $\text{pAUC}_{\text{min}} = \beta^2/2$ (the area under the chance diagonal) and $\text{pAUC}_{\text{max}} = \beta$ (TPR = 1 throughout). Substituting:

$$\text{pAUC}_{\text{McClish}} = \frac{1}{2} \left( 1 + \frac{\text{pAUC} - \beta^2/2}{\beta - \beta^2/2} \right)$$

A chance-level classifier scores exactly 0.5, a perfect classifier scores 1.0, and values below 0.5 indicate worse-than-chance ranking within the region.
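For reference, scikit-learn exposes this standardization directly: passing `max_fpr` to `roc_auc_score` returns the McClish-standardized pAUC over [0, max_fpr]. A minimal sketch (with illustrative synthetic data, not data from this page) that can be used to cross-check the formula above:

```python
# Sketch: scikit-learn's roc_auc_score returns the McClish-standardized pAUC
# when max_fpr is supplied. Synthetic data for illustration only.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
y_true = np.concatenate([np.ones(100), np.zeros(900)])
y_scores = np.concatenate([rng.normal(0.7, 0.2, 100),
                           rng.normal(0.3, 0.2, 900)])

for beta in [0.05, 0.10]:
    std_pauc = roc_auc_score(y_true, y_scores, max_fpr=beta)
    print(f"McClish-standardized pAUC(0, {beta:.2f}): {std_pauc:.4f}")
```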
2. Simple Normalization:
Divide by the maximum possible pAUC:
$$\text{pAUC}_{\text{simple}} = \frac{\text{pAUC}}{\beta}$$
This has range [0, 1], directly interpretable as the fraction of maximum pAUC achieved.
3. Normalized-by-Random:
Compare to the random classifier:
$$\text{pAUC}_{\text{ratio}} = \frac{\text{pAUC}}{\beta^2/2}$$
This shows how many times better than random the model is in this region.
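For illustration, with $\beta = 0.05$ a random classifier attains $\beta^2/2 = 0.00125$, so a model with pAUC $= 0.03$ has a ratio of $0.03 / 0.00125 = 24$, out of a maximum possible $2/\beta = 40$ for a perfect classifier.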
```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def pauc_with_normalizations(y_true, y_scores, max_fpr=0.1):
    """
    Compute pAUC with various normalization schemes.

    Args:
        y_true: Binary labels
        y_scores: Predicted scores
        max_fpr: Maximum FPR (beta)

    Returns:
        Dict with raw and normalized pAUC values
    """
    fpr, tpr, _ = roc_curve(y_true, y_scores)

    # Interpolate to exact max_fpr
    stop_idx = np.searchsorted(fpr, max_fpr, side='right')
    if stop_idx > 0 and fpr[stop_idx - 1] < max_fpr:
        if stop_idx < len(fpr):
            t = (max_fpr - fpr[stop_idx - 1]) / (fpr[stop_idx] - fpr[stop_idx - 1])
            tpr_interp = tpr[stop_idx - 1] + t * (tpr[stop_idx] - tpr[stop_idx - 1])
            fpr_partial = np.append(fpr[:stop_idx], max_fpr)
            tpr_partial = np.append(tpr[:stop_idx], tpr_interp)
        else:
            fpr_partial = fpr[:stop_idx]
            tpr_partial = tpr[:stop_idx]
    else:
        fpr_partial = fpr[:stop_idx]
        tpr_partial = tpr[:stop_idx]

    raw_pauc = auc(fpr_partial, tpr_partial) if len(fpr_partial) >= 2 else 0.0
    beta = max_fpr

    # Area under the chance diagonal within [0, beta]
    random_pauc = beta ** 2 / 2

    # Normalization schemes
    # 1. Simple: fraction of maximum possible area, range [0, 1]
    simple = raw_pauc / beta

    # 2. McClish standardization: chance = 0.5, perfect = 1.0
    mcclish = 0.5 * (1 + (raw_pauc - random_pauc) / (beta - random_pauc))

    # 3. Ratio to random
    ratio_to_random = raw_pauc / random_pauc if random_pauc > 0 else 0

    # 4. Above-random normalized (similar to Gini)
    #    pAUC - random, normalized by max - random
    above_random = (raw_pauc - random_pauc) / (beta - random_pauc)

    return {
        'raw': raw_pauc,
        'simple_normalized': simple,
        'mcclish_normalized': mcclish,
        'ratio_to_random': ratio_to_random,
        'above_random_normalized': above_random,
        'max_fpr': beta
    }

# Compare normalizations across different FPR ranges
np.random.seed(42)
n_pos, n_neg = 100, 900

scores_pos = np.random.normal(0.7, 0.2, n_pos)
scores_neg = np.random.normal(0.3, 0.2, n_neg)
y_true = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])
y_scores = np.concatenate([scores_pos, scores_neg])

print("pAUC Normalization Comparison")
print("=" * 80)
print(f"{'Max FPR':<10} {'Raw':<10} {'Simple':<10} {'McClish':<10} "
      f"{'×Random':<10} {'Above-Rand':<12}")
print("-" * 80)

for max_fpr in [0.01, 0.05, 0.10, 0.20, 0.50]:
    result = pauc_with_normalizations(y_true, y_scores, max_fpr)
    print(f"{max_fpr:<10.2f} {result['raw']:<10.6f} "
          f"{result['simple_normalized']:<10.4f} "
          f"{result['mcclish_normalized']:<10.4f} "
          f"{result['ratio_to_random']:<10.2f} "
          f"{result['above_random_normalized']:<12.4f}")

print("\nInterpretation:")
print("-" * 80)
print("Raw:        Absolute area under the curve over [0, beta]")
print("Simple:     Fraction of maximum pAUC (range [0, 1])")
print("McClish:    Standardized so chance = 0.5 and perfect = 1.0")
print("×Random:    Multiple of the random classifier's pAUC")
print("Above-Rand: Normalized improvement over random (like Gini for partial range)")
```

| Method | Range | Random Value | Use Case |
|---|---|---|---|
| Raw pAUC | [0, β] | β²/2 | Precise computation, research |
| Simple (pAUC/β) | [0, 1] | β/2 | Intuitive interpretation |
| McClish | [(1−β)/(2−β), 1] | 0.5 | Comparison across β values |
| Ratio to Random | [0, 2/β] | 1.0 | Improvement over baseline |
| Above-Random | [−β/(2−β), 1] | 0 | Like Gini for partial range |
Sometimes we need to constrain both FPR and TPR. Two-Way pAUC (or restricted pAUC) evaluates the ROC curve within a rectangular region.
Definition:
Given FPR range $[\alpha_1, \beta_1]$ and TPR range $[\alpha_2, \beta_2]$:
$$\text{pAUC}_{\text{2way}} = \int_{\substack{\text{FPR} \in [\alpha_1, \beta_1] \\ \text{TPR} \in [\alpha_2, \beta_2]}} \text{TPR}(\text{FPR}) \, d(\text{FPR})$$
Use Cases:
TPR constraint: In fraud detection, we might require TPR ≥ 0.9 (catch 90% of frauds). We only care about the ROC region where this is satisfied.
FPR + TPR constraints: In medical screening, we might need the FPR low enough to limit unnecessary follow-up procedures and the TPR high enough to catch the large majority of true cases.
Only the rectangular region satisfying both is relevant.
Computation:
Two-way pAUC requires identifying the portion of the ROC curve that falls within both constraints, which may be empty if the constraints are unsatisfiable by the model.
```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def two_way_pauc(y_true, y_scores, fpr_range=(0.0, 0.1), tpr_range=(0.8, 1.0)):
    """
    Compute Two-Way Partial AUC.

    Computes area under ROC curve within specified FPR and TPR ranges.

    Args:
        y_true: Binary labels
        y_scores: Predicted scores
        fpr_range: (min_fpr, max_fpr) tuple
        tpr_range: (min_tpr, max_tpr) tuple

    Returns:
        Dict with two-way pAUC and diagnostics
    """
    fpr, tpr, thresholds = roc_curve(y_true, y_scores)

    min_fpr, max_fpr = fpr_range
    min_tpr, max_tpr = tpr_range

    # Find points within both constraints
    mask = (fpr >= min_fpr) & (fpr <= max_fpr) & (tpr >= min_tpr) & (tpr <= max_tpr)

    fpr_filtered = fpr[mask]
    tpr_filtered = tpr[mask]

    if len(fpr_filtered) < 2:
        # Constraints not satisfiable or barely touched
        return {
            'two_way_pauc': 0.0,
            'fpr_range': fpr_range,
            'tpr_range': tpr_range,
            'points_in_region': len(fpr_filtered),
            'feasible': len(fpr_filtered) >= 2
        }

    # Compute area (simple trapezoidal for filtered region)
    raw_area = auc(fpr_filtered, tpr_filtered)

    # Subtract the lower TPR baseline within the FPR range
    # This accounts for the TPR floor
    fpr_width = fpr_filtered[-1] - fpr_filtered[0]
    baseline_area = min_tpr * fpr_width

    # The effective two-way pAUC measures area above the TPR floor
    effective_area = raw_area - baseline_area

    # Maximum possible (TPR = max_tpr throughout the FPR range)
    max_possible = (max_tpr - min_tpr) * (max_fpr - min_fpr)

    normalized = effective_area / max_possible if max_possible > 0 else 0.0

    return {
        'two_way_pauc': raw_area,
        'effective_area': effective_area,
        'normalized': normalized,
        'fpr_range': fpr_range,
        'tpr_range': tpr_range,
        'points_in_region': len(fpr_filtered),
        'feasible': True
    }

# Example: Fraud detection with constraints
np.random.seed(42)
n_fraud = 100
n_legit = 9900

# Model scores
scores_fraud = np.random.normal(0.75, 0.15, n_fraud)
scores_legit = np.random.normal(0.25, 0.2, n_legit)

y_true = np.concatenate([np.ones(n_fraud), np.zeros(n_legit)])
y_scores = np.concatenate([scores_fraud, scores_legit])

print("Two-Way pAUC Analysis")
print("=" * 60)
print("Scenario: Fraud detection")
print("  - Constraint 1: FPR ≤ 5% (can't investigate too many)")
print("  - Constraint 2: TPR ≥ 80% (must catch most fraud)")
print()

# Standard pAUC (one-way)
from sklearn.metrics import roc_auc_score
full_auc = roc_auc_score(y_true, y_scores)
print(f"Full AUC: {full_auc:.4f}")

# One-way pAUC (FPR constraint only)
fpr, tpr, _ = roc_curve(y_true, y_scores)
idx = np.where(fpr <= 0.05)[0]
one_way_pauc = auc(fpr[idx], tpr[idx]) if len(idx) >= 2 else 0
print(f"One-way pAUC (FPR ≤ 5%): {one_way_pauc:.6f}")

# Two-way pAUC
result = two_way_pauc(y_true, y_scores, fpr_range=(0.0, 0.05), tpr_range=(0.8, 1.0))

print(f"\nTwo-way pAUC (FPR ≤ 5%, TPR ≥ 80%):")
print(f"  Raw area in region:   {result['two_way_pauc']:.6f}")
print(f"  Effective area:       {result['effective_area']:.6f}")
print(f"  Normalized:           {result['normalized']:.4f}")
print(f"  Points in region:     {result['points_in_region']}")
print(f"  Constraints feasible: {result['feasible']}")
```

Understanding how pAUC relates to other metrics helps guide metric selection.
pAUC vs. Full AUC:
| Aspect | Full AUC | pAUC |
|---|---|---|
| FPR range | [0, 1] | [α, β] (typically [0, β]) |
| Interpretation | Overall discrimination | Discrimination in target regime |
| Sensitivity to | Entire ROC curve | Only specified region |
| Operational relevance | May include irrelevant regions | Focuses on deployable region |
Key Insight:
Full AUC and pAUC can disagree on model rankings. Model A might have higher full AUC but lower pAUC in the target region, meaning Model B is better where it matters.
pAUC vs. TPR@FPR (Sensitivity at Specificity):
TPR@FPR is a point metric—the TPR at a single FPR threshold. pAUC is an area metric covering a range.
| Metric | What it measures | Pros | Cons |
|---|---|---|---|
| TPR@FPR=0.05 | Single point on ROC | Simple, specific | Single point = high variance |
| pAUC(0, 0.05) | Area under FPR ≤ 5% | Averages over range | Less specific |
When Models Cross:
If two ROC curves cross within the target region, pAUC gives an average view. Point metrics (TPR@FPR) might prefer one model at some thresholds and another at different thresholds.
```python
import numpy as np
from sklearn.metrics import roc_curve, auc, roc_auc_score

def compare_models_pauc_vs_auc(models, y_true, max_fpr=0.05):
    """
    Compare models on full AUC vs pAUC.

    Demonstrates cases where rankings differ.
    """
    results = []

    for name, y_scores in models.items():
        fpr, tpr, _ = roc_curve(y_true, y_scores)
        full_auc = roc_auc_score(y_true, y_scores)

        # pAUC
        idx = np.where(fpr <= max_fpr)[0]
        if len(idx) >= 2:
            pauc = auc(fpr[idx], tpr[idx])
        else:
            pauc = 0.0

        # TPR at target FPR
        idx_at_fpr = np.searchsorted(fpr, max_fpr)
        if idx_at_fpr > 0:
            tpr_at_fpr = tpr[idx_at_fpr - 1]
        else:
            tpr_at_fpr = 0.0

        results.append({
            'name': name,
            'full_auc': full_auc,
            'pauc': pauc,
            'tpr_at_fpr': tpr_at_fpr
        })

    return results

# Create models with different characteristics
np.random.seed(42)
n_pos, n_neg = 200, 1800
y_true = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])

# Model A: Good overall, weaker at low FPR
scores_a_pos = np.random.normal(0.65, 0.2, n_pos)
scores_a_neg = np.random.normal(0.35, 0.2, n_neg)
y_scores_a = np.concatenate([scores_a_pos, scores_a_neg])

# Model B: Slightly lower overall AUC, but stronger at low FPR
# Has a "fat tail" of very high-confidence positives
scores_b_pos = np.concatenate([
    np.random.normal(0.9, 0.05, int(n_pos * 0.2)),   # 20% very high confidence
    np.random.normal(0.5, 0.2, int(n_pos * 0.8))     # 80% moderate
])
scores_b_neg = np.random.normal(0.4, 0.15, n_neg)
y_scores_b = np.concatenate([scores_b_pos, scores_b_neg])

models = {
    'Model A (balanced)': y_scores_a,
    'Model B (high-conf tail)': y_scores_b
}

print("Comparing Models: Full AUC vs pAUC")
print("=" * 70)
print()

for max_fpr in [0.01, 0.05, 0.10]:
    results = compare_models_pauc_vs_auc(models, y_true, max_fpr)

    print(f"Max FPR = {max_fpr:.0%}")
    print("-" * 60)
    print(f"{'Model':<25} {'Full AUC':<12} {'pAUC':<12} {'TPR@FPR':<12}")
    print("-" * 60)

    for r in results:
        print(f"{r['name']:<25} {r['full_auc']:<12.4f} "
              f"{r['pauc']:<12.6f} {r['tpr_at_fpr']:<12.4f}")

    # Determine winner for each metric
    full_auc_winner = max(results, key=lambda x: x['full_auc'])['name']
    pauc_winner = max(results, key=lambda x: x['pauc'])['name']

    print()
    print(f"Full AUC winner: {full_auc_winner.split()[0]} {full_auc_winner.split()[1]}")
    print(f"pAUC winner:     {pauc_winner.split()[0]} {pauc_winner.split()[1]}")
    if full_auc_winner != pauc_winner:
        print("  >>> DIFFERENT WINNERS! pAUC reveals different operational performance.")
    print()
```

Be wary of models marketed on full AUC when you'll operate in a constrained FPR regime. A model with 'state-of-the-art AUC = 0.98' might have pAUC(0, 0.05) worse than a simpler model with full AUC = 0.94. Always evaluate on the metric that reflects your operational constraints.
| Domain | Typical β (max FPR) | Rationale |
|---|---|---|
| Cancer Screening | 0.01 - 0.05 | False alarms cause anxiety, unnecessary biopsies |
| Fraud Detection | 0.01 - 0.10 | Can't manually investigate more than ~1-10% of transactions |
| Security Screening (Airport) | 0.001 - 0.01 | Massive passenger volume; even 1% FPR is millions of delays |
| Spam Filtering | 0.001 - 0.01 | False positives lose important emails |
| Disease Diagnosis | 0.05 - 0.20 | Cost of false positives depends on follow-up procedures |
| Credit Scoring | 0.05 - 0.15 | Regulatory and business constraints on rejection rates |
Case Study: Mammography Screening
In breast cancer screening, almost all screened women are healthy, and a false positive means anxiety, recall imaging, and possibly biopsy. Models are therefore compared on pAUC at very low FPR (roughly β = 0.01 to 0.05), where full AUC can be misleading.
Case Study: Credit Card Fraud
In real-time fraud detection, investigation capacity caps the fraction of transactions that can be flagged, so models compete on the TPR they deliver at FPR of roughly 1% to 10%; pAUC over that range reflects the operating constraint directly.
Case Study: Rare Disease Screening

With extreme class imbalance, even a tiny FPR produces far more false alarms than true cases, so evaluation concentrates on the very lowest FPR portion of the ROC curve.
pAUC has unique statistical properties that affect confidence intervals and significance testing.
Variance of pAUC:
The variance of pAUC depends on:
- The number of positive and negative examples (only negatives scored within the region's FPR range pin down the restricted curve)
- The width of the FPR interval ($\beta - \alpha$)
- The shape of the ROC curve within the interval
Intuition: narrower FPR ranges are estimated from fewer negative examples, so the restricted curve is noisier; on a comparable (normalized) scale, pAUC(0, 0.01) has higher variance than pAUC(0, 0.10).
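The sketch below illustrates this with synthetic data; the helper `simple_pauc` and all numbers are illustrative assumptions. It compares bootstrap standard deviations of simple-normalized pAUC for a narrow and a wider β:

```python
# Sketch: bootstrap spread of simple-normalized pAUC for a narrow vs. wide beta.
# Synthetic data; numbers are illustrative, not from this page.
import numpy as np
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(7)
n_pos, n_neg = 100, 900
y_true = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])
y_scores = np.concatenate([rng.normal(0.7, 0.2, n_pos),
                           rng.normal(0.3, 0.2, n_neg)])

def simple_pauc(y_t, y_s, beta):
    """Raw pAUC over FPR <= beta, divided by beta (simple normalization)."""
    fpr, tpr, _ = roc_curve(y_t, y_s)
    idx = fpr <= beta
    return auc(fpr[idx], tpr[idx]) / beta if idx.sum() >= 2 else 0.0

for beta in [0.01, 0.10]:
    vals = []
    for _ in range(500):
        b = rng.choice(len(y_true), size=len(y_true), replace=True)
        if len(np.unique(y_true[b])) < 2:
            continue
        vals.append(simple_pauc(y_true[b], y_scores[b], beta))
    print(f"beta = {beta:.2f}: bootstrap std of simple pAUC = {np.std(vals):.4f}")
```

On this kind of setup the narrower range typically shows a noticeably larger bootstrap spread, matching the intuition above.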
Confidence Intervals:
Bootstrap is the most common approach:
1. Resample the evaluation set with replacement, keeping labels and scores paired.
2. Recompute pAUC on each bootstrap sample.
3. Take empirical percentiles of the bootstrap distribution (e.g., 2.5th and 97.5th) as the confidence interval.
Significance Testing:
To compare two models on pAUC, use a paired bootstrap: resample the same indices for both models, compute the pAUC difference on each resample, and check whether the resulting confidence interval for the difference excludes zero.
```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def pauc_with_bootstrap_ci(y_true, y_scores, max_fpr=0.1,
                           n_bootstrap=2000, confidence=0.95):
    """
    Compute pAUC with bootstrap confidence interval.

    Args:
        y_true: Binary labels
        y_scores: Predicted scores
        max_fpr: Maximum FPR for partial AUC
        n_bootstrap: Number of bootstrap iterations
        confidence: Confidence level for CI

    Returns:
        Dict with pAUC estimate and confidence interval
    """
    def compute_pauc(y_t, y_s, max_f):
        fpr, tpr, _ = roc_curve(y_t, y_s)
        idx = np.where(fpr <= max_f)[0]
        return auc(fpr[idx], tpr[idx]) if len(idx) >= 2 else 0.0

    # Point estimate
    pauc_estimate = compute_pauc(y_true, y_scores, max_fpr)

    # Bootstrap
    rng = np.random.default_rng(42)
    n = len(y_true)
    bootstrap_paucs = []

    for _ in range(n_bootstrap):
        idx = rng.choice(n, size=n, replace=True)
        y_t_boot = y_true[idx]
        y_s_boot = y_scores[idx]

        # Ensure both classes present
        if len(np.unique(y_t_boot)) < 2:
            continue

        pauc_boot = compute_pauc(y_t_boot, y_s_boot, max_fpr)
        bootstrap_paucs.append(pauc_boot)

    bootstrap_paucs = np.array(bootstrap_paucs)

    alpha = 1 - confidence
    ci_lower = np.percentile(bootstrap_paucs, 100 * alpha / 2)
    ci_upper = np.percentile(bootstrap_paucs, 100 * (1 - alpha / 2))

    return {
        'pauc': pauc_estimate,
        'ci_lower': ci_lower,
        'ci_upper': ci_upper,
        'bootstrap_std': np.std(bootstrap_paucs),
        'max_fpr': max_fpr,
        'n_bootstrap_valid': len(bootstrap_paucs)
    }

def compare_models_pauc_significance(y_true, y_scores_a, y_scores_b,
                                     max_fpr=0.1, n_bootstrap=5000):
    """
    Compare two models' pAUC with significance testing.
    """
    def compute_pauc(y_t, y_s, max_f):
        fpr, tpr, _ = roc_curve(y_t, y_s)
        idx = np.where(fpr <= max_f)[0]
        return auc(fpr[idx], tpr[idx]) if len(idx) >= 2 else 0.0

    pauc_a = compute_pauc(y_true, y_scores_a, max_fpr)
    pauc_b = compute_pauc(y_true, y_scores_b, max_fpr)
    observed_diff = pauc_a - pauc_b

    # Paired bootstrap
    rng = np.random.default_rng(42)
    n = len(y_true)
    boot_diffs = []

    for _ in range(n_bootstrap):
        idx = rng.choice(n, size=n, replace=True)
        y_t_boot = y_true[idx]

        if len(np.unique(y_t_boot)) < 2:
            continue

        pauc_a_boot = compute_pauc(y_t_boot, y_scores_a[idx], max_fpr)
        pauc_b_boot = compute_pauc(y_t_boot, y_scores_b[idx], max_fpr)
        boot_diffs.append(pauc_a_boot - pauc_b_boot)

    boot_diffs = np.array(boot_diffs)
    ci_lower = np.percentile(boot_diffs, 2.5)
    ci_upper = np.percentile(boot_diffs, 97.5)
    significant = ci_lower > 0 or ci_upper < 0

    return {
        'pauc_a': pauc_a,
        'pauc_b': pauc_b,
        'difference': observed_diff,
        'ci_lower': ci_lower,
        'ci_upper': ci_upper,
        'significant_95': significant
    }

# Example
np.random.seed(42)
n_pos, n_neg = 100, 900

scores_pos_a = np.random.normal(0.7, 0.2, n_pos)
scores_neg_a = np.random.normal(0.3, 0.2, n_neg)
y_true = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])
y_scores_a = np.concatenate([scores_pos_a, scores_neg_a])

# Slightly better model
scores_pos_b = np.random.normal(0.72, 0.18, n_pos)
scores_neg_b = np.random.normal(0.28, 0.18, n_neg)
y_scores_b = np.concatenate([scores_pos_b, scores_neg_b])

print("pAUC with Bootstrap Confidence Intervals")
print("=" * 60)

for max_fpr in [0.05, 0.10]:
    result = pauc_with_bootstrap_ci(y_true, y_scores_a, max_fpr)
    print(f"\nMax FPR = {max_fpr:.0%}:")
    print(f"  pAUC = {result['pauc']:.6f}")
    print(f"  95% CI: [{result['ci_lower']:.6f}, {result['ci_upper']:.6f}]")
    print(f"  Bootstrap std: {result['bootstrap_std']:.6f}")

print("\nModel Comparison:")
comparison = compare_models_pauc_significance(y_true, y_scores_a, y_scores_b, 0.1)
print(f"  Model A pAUC: {comparison['pauc_a']:.6f}")
print(f"  Model B pAUC: {comparison['pauc_b']:.6f}")
print(f"  Difference (A - B): {comparison['difference']:.6f}")
print(f"  95% CI: [{comparison['ci_lower']:.6f}, {comparison['ci_upper']:.6f}]")
print(f"  Significant: {comparison['significant_95']}")
```

We have established pAUC as the essential metric for evaluating classifiers in constrained operating regimes. Let's consolidate our understanding:
Mathematical Summary:
$$\text{pAUC}(0, \beta) = \int_{0}^{\beta} \text{TPR}(\text{FPR}) \, d(\text{FPR})$$
Normalizations:
- Simple: pAUC / β (fraction of the maximum possible area)
- McClish: ½ (1 + (pAUC − β²/2) / (β − β²/2)), so chance scores 0.5 and a perfect classifier scores 1.0
- Ratio to random: pAUC / (β²/2), the multiple of the random baseline
Module Complete:
With pAUC, we conclude our exploration of ranking metrics. You now have a comprehensive toolkit for evaluating ranked systems:
| Metric | Best For |
|---|---|
| P@k, R@k | Simple cutoff evaluation |
| MAP | Binary relevance, all positions |
| NDCG | Graded relevance |
| MRR | Single-answer retrieval |
| pAUC | Constrained FPR operation |
You have completed Module 5: Ranking Metrics. You now possess a comprehensive understanding of how to evaluate ranked retrieval, recommendation, and classification systems—from simple Precision@k to sophisticated partial AUC. Choose metrics that align with your operational constraints and always report with statistical rigor.