Just as AUC summarizes the ROC curve, Average Precision (AP) summarizes the Precision-Recall curve as a single scalar value, giving you one number for comparing models and tracking performance over time.
But AP is more nuanced than AUC. There are multiple valid definitions, each with different properties. Understanding these subtleties is crucial for correct application and interpretation.
By the end of this page, you will understand the intuitive meaning of Average Precision, its formal mathematical definitions, different computation methods and their tradeoffs, variants like AP@k and mAP, and practical guidance on interpretation and use.
Before diving into formulas, let's build intuition for what Average Precision measures.
Imagine you rank all examples by their classifier score, highest first, and process them one by one. Each time you encounter a positive, record the precision at that point (positives seen so far divided by examples seen so far).
Average Precision is the mean of these recorded precisions.
This means AP asks: "When I find a positive, how good is my precision at that moment?"
Consider two classifiers:
Classifier A ranks positives early:
```
Rank:   1  2  3  4  5  6  7  8  9  10
Label:  +  +  +  -  -  +  -  -  -  -     (4 positives)
```
When encountering positives (at ranks 1, 2, 3, and 6), the precisions are 1/1 = 1.00, 2/2 = 1.00, 3/3 = 1.00, and 4/6 ≈ 0.67.
AP = (1.00 + 1.00 + 1.00 + 0.67) / 4 = 0.917
Classifier B ranks positives late:
```
Rank:   1  2  3  4  5  6  7  8  9  10
Label:  -  -  -  +  -  -  +  +  -  +     (4 positives)
```
When encountering positives (at ranks 4, 7, 8, and 10), the precisions are 1/4 = 0.25, 2/7 ≈ 0.29, 3/8 ≈ 0.38, and 4/10 = 0.40.
AP = (0.25 + 0.29 + 0.38 + 0.40) / 4 = 0.330
AP rewards classifiers that rank positives before negatives. The earlier positives appear in the ranking, the higher the precision when they're encountered, and the higher the AP. A perfect classifier (all positives ranked first) achieves AP = 1.0.
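As a quick check, here is a minimal sketch (assuming scikit-learn is available) that reproduces the two toy rankings above, using descending dummy scores so the sort order matches the listed ranks:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Rank-ordered labels for the two toy classifiers above (1 = positive)
labels_a = np.array([1, 1, 1, 0, 0, 1, 0, 0, 0, 0])
labels_b = np.array([0, 0, 0, 1, 0, 0, 1, 1, 0, 1])

# Descending dummy scores so the ranking matches the listed order
scores = np.arange(10, 0, -1)

print(f"Classifier A AP: {average_precision_score(labels_a, scores):.3f}")  # ~0.92
print(f"Classifier B AP: {average_precision_score(labels_b, scores):.3f}")  # ~0.33
```

(Classifier B comes out as 0.328 here; the worked example above rounds each precision before averaging, giving 0.330.)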
Average Precision has several equivalent and approximately equivalent formulations. Understanding these helps when implementing or interpreting different libraries' results.
The most intuitive definition:
$$\text{AP} = \frac{1}{P} \sum_{k=1}^{n} \text{Precision}(k) \cdot \mathbb{1}[y_k = 1]$$
where P is the total number of positives, n is the number of ranked examples, Precision(k) is the precision over the top k examples, y_k is the label at rank k, and $\mathbb{1}[\cdot]$ is the indicator function.
In words: Sum the precision values at positions where positives occur, divide by total positives.
Equivalently, AP is the area under the PR curve:
$$\text{AP} = \int_0^1 P(R) \, dR$$
where P(R) is precision as a function of recall.
Another equivalent form:
$$\text{AP} = \sum_{k=1}^{n} \text{Precision}(k) \cdot \Delta\text{Recall}(k)$$
where ΔRecall(k) = Recall(k) - Recall(k-1) = 1/P if y_k = 1, else 0.
Interpretation: Each positive contributes its precision, weighted by the recall increment (1/P for each positive).
```python
import numpy as np

def average_precision_manual(scores, labels):
    """
    Compute Average Precision from scratch.

    Definition: Mean precision at positions where positives occur.
    """
    scores = np.array(scores)
    labels = np.array(labels)

    # Number of positives
    P = np.sum(labels == 1)
    if P == 0:
        return 0.0

    # Sort by descending score
    sorted_indices = np.argsort(-scores)
    sorted_labels = labels[sorted_indices]

    # Compute AP: accumulate precision at each positive
    tp = 0
    precision_sum = 0.0
    for i, label in enumerate(sorted_labels):
        if label == 1:
            tp += 1
            precision_at_k = tp / (i + 1)
            precision_sum += precision_at_k

    return precision_sum / P

def average_precision_integral(precision, recall):
    """
    Compute AP as the area under the PR curve using step-function
    integration: sum of precision times the recall increment.

    Note: This matches sklearn's average_precision_score when the
    precision/recall points come from precision_recall_curve.
    """
    # Sort points by increasing recall (they should already be sorted)
    sorted_indices = np.argsort(recall)
    recall = np.array(recall)[sorted_indices]
    precision = np.array(precision)[sorted_indices]

    # AP = sum(precision * diff(recall)), where diff prepends a 0
    recall_diff = np.diff(np.concatenate([[0], recall]))
    ap = np.sum(precision * recall_diff)
    return ap

# Example
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 0, 1, 0, 1, 0]  # 3 positives

ap = average_precision_manual(scores, labels)
print(f"Average Precision: {ap:.4f}")

# Trace through:
# Position 1 (score=0.9, label=1): Prec = 1/1 = 1.00 ✓
# Position 2 (score=0.8, label=0): Skip
# Position 3 (score=0.7, label=1): Prec = 2/3 = 0.67 ✓
# Position 4 (score=0.6, label=0): Skip
# Position 5 (score=0.5, label=1): Prec = 3/5 = 0.60 ✓
# Position 6 (score=0.4, label=0): Skip

# AP = (1.00 + 0.67 + 0.60) / 3 = 0.756
```

All three definitions produce the same result when applied to the raw (non-interpolated) PR curve. Differences arise only when interpolation is used, as we'll see in the variants section.
Different communities have used different interpolation methods, leading to slightly different AP values. Understanding these variants is crucial for reproducibility.
The most straightforward approach uses all observed precision-recall points:
$$\text{AP}_{\text{all}} = \frac{1}{P} \sum_{k:\, y_k = 1} \text{Precision}(k)$$
This is what scikit-learn's average_precision_score computes.
Historically, information retrieval used interpolation at 11 recall levels:
$$\text{AP}_{11} = \frac{1}{11} \sum_{r \in \{0, 0.1, \ldots, 1.0\}} P_{\text{interp}}(r)$$
where $P_{\text{interp}}(r) = \max_{r' \geq r} P(r')$ — the maximum precision at any recall ≥ r.
Modern object detection uses interpolation at all unique recall levels:
$$\text{AP}_{\text{interp}} = \sum_{i} (R_i - R_{i-1}) \cdot P_{\text{interp}}(R_i)$$
where $P_{\text{interp}}(R_i) = \max_{R \geq R_i} P(R)$.
```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

def ap_all_points(scores, labels):
    """
    AP without interpolation (sklearn's method).
    """
    return average_precision_score(labels, scores)

def ap_11_point(scores, labels):
    """
    11-point interpolated AP (legacy TREC / VOC pre-2010).
    """
    precision, recall, _ = precision_recall_curve(labels, scores)
    ap = 0.0
    for r in np.linspace(0, 1, 11):
        # Max precision at recall >= r
        prec_at_r = precision[recall >= r]
        if len(prec_at_r) > 0:
            ap += np.max(prec_at_r)
    return ap / 11

def ap_all_point_interp(scores, labels):
    """
    All-point interpolated AP (VOC 2010+, COCO).

    At each recall level, use max precision at any recall >= that level.
    Then integrate.
    """
    precision, recall, _ = precision_recall_curve(labels, scores)

    # Sort by recall
    order = np.argsort(recall)
    recall = recall[order]
    precision = precision[order]

    # Compute interpolated precision (monotonically decreasing):
    # p_interp[i] = max(p[i:])
    precision_interp = np.maximum.accumulate(precision[::-1])[::-1]

    # Compute area under the interpolated curve
    recall_diff = np.diff(np.concatenate([[0], recall]))
    ap = np.sum(precision_interp * recall_diff)
    return ap

# Compare methods
np.random.seed(42)
n = 200
pos_scores = np.random.beta(5, 2, 20)    # 20 positives
neg_scores = np.random.beta(2, 5, 180)   # 180 negatives
scores = np.concatenate([pos_scores, neg_scores])
labels = np.array([1]*20 + [0]*180)

print("AP Computation Methods Comparison:")
print(f"  All-points (sklearn):      {ap_all_points(scores, labels):.4f}")
print(f"  11-point interpolation:    {ap_11_point(scores, labels):.4f}")
print(f"  All-point interpolation:   {ap_all_point_interp(scores, labels):.4f}")
```

| Benchmark/Library | Method | Notes |
|---|---|---|
| scikit-learn | All-points, no interpolation | Trapezoidal integration on raw points |
| TREC (early) | 11-point interpolation | Standard recall levels |
| Pascal VOC 2007 | 11-point interpolation | Object detection |
| Pascal VOC 2010+ | All-point interpolation | More accurate for ranking |
| MS COCO | 101-point interpolation | At recall 0, 0.01, ..., 1.00 |
| ImageNet | All-point interpolation | Similar to VOC 2010+ |
When comparing to published results, ensure you use the SAME AP computation method as the benchmark. A model might show AP = 0.45 with one method and AP = 0.42 with another. Always check the evaluation protocol.
Both AP and ROC-AUC summarize classifier performance as a scalar, but they measure different things and behave differently.
| Property | AP (PR-AUC) | AUC-ROC |
|---|---|---|
| Underlying curve | Precision vs Recall | TPR vs FPR |
| Sensitive to class imbalance? | Yes — decreases with more negatives | No — invariant |
| Random baseline | ≈ base rate (varies) | 0.5 (fixed) |
| Focus | Positive predictions quality | Ranking/discrimination |
| When scores matter most | Among positives and neighbors | Across entire score range |
| Affected by true negatives? | No (ignores TN) | Yes (via FPR) |
AP and AUC often tell different stories, especially with imbalanced data:
Scenario: Rare Event Detection (1% positive rate)
Suppose a classifier achieves a strong AUC-ROC but a much lower AP on this data.
Interpretation: the classifier ranks positives above negatives well overall (hence the strong AUC), but because negatives outnumber positives 99 to 1, even a small false positive rate pushes many negatives into the top of the ranking, so precision on positive predictions stays modest (hence the lower AP).
The AUC looks at all pairs equally; AP focuses on what happens when you make positive predictions (which is often what matters operationally).
There's no exact formula relating AP to AUC, but bounds exist:
$$\text{AP} \geq \frac{\pi}{\pi + (1-\pi) \cdot \text{LR}_{\text{min}}}$$
where π = base rate and LR_min = minimum likelihood ratio.
In practice, the AP that corresponds to a given AUC depends strongly on the base rate: the rarer the positive class, the lower the AP tends to be for the same discrimination ability, as the demonstration below shows.
```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def demonstrate_ap_auc_divergence():
    """
    Show how AP and AUC diverge under class imbalance.
    """
    np.random.seed(42)

    # Fixed discrimination (same separation between class score distributions)
    pos_mean, neg_mean = 0.7, 0.3
    std = 0.2

    results = []
    for pos_rate in [0.50, 0.20, 0.10, 0.05, 0.01]:
        n_total = 10000
        n_pos = int(n_total * pos_rate)
        n_neg = n_total - n_pos

        pos_scores = np.clip(np.random.normal(pos_mean, std, n_pos), 0, 1)
        neg_scores = np.clip(np.random.normal(neg_mean, std, n_neg), 0, 1)

        scores = np.concatenate([pos_scores, neg_scores])
        labels = np.array([1]*n_pos + [0]*n_neg)

        auc = roc_auc_score(labels, scores)
        ap = average_precision_score(labels, scores)

        results.append({
            'pos_rate': pos_rate,
            'auc': auc,
            'ap': ap,
            'ratio': ap / auc
        })

    print("Positive Rate | AUC-ROC | Avg Prec | AP/AUC Ratio")
    print("-" * 55)
    for r in results:
        print(f"    {r['pos_rate']:>4.0%}      |  {r['auc']:.3f}  |  {r['ap']:.3f}   |     {r['ratio']:.2f}")

demonstrate_ap_auc_divergence()
```

For imbalanced problems, always report BOTH metrics. AUC tells stakeholders about discrimination ability; AP tells them about operational precision when making positive predictions. Together, they provide a complete picture.
In multi-class or multi-label settings, Mean Average Precision (mAP) extends AP to aggregate performance across classes.
$$\text{mAP} = \frac{1}{C} \sum_{c=1}^{C} \text{AP}_c$$
where C is the number of classes and AP_c is the Average Precision for class c.
Object Detection: Each object class (car, person, bicycle, ...) has its own AP. mAP summarizes detection performance across all classes.
Multi-Label Classification: Each label is treated as a separate binary classification. mAP aggregates across labels.
Information Retrieval: Each query has its own AP (for retrieved documents). mAP aggregates across queries.
```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(y_true, y_scores, class_axis=1):
    """
    Compute Mean Average Precision for multi-label classification.

    Parameters:
    -----------
    y_true : array of shape (n_samples, n_classes)
        Binary ground truth labels
    y_scores : array of shape (n_samples, n_classes)
        Predicted scores for each class

    Returns:
    --------
    mAP : float
        Mean Average Precision
    per_class_ap : array
        AP for each class
    """
    y_true = np.array(y_true)
    y_scores = np.array(y_scores)
    n_classes = y_true.shape[1]

    per_class_ap = []
    for c in range(n_classes):
        if np.sum(y_true[:, c]) > 0:  # Skip classes with no positives
            ap_c = average_precision_score(y_true[:, c], y_scores[:, c])
            per_class_ap.append(ap_c)

    mAP = np.mean(per_class_ap)
    return mAP, np.array(per_class_ap)

# Example: Multi-label classification with 5 classes
np.random.seed(42)
n_samples = 500
n_classes = 5

# Simulate predictions (some classes harder than others)
y_true = np.random.binomial(1, 0.1, (n_samples, n_classes))  # Sparse labels
base_scores = np.random.rand(n_samples, n_classes)
# Make scores correlate with true labels
y_scores = base_scores + y_true * np.random.uniform(0.2, 0.5, n_classes)
y_scores = np.clip(y_scores, 0, 1)

mAP, per_class_ap = mean_average_precision(y_true, y_scores)

print(f"Mean Average Precision: {mAP:.4f}")
print("\nPer-class AP:")
for i, ap in enumerate(per_class_ap):
    print(f"  Class {i}: {ap:.4f}")
```

In object detection (COCO, VOC), mAP also depends on IoU (Intersection over Union) thresholds. COCO reports mAP@[.5:.95] — the mAP averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05. This provides a stricter evaluation than mAP@0.5 alone.
Sometimes we only care about performance in the top-k predictions. Ranking metrics derived from AP address this.
Simply the precision in the top k predictions:
$$\text{Precision@}k = \frac{|{\text{positives in top } k}|}{k}$$
Use case: Search results — users typically only view the first page (k ≈ 10).
Fraction of all positives found in top k:
$$\text{Recall@}k = \frac{|{\text{positives in top } k}|}{P}$$
Use case: Ensuring coverage — finding most relevant items in top results.
AP computed only over the first k predictions:
$$\text{AP@}k = \frac{1}{\min(P, k)} \sum_{i=1}^{k} \text{Precision}(i) \cdot \mathbb{1}[y_i = 1]$$
Note: The denominator is min(P, k) because the top k can contain at most min(P, k) positives: no more than exist in total (P), and no more than the list length (k).
```python
import numpy as np

def precision_at_k(scores, labels, k):
    """Precision in top k predictions."""
    sorted_indices = np.argsort(-scores)
    top_k_labels = labels[sorted_indices[:k]]
    return np.sum(top_k_labels) / k

def recall_at_k(scores, labels, k):
    """Fraction of positives found in top k."""
    P = np.sum(labels)
    if P == 0:
        return 0.0
    sorted_indices = np.argsort(-scores)
    top_k_labels = labels[sorted_indices[:k]]
    return np.sum(top_k_labels) / P

def average_precision_at_k(scores, labels, k):
    """Average Precision restricted to top k."""
    sorted_indices = np.argsort(-scores)
    sorted_labels = labels[sorted_indices]
    P = np.sum(labels)
    if P == 0:
        return 0.0

    tp = 0
    precision_sum = 0.0
    for i in range(min(k, len(sorted_labels))):
        if sorted_labels[i] == 1:
            tp += 1
            precision_sum += tp / (i + 1)

    return precision_sum / min(P, k)

# Example: Recommendation system
np.random.seed(42)
n_items = 100
n_relevant = 10  # 10 items are relevant

scores = np.random.rand(n_items)
labels = np.zeros(n_items)
relevant_indices = np.random.choice(n_items, n_relevant, replace=False)
labels[relevant_indices] = 1

# Bias scores toward relevant items (imperfect model)
scores[relevant_indices] += np.random.uniform(0.2, 0.6, n_relevant)
scores = np.clip(scores, 0, 1)

print("Ranking Metrics at Different k Values:")
print("-" * 50)
print(f"{'k':<5} | {'P@k':<8} | {'R@k':<8} | {'AP@k':<8}")
print("-" * 50)

for k in [5, 10, 20, 50]:
    p_at_k = precision_at_k(scores, labels, k)
    r_at_k = recall_at_k(scores, labels, k)
    ap_at_k = average_precision_at_k(scores, labels, k)
    print(f"{k:<5} | {p_at_k:<8.3f} | {r_at_k:<8.3f} | {ap_at_k:<8.3f}")

print(f"\nFull AP: {average_precision_at_k(scores, labels, n_items):.3f}")
```

Choose k based on your application: How many results/predictions will users actually see or act on? For search: k ≈ 10-20. For recommendation: k might be 5-10. For automated pipelines: k might equal your processing capacity.
Like any sample statistic, AP has variance and sampling distribution properties that affect interpretation.
AP variance increases as the number of positives in the test set shrinks, whether because the overall sample is small or because the base rate is low.
Bootstrap resampling provides robust confidence intervals:
```python
import numpy as np
from sklearn.metrics import average_precision_score
from sklearn.utils import resample

def ap_bootstrap_ci(scores, labels, n_bootstrap=1000, confidence_level=0.95,
                    stratified=True):
    """
    Compute AP confidence interval via bootstrap.

    Parameters:
    -----------
    scores : array-like
        Classifier scores
    labels : array-like
        True binary labels
    n_bootstrap : int
        Number of bootstrap samples
    confidence_level : float
        Confidence level (e.g., 0.95 for 95% CI)
    stratified : bool
        Whether to maintain class balance in resampling

    Returns:
    --------
    dict with ap_point, ci_lower, ci_upper, se
    """
    scores = np.array(scores)
    labels = np.array(labels)

    ap_point = average_precision_score(labels, scores)

    bootstrap_aps = []
    for _ in range(n_bootstrap):
        if stratified:
            idx = resample(range(len(labels)), stratify=labels, replace=True)
        else:
            idx = resample(range(len(labels)), replace=True)
        ap_boot = average_precision_score(labels[idx], scores[idx])
        bootstrap_aps.append(ap_boot)

    bootstrap_aps = np.array(bootstrap_aps)
    alpha = 1 - confidence_level
    ci_lower = np.percentile(bootstrap_aps, 100 * alpha / 2)
    ci_upper = np.percentile(bootstrap_aps, 100 * (1 - alpha / 2))
    se = np.std(bootstrap_aps)

    return {
        'ap_point': ap_point,
        'ci_lower': ci_lower,
        'ci_upper': ci_upper,
        'se': se,
        'bootstrap_aps': bootstrap_aps
    }

# Example with different sample sizes
np.random.seed(42)

print("AP Confidence Intervals vs Sample Size:")
print("-" * 60)
print(f"{'N':<8} | {'N_pos':<8} | {'AP':<8} | {'95% CI':<18} | {'SE':<8}")
print("-" * 60)

for n in [100, 500, 2000]:
    n_pos = int(n * 0.1)  # 10% positive rate
    pos_scores = np.random.beta(5, 2, n_pos)
    neg_scores = np.random.beta(2, 5, n - n_pos)
    scores = np.concatenate([pos_scores, neg_scores])
    labels = np.array([1]*n_pos + [0]*(n - n_pos))

    result = ap_bootstrap_ci(scores, labels, n_bootstrap=500)
    ci_str = f"[{result['ci_lower']:.3f}, {result['ci_upper']:.3f}]"
    print(f"{n:<8} | {n_pos:<8} | {result['ap_point']:.3f}   | {ci_str:<18} | {result['se']:.4f}")
```

To test if one classifier has significantly higher AP than another, use paired bootstrap testing:
```python
import numpy as np
from sklearn.metrics import average_precision_score
from sklearn.utils import resample

def ap_paired_bootstrap_test(scores1, scores2, labels, n_bootstrap=1000):
    """
    Test if classifier 1 has significantly higher AP than classifier 2.
    Uses paired bootstrap on the same data.

    Returns:
    --------
    dict with observed difference, p-value, and CI for difference
    """
    scores1 = np.asarray(scores1)
    scores2 = np.asarray(scores2)
    labels = np.asarray(labels)

    ap1 = average_precision_score(labels, scores1)
    ap2 = average_precision_score(labels, scores2)
    observed_diff = ap1 - ap2

    bootstrap_diffs = []
    for _ in range(n_bootstrap):
        # Resample the same indices for both classifiers (paired bootstrap)
        idx = resample(range(len(labels)), stratify=labels)
        ap1_boot = average_precision_score(labels[idx], scores1[idx])
        ap2_boot = average_precision_score(labels[idx], scores2[idx])
        bootstrap_diffs.append(ap1_boot - ap2_boot)

    bootstrap_diffs = np.array(bootstrap_diffs)

    # Two-sided p-value: proportion of bootstrap diffs on wrong side of 0
    if observed_diff >= 0:
        p_value = 2 * np.mean(bootstrap_diffs <= 0)
    else:
        p_value = 2 * np.mean(bootstrap_diffs >= 0)
    p_value = min(p_value, 1.0)

    return {
        'ap1': ap1,
        'ap2': ap2,
        'diff': observed_diff,
        'diff_ci_lower': np.percentile(bootstrap_diffs, 2.5),
        'diff_ci_upper': np.percentile(bootstrap_diffs, 97.5),
        'p_value': p_value,
        'significant': p_value < 0.05
    }
```

With small test sets (especially few positives), AP confidence intervals are wide. An observed AP difference of 0.05 might not be significant with 50 positives, but highly significant with 500. Always report uncertainty.
Understanding what AP values mean in practice helps communicate results to stakeholders.
Unlike AUC where 0.9 is "excellent" universally, AP values depend heavily on base rate:
| Base Rate | "Random" AP | "Good" AP | "Excellent" AP |
|---|---|---|---|
| 50% | ~0.50 | 0.75+ | 0.90+ |
| 10% | ~0.10 | 0.40+ | 0.70+ |
| 1% | ~0.01 | 0.20+ | 0.50+ |
| 0.1% | ~0.001 | 0.05+ | 0.20+ |
Key insight: You must compare AP to the base rate. AP = 0.30 with 1% positives is excellent; with 30% positives, it's close to random.
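To see why the "random" column tracks the base rate, here is a small simulation sketch, assuming uninformative (uniform random) scores; exact values will vary with the random seed:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# With uninformative scores, AP fluctuates around the positive base rate
rng = np.random.default_rng(0)
n = 20000

for base_rate in [0.5, 0.1, 0.01]:
    labels = (rng.random(n) < base_rate).astype(int)  # labels at the target base rate
    scores = rng.random(n)                            # scores carry no signal
    ap = average_precision_score(labels, scores)
    print(f"base rate {base_rate:>5.1%} -> random-classifier AP ≈ {ap:.3f}")
```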
AP can be interpreted relative to random via "lift":
$$\text{Lift} = \frac{\text{AP}}{\text{Base Rate}}$$
Example: With a 2% base rate and AP = 0.40, the lift is 0.40 / 0.02 = 20, meaning the classifier delivers about 20 times the precision of random ordering.
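A trivial sketch of that arithmetic, using the numbers above:

```python
base_rate = 0.02   # 2% positives
ap = 0.40          # measured Average Precision

lift = ap / base_rate
print(f"Lift over random: {lift:.0f}x")  # 20x
```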
AP ≈ the average precision you'd observe if you processed positive predictions in score order. So AP = 0.40 roughly means: 'If I work through predictions from highest to lowest score, I'll be right about 40% of the time on average.' This operational framing often resonates with business stakeholders.
We've developed a comprehensive understanding of Average Precision, from intuition to computation methods to practical interpretation.
What's next:
The next page covers Curve Comparison—systematic methods for comparing ROC and PR curves between classifiers, including visual comparison techniques, statistical tests, and decision frameworks for model selection when curves cross.
You now understand Average Precision thoroughly: its meaning, computation methods, relationship to AUC, extensions like mAP and AP@k, and practical interpretation. You can confidently use AP for model evaluation on imbalanced data.