Just as AUC summarizes the ROC curve, Average Precision (AP) summarizes the Precision-Recall curve as a single scalar value, giving you one number for comparing models and tracking performance over time.
But AP is more nuanced than AUC. There are multiple valid definitions, each with different properties. Understanding these subtleties is crucial for correct application and interpretation.
By the end of this page, you will understand the intuitive meaning of Average Precision, its formal mathematical definitions, different computation methods and their tradeoffs, variants like AP@k and mAP, and practical guidance on interpretation and use.
Before diving into formulas, let's build intuition for what Average Precision measures.
Imagine you rank all examples by their classifier score, highest first, and process them one by one. Each time you encounter a positive, record the precision at that point (positives seen so far divided by examples seen so far).
Average Precision is the mean of these recorded precisions.
This means AP asks: "When I find a positive, how good is my precision at that moment?"
Consider two classifiers:
Classifier A ranks positives early:
```
Rank:   1  2  3  4  5  6  7  8  9  10
Label:  +  +  +  -  -  +  -  -  -  -     (4 positives)
```
When encountering positives (at ranks 1, 2, 3, and 6), the precisions are 1/1 = 1.00, 2/2 = 1.00, 3/3 = 1.00, and 4/6 ≈ 0.67.
AP = (1.00 + 1.00 + 1.00 + 0.67) / 4 = 0.917
Classifier B ranks positives late:
```
Rank:   1  2  3  4  5  6  7  8  9  10
Label:  -  -  -  +  -  -  +  +  -  +     (4 positives)
```
When encountering positives (at ranks 4, 7, 8, and 10), the precisions are 1/4 = 0.25, 2/7 ≈ 0.29, 3/8 ≈ 0.38, and 4/10 = 0.40.
AP = (0.25 + 0.29 + 0.38 + 0.40) / 4 = 0.330
AP rewards classifiers that rank positives before negatives. The earlier positives appear in the ranking, the higher the precision when they're encountered, and the higher the AP. A perfect classifier (all positives ranked first) achieves AP = 1.0.
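As a quick check, here is a minimal sketch (assuming scikit-learn is available) that reproduces the two toy rankings above, using descending dummy scores so the sort order matches the listed ranks:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Rank-ordered labels for the two toy classifiers above (1 = positive)
labels_a = np.array([1, 1, 1, 0, 0, 1, 0, 0, 0, 0])
labels_b = np.array([0, 0, 0, 1, 0, 0, 1, 1, 0, 1])

# Descending dummy scores so the ranking matches the listed order
scores = np.arange(10, 0, -1)

print(f"Classifier A AP: {average_precision_score(labels_a, scores):.3f}")  # ~0.92
print(f"Classifier B AP: {average_precision_score(labels_b, scores):.3f}")  # ~0.33
```

(Classifier B comes out as 0.328 here; the worked example above rounds each precision before averaging, giving 0.330.)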
Average Precision has several equivalent and approximately equivalent formulations. Understanding these helps when implementing or interpreting different libraries' results.
The most intuitive definition:
$$\text{AP} = \frac{1}{P} \sum_{k=1}^{n} \text{Precision}(k) \cdot \mathbb{1}[y_k = 1]$$
where P is the total number of positives, n is the number of ranked examples, Precision(k) is the precision over the top k examples, y_k is the label at rank k, and $\mathbb{1}[\cdot]$ is the indicator function.
In words: Sum the precision values at positions where positives occur, divide by total positives.
Equivalently, AP is the area under the PR curve:
$$\text{AP} = \int_0^1 P(R) \, dR$$
where P(R) is precision as a function of recall.
Another equivalent form:
$$\text{AP} = \sum_{k=1}^{n} \text{Precision}(k) \cdot \Delta\text{Recall}(k)$$
where ΔRecall(k) = Recall(k) - Recall(k-1) = 1/P if y_k = 1, else 0.
Interpretation: Each positive contributes its precision, weighted by the recall increment (1/P for each positive).
```python
import numpy as np

def average_precision_manual(scores, labels):
    """
    Compute Average Precision from scratch.

    Definition: Mean precision at positions where positives occur.
    """
    scores = np.array(scores)
    labels = np.array(labels)

    # Number of positives
    P = np.sum(labels == 1)
    if P == 0:
        return 0.0

    # Sort by descending score
    sorted_indices = np.argsort(-scores)
    sorted_labels = labels[sorted_indices]

    # Compute AP: accumulate precision at each positive
    tp = 0
    precision_sum = 0.0
    for i, label in enumerate(sorted_labels):
        if label == 1:
            tp += 1
            precision_at_k = tp / (i + 1)
            precision_sum += precision_at_k

    return precision_sum / P

def average_precision_integral(precision, recall):
    """
    Compute AP as the area under the PR curve using step-function
    integration: sum of precision times the recall increment.

    Note: This matches sklearn's average_precision_score when the
    precision/recall points come from precision_recall_curve.
    """
    # Sort points by increasing recall (they should already be sorted)
    sorted_indices = np.argsort(recall)
    recall = np.array(recall)[sorted_indices]
    precision = np.array(precision)[sorted_indices]

    # AP = sum(precision * diff(recall)), where diff prepends a 0
    recall_diff = np.diff(np.concatenate([[0], recall]))
    ap = np.sum(precision * recall_diff)
    return ap

# Example
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 0, 1, 0, 1, 0]  # 3 positives

ap = average_precision_manual(scores, labels)
print(f"Average Precision: {ap:.4f}")

# Trace through:
# Position 1 (score=0.9, label=1): Prec = 1/1 = 1.00 ✓
# Position 2 (score=0.8, label=0): Skip
# Position 3 (score=0.7, label=1): Prec = 2/3 = 0.67 ✓
# Position 4 (score=0.6, label=0): Skip
# Position 5 (score=0.5, label=1): Prec = 3/5 = 0.60 ✓
# Position 6 (score=0.4, label=0): Skip

# AP = (1.00 + 0.67 + 0.60) / 3 = 0.756
```

All three definitions produce the same result when applied to the raw (non-interpolated) PR curve. Differences arise only when interpolation is used, as we'll see in the variants section.
Different communities have used different interpolation methods, leading to slightly different AP values. Understanding these variants is crucial for reproducibility.
The most straightforward approach uses all observed precision-recall points:
$$\text{AP}_{\text{all}} = \frac{1}{P} \sum_{k:\, y_k = 1} \text{Precision}(k)$$
This is what scikit-learn's average_precision_score computes.
Historically, information retrieval used interpolation at 11 recall levels:
$$\text{AP}_{11} = \frac{1}{11} \sum_{r \in \{0, 0.1, \ldots, 1.0\}} P_{\text{interp}}(r)$$
where $P_{\text{interp}}(r) = \max_{r' \geq r} P(r')$ — the maximum precision at any recall ≥ r.
Modern object detection uses interpolation at all unique recall levels:
$$\text{AP}_{\text{interp}} = \sum_{i} (R_i - R_{i-1}) \cdot P_{\text{interp}}(R_i)$$
where $P_{\text{interp}}(R_i) = \max_{R \geq R_i} P(R)$.
```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

def ap_all_points(scores, labels):
    """
    AP without interpolation (sklearn's method).
    """
    return average_precision_score(labels, scores)

def ap_11_point(scores, labels):
    """
    11-point interpolated AP (legacy TREC / VOC pre-2010).
    """
    precision, recall, _ = precision_recall_curve(labels, scores)
    ap = 0.0
    for r in np.linspace(0, 1, 11):
        # Max precision at recall >= r
        prec_at_r = precision[recall >= r]
        if len(prec_at_r) > 0:
            ap += np.max(prec_at_r)
    return ap / 11

def ap_all_point_interp(scores, labels):
    """
    All-point interpolated AP (VOC 2010+, COCO).

    At each recall level, use max precision at any recall >= that level.
    Then integrate.
    """
    precision, recall, _ = precision_recall_curve(labels, scores)

    # Sort by recall
    order = np.argsort(recall)
    recall = recall[order]
    precision = precision[order]

    # Compute interpolated precision (monotonically decreasing):
    # p_interp[i] = max(p[i:])
    precision_interp = np.maximum.accumulate(precision[::-1])[::-1]

    # Compute area under the interpolated curve
    recall_diff = np.diff(np.concatenate([[0], recall]))
    ap = np.sum(precision_interp * recall_diff)
    return ap

# Compare methods
np.random.seed(42)
n = 200
pos_scores = np.random.beta(5, 2, 20)    # 20 positives
neg_scores = np.random.beta(2, 5, 180)   # 180 negatives
scores = np.concatenate([pos_scores, neg_scores])
labels = np.array([1]*20 + [0]*180)

print("AP Computation Methods Comparison:")
print(f"  All-points (sklearn):      {ap_all_points(scores, labels):.4f}")
print(f"  11-point interpolation:    {ap_11_point(scores, labels):.4f}")
print(f"  All-point interpolation:   {ap_all_point_interp(scores, labels):.4f}")
```

| Benchmark/Library | Method | Notes |
|---|---|---|
| scikit-learn | All-points, no interpolation | Trapezoidal integration on raw points |
| TREC (early) | 11-point interpolation | Standard recall levels |
| Pascal VOC 2007 | 11-point interpolation | Object detection |
| Pascal VOC 2010+ | All-point interpolation | More accurate for ranking |
| MS COCO | 101-point interpolation | At recall 0, 0.01, ..., 1.00 |
| ImageNet | All-point interpolation | Similar to VOC 2010+ |
When comparing to published results, ensure you use the SAME AP computation method as the benchmark. A model might show AP = 0.45 with one method and AP = 0.42 with another. Always check the evaluation protocol.
Both AP and ROC-AUC summarize classifier performance as a scalar, but they measure different things and behave differently.
| Property | AP (PR-AUC) | AUC-ROC |
|---|---|---|
| Underlying curve | Precision vs Recall | TPR vs FPR |
| Sensitive to class imbalance? | Yes — decreases with more negatives | No — invariant |
| Random baseline | ≈ base rate (varies) | 0.5 (fixed) |
| Focus | Positive predictions quality | Ranking/discrimination |
| When scores matter most | Among positives and neighbors | Across entire score range |
| Affected by true negatives? | No (ignores TN) | Yes (via FPR) |
AP and AUC often tell different stories, especially with imbalanced data:
Scenario: Rare Event Detection (1% positive rate)
Suppose a classifier achieves a strong AUC-ROC but a much lower AP on this data.
Interpretation: the classifier ranks positives above negatives well overall (hence the strong AUC), but because negatives outnumber positives 99 to 1, even a small false positive rate pushes many negatives into the top of the ranking, so precision on positive predictions stays modest (hence the lower AP).
The AUC looks at all pairs equally; AP focuses on what happens when you make positive predictions (which is often what matters operationally).
There's no exact formula relating AP to AUC, but bounds exist:
$$\text{AP} \geq \frac{\pi}{\pi + (1-\pi) \cdot \text{LR}_{\text{min}}}$$
where π = base rate and LR_min = minimum likelihood ratio.
In practice, the AP that corresponds to a given AUC depends strongly on the base rate: the rarer the positive class, the lower the AP tends to be for the same discrimination ability, as the demonstration below shows.
```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def demonstrate_ap_auc_divergence():
    """
    Show how AP and AUC diverge under class imbalance.
    """
    np.random.seed(42)

    # Fixed discrimination (same separation between class score distributions)
    pos_mean, neg_mean = 0.7, 0.3
    std = 0.2

    results = []
    for pos_rate in [0.50, 0.20, 0.10, 0.05, 0.01]:
        n_total = 10000
        n_pos = int(n_total * pos_rate)
        n_neg = n_total - n_pos

        pos_scores = np.clip(np.random.normal(pos_mean, std, n_pos), 0, 1)
        neg_scores = np.clip(np.random.normal(neg_mean, std, n_neg), 0, 1)

        scores = np.concatenate([pos_scores, neg_scores])
        labels = np.array([1]*n_pos + [0]*n_neg)

        auc = roc_auc_score(labels, scores)
        ap = average_precision_score(labels, scores)

        results.append({
            'pos_rate': pos_rate,
            'auc': auc,
            'ap': ap,
            'ratio': ap / auc
        })

    print("Positive Rate | AUC-ROC | Avg Prec | AP/AUC Ratio")
    print("-" * 55)
    for r in results:
        print(f"    {r['pos_rate']:>4.0%}      |  {r['auc']:.3f}  |  {r['ap']:.3f}   |     {r['ratio']:.2f}")

demonstrate_ap_auc_divergence()
```

For imbalanced problems, always report BOTH metrics. AUC tells stakeholders about discrimination ability; AP tells them about operational precision when making positive predictions. Together, they provide a complete picture.
In multi-class or multi-label settings, Mean Average Precision (mAP) extends AP to aggregate performance across classes.
$$\text{mAP} = \frac{1}{C} \sum_{c=1}^{C} \text{AP}_c$$
where C is the number of classes and AP_c is the Average Precision for class c.
Object Detection: Each object class (car, person, bicycle, ...) has its own AP. mAP summarizes detection performance across all classes.
Multi-Label Classification: Each label is treated as a separate binary classification. mAP aggregates across labels.
Information Retrieval: Each query has its own AP (for retrieved documents). mAP aggregates across queries.
```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(y_true, y_scores, class_axis=1):
    """
    Compute Mean Average Precision for multi-label classification.

    Parameters:
    -----------
    y_true : array of shape (n_samples, n_classes)
        Binary ground truth labels
    y_scores : array of shape (n_samples, n_classes)
        Predicted scores for each class

    Returns:
    --------
    mAP : float
        Mean Average Precision
    per_class_ap : array
        AP for each class
    """
    y_true = np.array(y_true)
    y_scores = np.array(y_scores)
    n_classes = y_true.shape[1]

    per_class_ap = []
    for c in range(n_classes):
        if np.sum(y_true[:, c]) > 0:  # Skip classes with no positives
            ap_c = average_precision_score(y_true[:, c], y_scores[:, c])
            per_class_ap.append(ap_c)

    mAP = np.mean(per_class_ap)
    return mAP, np.array(per_class_ap)

# Example: Multi-label classification with 5 classes
np.random.seed(42)
n_samples = 500
n_classes = 5

# Simulate predictions (some classes harder than others)
y_true = np.random.binomial(1, 0.1, (n_samples, n_classes))  # Sparse labels
base_scores = np.random.rand(n_samples, n_classes)
# Make scores correlate with true labels
y_scores = base_scores + y_true * np.random.uniform(0.2, 0.5, n_classes)
y_scores = np.clip(y_scores, 0, 1)

mAP, per_class_ap = mean_average_precision(y_true, y_scores)

print(f"Mean Average Precision: {mAP:.4f}")
print("\nPer-class AP:")
for i, ap in enumerate(per_class_ap):
    print(f"  Class {i}: {ap:.4f}")
```

In object detection (COCO, VOC), mAP also depends on IoU (Intersection over Union) thresholds. COCO reports mAP@[.5:.95] — the mAP averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05. This provides a stricter evaluation than mAP@0.5 alone.
Sometimes we only care about performance in the top-k predictions. Ranking metrics derived from AP address this.
Simply the precision in the top k predictions:
$$\text{Precision@}k = \frac{|{\text{positives in top } k}|}{k}$$
Use case: Search results — users typically only view the first page (k ≈ 10).
Fraction of all positives found in top k:
$$\text{Recall@}k = \frac{|{\text{positives in top } k}|}{P}$$
Use case: Ensuring coverage — finding most relevant items in top results.
AP computed only over the first k predictions:
$$\text{AP@}k = \frac{1}{\min(P, k)} \sum_{i=1}^{k} \text{Precision}(i) \cdot \mathbb{1}[y_i = 1]$$
Note: The denominator is min(P, k) because the top k can contain at most min(P, k) positives: no more than exist in total (P), and no more than the list length (k).
```python
import numpy as np

def precision_at_k(scores, labels, k):
    """Precision in top k predictions."""
    sorted_indices = np.argsort(-scores)
    top_k_labels = labels[sorted_indices[:k]]
    return np.sum(top_k_labels) / k

def recall_at_k(scores, labels, k):
    """Fraction of positives found in top k."""
    P = np.sum(labels)
    if P == 0:
        return 0.0
    sorted_indices = np.argsort(-scores)
    top_k_labels = labels[sorted_indices[:k]]
    return np.sum(top_k_labels) / P

def average_precision_at_k(scores, labels, k):
    """Average Precision restricted to top k."""
    sorted_indices = np.argsort(-scores)
    sorted_labels = labels[sorted_indices]
    P = np.sum(labels)
    if P == 0:
        return 0.0

    tp = 0
    precision_sum = 0.0
    for i in range(min(k, len(sorted_labels))):
        if sorted_labels[i] == 1:
            tp += 1
            precision_sum += tp / (i + 1)

    return precision_sum / min(P, k)

# Example: Recommendation system
np.random.seed(42)
n_items = 100
n_relevant = 10  # 10 items are relevant

scores = np.random.rand(n_items)
labels = np.zeros(n_items)
relevant_indices = np.random.choice(n_items, n_relevant, replace=False)
labels[relevant_indices] = 1

# Bias scores toward relevant items (imperfect model)
scores[relevant_indices] += np.random.uniform(0.2, 0.6, n_relevant)
scores = np.clip(scores, 0, 1)

print("Ranking Metrics at Different k Values:")
print("-" * 50)
print(f"{'k':<5} | {'P@k':<8} | {'R@k':<8} | {'AP@k':<8}")
print("-" * 50)

for k in [5, 10, 20, 50]:
    p_at_k = precision_at_k(scores, labels, k)
    r_at_k = recall_at_k(scores, labels, k)
    ap_at_k = average_precision_at_k(scores, labels, k)
    print(f"{k:<5} | {p_at_k:<8.3f} | {r_at_k:<8.3f} | {ap_at_k:<8.3f}")

print(f"\nFull AP: {average_precision_at_k(scores, labels, n_items):.3f}")
```

Choose k based on your application: How many results/predictions will users actually see or act on? For search: k ≈ 10-20. For recommendation: k might be 5-10. For automated pipelines: k might equal your processing capacity.
Like any sample statistic, AP has variance and sampling distribution properties that affect interpretation.
AP variance increases as the number of positives in the test set shrinks, whether because the overall sample is small or because the base rate is low.
Bootstrap resampling provides robust confidence intervals:
```python
import numpy as np
from sklearn.metrics import average_precision_score
from sklearn.utils import resample

def ap_bootstrap_ci(scores, labels, n_bootstrap=1000, confidence_level=0.95,
                    stratified=True):
    """
    Compute AP confidence interval via bootstrap.

    Parameters:
    -----------
    scores : array-like
        Classifier scores
    labels : array-like
        True binary labels
    n_bootstrap : int
        Number of bootstrap samples
    confidence_level : float
        Confidence level (e.g., 0.95 for 95% CI)
    stratified : bool
        Whether to maintain class balance in resampling

    Returns:
    --------
    dict with ap_point, ci_lower, ci_upper, se
    """
    scores = np.array(scores)
    labels = np.array(labels)

    ap_point = average_precision_score(labels, scores)

    bootstrap_aps = []
    for _ in range(n_bootstrap):
        if stratified:
            idx = resample(range(len(labels)), stratify=labels, replace=True)
        else:
            idx = resample(range(len(labels)), replace=True)
        ap_boot = average_precision_score(labels[idx], scores[idx])
        bootstrap_aps.append(ap_boot)

    bootstrap_aps = np.array(bootstrap_aps)
    alpha = 1 - confidence_level
    ci_lower = np.percentile(bootstrap_aps, 100 * alpha / 2)
    ci_upper = np.percentile(bootstrap_aps, 100 * (1 - alpha / 2))
    se = np.std(bootstrap_aps)

    return {
        'ap_point': ap_point,
        'ci_lower': ci_lower,
        'ci_upper': ci_upper,
        'se': se,
        'bootstrap_aps': bootstrap_aps
    }

# Example with different sample sizes
np.random.seed(42)

print("AP Confidence Intervals vs Sample Size:")
print("-" * 60)
print(f"{'N':<8} | {'N_pos':<8} | {'AP':<8} | {'95% CI':<18} | {'SE':<8}")
print("-" * 60)

for n in [100, 500, 2000]:
    n_pos = int(n * 0.1)  # 10% positive rate
    pos_scores = np.random.beta(5, 2, n_pos)
    neg_scores = np.random.beta(2, 5, n - n_pos)
    scores = np.concatenate([pos_scores, neg_scores])
    labels = np.array([1]*n_pos + [0]*(n - n_pos))

    result = ap_bootstrap_ci(scores, labels, n_bootstrap=500)
    ci_str = f"[{result['ci_lower']:.3f}, {result['ci_upper']:.3f}]"
    print(f"{n:<8} | {n_pos:<8} | {result['ap_point']:.3f}   | {ci_str:<18} | {result['se']:.4f}")
```

To test if one classifier has significantly higher AP than another, use paired bootstrap testing:
```python
import numpy as np
from sklearn.metrics import average_precision_score
from sklearn.utils import resample

def ap_paired_bootstrap_test(scores1, scores2, labels, n_bootstrap=1000):
    """
    Test if classifier 1 has significantly higher AP than classifier 2.
    Uses paired bootstrap on the same data.

    Returns:
    --------
    dict with observed difference, p-value, and CI for difference
    """
    scores1 = np.asarray(scores1)
    scores2 = np.asarray(scores2)
    labels = np.asarray(labels)

    ap1 = average_precision_score(labels, scores1)
    ap2 = average_precision_score(labels, scores2)
    observed_diff = ap1 - ap2

    bootstrap_diffs = []
    for _ in range(n_bootstrap):
        # Resample the same indices for both classifiers (paired bootstrap)
        idx = resample(range(len(labels)), stratify=labels)
        ap1_boot = average_precision_score(labels[idx], scores1[idx])
        ap2_boot = average_precision_score(labels[idx], scores2[idx])
        bootstrap_diffs.append(ap1_boot - ap2_boot)

    bootstrap_diffs = np.array(bootstrap_diffs)

    # Two-sided p-value: proportion of bootstrap diffs on wrong side of 0
    if observed_diff >= 0:
        p_value = 2 * np.mean(bootstrap_diffs <= 0)
    else:
        p_value = 2 * np.mean(bootstrap_diffs >= 0)
    p_value = min(p_value, 1.0)

    return {
        'ap1': ap1,
        'ap2': ap2,
        'diff': observed_diff,
        'diff_ci_lower': np.percentile(bootstrap_diffs, 2.5),
        'diff_ci_upper': np.percentile(bootstrap_diffs, 97.5),
        'p_value': p_value,
        'significant': p_value < 0.05
    }
```

With small test sets (especially few positives), AP confidence intervals are wide. An observed AP difference of 0.05 might not be significant with 50 positives, but highly significant with 500. Always report uncertainty.
Understanding what AP values mean in practice helps communicate results to stakeholders.
Unlike AUC where 0.9 is "excellent" universally, AP values depend heavily on base rate:
| Base Rate | "Random" AP | "Good" AP | "Excellent" AP |
|---|---|---|---|
| 50% | ~0.50 | 0.75+ | 0.90+ |
| 10% | ~0.10 | 0.40+ | 0.70+ |
| 1% | ~0.01 | 0.20+ | 0.50+ |
| 0.1% | ~0.001 | 0.05+ | 0.20+ |
Key insight: You must compare AP to the base rate. AP = 0.30 with 1% positives is excellent; with 30% positives, it's close to random.
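To see why the "random" column tracks the base rate, here is a small simulation sketch, assuming uninformative (uniform random) scores; exact values will vary with the random seed:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# With uninformative scores, AP fluctuates around the positive base rate
rng = np.random.default_rng(0)
n = 20000

for base_rate in [0.5, 0.1, 0.01]:
    labels = (rng.random(n) < base_rate).astype(int)  # labels at the target base rate
    scores = rng.random(n)                            # scores carry no signal
    ap = average_precision_score(labels, scores)
    print(f"base rate {base_rate:>5.1%} -> random-classifier AP ≈ {ap:.3f}")
```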
AP can be interpreted relative to random via "lift":
$$\text{Lift} = \frac{\text{AP}}{\text{Base Rate}}$$
Example: With a 2% base rate and AP = 0.40, the lift is 0.40 / 0.02 = 20, meaning the classifier delivers about 20 times the precision of random ordering.
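A trivial sketch of that arithmetic, using the numbers above:

```python
base_rate = 0.02   # 2% positives
ap = 0.40          # measured Average Precision

lift = ap / base_rate
print(f"Lift over random: {lift:.0f}x")  # 20x
```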
AP ≈ the average precision you'd observe if you processed positive predictions in score order. So AP = 0.40 roughly means: 'If I work through predictions from highest to lowest score, I'll be right about 40% of the time on average.' This operational framing often resonates with business stakeholders.
We've developed a comprehensive understanding of Average Precision, from intuition to computation methods to practical interpretation.
What's next:
The next page covers Curve Comparison—systematic methods for comparing ROC and PR curves between classifiers, including visual comparison techniques, statistical tests, and decision frameworks for model selection when curves cross.
You now understand Average Precision thoroughly: its meaning, computation methods, relationship to AUC, extensions like mAP and AP@k, and practical interpretation. You can confidently use AP for model evaluation on imbalanced data.