Precision@k suffers from a fundamental limitation: it requires choosing a specific cutoff $k$, and different choices can lead to contradictory conclusions about which ranking system is superior. A system might excel at P@5 but falter at P@10, leaving practitioners uncertain about true quality.
Mean Average Precision (MAP) resolves this dilemma through an elegant mathematical construction. Rather than evaluating precision at a single arbitrary cutoff, MAP computes precision at every position where a relevant item appears and averages these values. This produces a single summary statistic that simultaneously captures how early relevant items appear and how completely they are retrieved, without committing to any particular cutoff.
MAP has become the de facto standard metric in information retrieval research, serving as the primary evaluation measure in the Text REtrieval Conference (TREC) for decades. Understanding MAP deeply is essential for anyone working in search, recommendation, or document retrieval.
By the end of this page, you will understand MAP from first principles—its definition as an average of Average Precision values, its relationship to the precision-recall curve, its statistical properties, and its practical implementation. You will be able to compute, interpret, and critically evaluate MAP in both research and production contexts.
Before defining MAP, we must first understand Average Precision (AP)—the metric computed for a single query. AP is the heart of MAP; MAP is simply the mean of AP values across queries.
The Core Idea:
Instead of computing precision at one cutoff, AP computes precision at every position where a relevant item is found, then averages these values:
$$\text{AP} = \frac{1}{|R|} \sum_{k=1}^{n} \text{P@}k \cdot \text{rel}_k$$
where:
- $n$ is the length of the ranked list,
- $\text{rel}_k$ is 1 if the item at rank $k$ is relevant and 0 otherwise,
- $\text{P@}k$ is the precision among the top $k$ items, and
- $|R|$ is the total number of relevant items for the query.

The factor $\text{rel}_k$ acts as a filter: we only include precision values at positions where relevant items appear. The division by $|R|$ normalizes the sum.
Expanded Form:
$$\text{AP} = \frac{1}{|R|} \sum_{k : \text{rel}_k = 1} \text{P@}k = \frac{1}{|R|} \sum_{i=1}^{|R|} \text{P@rank}(r_i)$$
where $\text{rank}(r_i)$ is the position of the $i$-th relevant item in the ranked list.
| Rank k | Document | Relevant? | P@k | P@k × rel_k | Running Sum |
|---|---|---|---|---|---|
| 1 | Doc A | ✓ Yes | 1/1 = 1.000 | 1.000 | 1.000 |
| 2 | Doc B | ✗ No | 1/2 = 0.500 | 0.000 | 1.000 |
| 3 | Doc C | ✓ Yes | 2/3 = 0.667 | 0.667 | 1.667 |
| 4 | Doc D | ✗ No | 2/4 = 0.500 | 0.000 | 1.667 |
| 5 | Doc E | ✓ Yes | 3/5 = 0.600 | 0.600 | 2.267 |
| 6 | Doc F | ✗ No | 3/6 = 0.500 | 0.000 | 2.267 |
| 7 | Doc G | ✓ Yes | 4/7 = 0.571 | 0.571 | 2.838 |
| 8 | Doc H | ✗ No | 4/8 = 0.500 | 0.000 | 2.838 |
Computing AP from the Example:
In this example, all $|R| = 4$ relevant items appear in the ranking, at positions 1, 3, 5, and 7, so:
$$\text{AP} = \frac{1.000 + 0.667 + 0.600 + 0.571}{4} = \frac{2.838}{4} = 0.7095$$
Intuition:
AP rewards rankings where relevant items appear early. Compare two rankings of the same two relevant items: placed at positions 1 and 2 they yield $\text{AP} = 1.0$, while placed at positions 3 and 4 they yield $\text{AP} = \frac{1}{2}\left(\frac{1}{3} + \frac{2}{4}\right) \approx 0.417$.
The earlier relevant items appear, the higher the precision values at those positions, and thus the higher AP.
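As a quick check, the minimal sketch below (our own helper, using only NumPy) recomputes AP for the worked example above and for the early-versus-late comparison:

```python
import numpy as np

def average_precision(relevance, total_relevant):
    """AP = mean of P@k over the positions k that hold relevant items."""
    relevance = np.asarray(relevance)
    precision_at_k = np.cumsum(relevance) / np.arange(1, len(relevance) + 1)
    return float(np.sum(precision_at_k * relevance) / total_relevant)

# Worked example from the table above (Docs A-H, |R| = 4)
print(average_precision([1, 0, 1, 0, 1, 0, 1, 0], 4))   # 0.7095

# Early vs. late: same two relevant items, different positions
print(average_precision([1, 1, 0, 0], 2))                # 1.0
print(average_precision([0, 0, 1, 1], 2))                # (1/3 + 2/4) / 2 ≈ 0.4167
```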
If some relevant items don't appear in the ranked list at all (e.g., the list is truncated), AP treats them as having precision 0 at infinite rank. The denominator remains |R| (total relevant), not just the relevant items found. This penalizes systems that fail to retrieve all relevant items.
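For example, with $|R| = 4$ but only two relevant items retrieved, at ranks 1 and 3, $\text{AP} = \frac{1}{4}\left(1 + \frac{2}{3}\right) \approx 0.417$; dividing by the two retrieved items instead would inflate the score to $\approx 0.833$.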
One of the most beautiful properties of Average Precision is its geometric interpretation: AP approximates the area under the precision-recall curve.
The Precision-Recall Curve for Rankings:
As we traverse the ranked list from position 1 to $n$, we can plot recall $\text{R@}k$ on the x-axis against precision $\text{P@}k$ on the y-axis.
This traces out a stepped curve. Importantly, each relevant item moves the curve one step to the right (recall increases by $1/|R|$), while each irrelevant item pulls precision downward at the current recall level.
AP as a Riemann Sum:
The AP formula can be rewritten as:
$$\text{AP} = \sum_{k=1}^{n} \text{P@}k \cdot \Delta\text{R@}k$$
where $\Delta\text{R@}k = \text{R@}k - \text{R@}(k-1)$ is the change in recall at position $k$.
Since $\Delta\text{R@}k = \frac{\text{rel}_k}{|R|}$ (i.e., $1/|R|$ when relevant, 0 otherwise), this becomes:
$$\text{AP} = \sum_{k : \text{rel}_k=1} \text{P@}k \cdot \frac{1}{|R|} = \frac{1}{|R|} \sum_{k : \text{rel}_k=1} \text{P@}k$$
This is exactly the original AP formula—proving that AP computes the area under the stepped precision-recall curve using a right-endpoint Riemann sum.
```python
import numpy as np
import matplotlib.pyplot as plt

def compute_pr_curve_and_ap(relevance: list[int], total_relevant: int):
    """
    Compute precision-recall curve and demonstrate AP as area under curve.

    Args:
        relevance: Binary relevance of ranked items (1=relevant, 0=not)
        total_relevant: Total relevant items |R|

    Returns:
        Dictionary with precision, recall, AP, and area verification
    """
    n = len(relevance)
    relevance = np.array(relevance)

    # Compute precision and recall at each position
    cumsum_rel = np.cumsum(relevance)
    positions = np.arange(1, n + 1)
    precision = cumsum_rel / positions
    recall = cumsum_rel / total_relevant

    # Compute AP: sum of P@k at positions where item is relevant
    ap_terms = precision * relevance
    ap = ap_terms.sum() / total_relevant

    # Compute area under PR curve using rectangles:
    # for each relevant item, add a rectangle of width = delta_recall, height = precision
    delta_recall = relevance / total_relevant
    area_riemann = np.sum(precision * delta_recall)

    # Verify they match
    print(f"AP (formula):   {ap:.6f}")
    print(f"Area (Riemann): {area_riemann:.6f}")
    print(f"Match: {np.isclose(ap, area_riemann)}")

    return {
        'precision': precision,
        'recall': recall,
        'ap': ap,
        'area': area_riemann
    }

def plot_pr_curve_with_area(relevance: list[int], total_relevant: int):
    """Visualize PR curve and show AP as area under curve."""
    results = compute_pr_curve_and_ap(relevance, total_relevant)

    # Prepend (0, 1) point for a complete curve
    recall = np.concatenate([[0], results['recall']])
    precision = np.concatenate([[1], results['precision']])

    fig, ax = plt.subplots(figsize=(10, 6))

    # Plot stepped PR curve
    ax.step(recall, precision, where='post', linewidth=2, color='blue', label='PR Curve')

    # Fill area under curve
    ax.fill_between(recall, precision, step='post', alpha=0.3, color='blue',
                    label=f'AP = {results["ap"]:.4f}')

    # Mark points where relevant items occur
    rel_positions = np.where(np.array(relevance) == 1)[0]
    for pos in rel_positions:
        ax.scatter(results['recall'][pos], results['precision'][pos],
                   color='red', s=100, zorder=5)

    ax.set_xlabel('Recall', fontsize=12)
    ax.set_ylabel('Precision', fontsize=12)
    ax.set_title('Precision-Recall Curve with AP as Area Under Curve', fontsize=14)
    ax.set_xlim([0, 1.05])
    ax.set_ylim([0, 1.05])
    ax.legend(fontsize=11)
    ax.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig('pr_curve_with_ap.png', dpi=150)
    plt.show()

# Example: Demonstrate with a realistic ranking
relevance = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0]  # 6 relevant in top 15
total_relevant = 8  # But 8 relevant exist in total

print("Demonstrating AP as Area Under PR Curve")
print("=" * 50)
results = compute_pr_curve_and_ap(relevance, total_relevant)

# Note: we only found 6 of 8 relevant items.
# The 2 missing items contribute 0 to AP but increase the denominator.
print(f"\nRelevant found in ranking: {sum(relevance)}")
print(f"Total relevant (|R|): {total_relevant}")
print(f"Recall achieved: {results['recall'][-1]:.3f}")
```

Some implementations use 'interpolated AP', where precision at each recall level is replaced by the maximum precision at any recall level ≥ that value. This produces a non-decreasing precision curve. The 11-point interpolated AP (measuring at recall = 0.0, 0.1, 0.2, ..., 1.0) was common historically but is less used today. Modern evaluations typically use non-interpolated AP.
A ranking system handles many queries, each with its own ranked list and set of relevant items. Mean Average Precision (MAP) averages AP across all queries:
$$\text{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \text{AP}(q)$$
where $Q$ is the set of queries and $\text{AP}(q)$ is the average precision for query $q$.
Why Average, Not Sum?
Averaging normalizes for the number of queries, making MAP comparable across different test sets. A system of a given quality should score roughly the same MAP whether it is evaluated on 1,000 queries or on 100 queries of similar difficulty.
Weighting Considerations:
The standard MAP gives equal weight to each query. This means a query with a single relevant item counts exactly as much as a query with hundreds, and a rare query counts as much as one that dominates production traffic.
Some alternatives weight queries by frequency (traffic), by the number of relevant items, or by business importance.
The standard unweighted MAP remains most common in research settings.
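As an illustration only, here is a minimal sketch of a weighted variant; the `weights` argument and the traffic-based weighting are hypothetical, not a standard definition:

```python
import numpy as np

def weighted_map(ap_values, weights=None):
    """Weighted mean of per-query AP; uniform weights reproduce standard MAP."""
    ap_values = np.asarray(ap_values, dtype=float)
    if weights is None:
        weights = np.ones_like(ap_values)          # standard (unweighted) MAP
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * ap_values) / np.sum(weights))

ap = [0.9, 0.4, 0.7]
print(weighted_map(ap))                 # plain MAP = 0.6667
print(weighted_map(ap, [100, 10, 1]))   # weighted by hypothetical query traffic
```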
```python
import numpy as np
from typing import List, Dict, Any

def average_precision(relevance: List[int], total_relevant: int) -> float:
    """
    Compute Average Precision for a single query.

    Args:
        relevance: Binary relevance of ranked items [1, 0, 1, ...]
        total_relevant: Total relevant items |R| for this query

    Returns:
        AP value in [0, 1], or 0 if no relevant items
    """
    if total_relevant == 0:
        return 0.0

    relevance = np.array(relevance)
    n = len(relevance)

    # Cumulative sum of relevant items at each position
    cumsum = np.cumsum(relevance)

    # Precision at each position
    positions = np.arange(1, n + 1)
    precision = cumsum / positions

    # Sum precision at positions where item is relevant
    precision_sum = np.sum(precision * relevance)

    # Divide by total relevant (not just those found)
    return precision_sum / total_relevant

def mean_average_precision(query_results: List[Dict[str, Any]]) -> Dict[str, float]:
    """
    Compute MAP across multiple queries.

    Args:
        query_results: List of dicts with 'relevance' and 'total_relevant'

    Returns:
        Dict with MAP and per-query AP values
    """
    ap_values = []

    for qr in query_results:
        ap = average_precision(qr['relevance'], qr['total_relevant'])
        ap_values.append(ap)

    map_score = np.mean(ap_values)

    return {
        'MAP': map_score,
        'per_query_AP': ap_values,
        'num_queries': len(ap_values),
        'std_AP': np.std(ap_values),
        'min_AP': np.min(ap_values),
        'max_AP': np.max(ap_values)
    }

# Example: Multiple queries with varying difficulty
queries = [
    {
        'query_id': 'q1',
        'relevance': [1, 1, 0, 1, 0],        # Easy: 3 relevant, all found early
        'total_relevant': 3
    },
    {
        'query_id': 'q2',
        'relevance': [0, 0, 1, 0, 1, 0, 1],  # Medium: relevant items scattered
        'total_relevant': 4                  # Note: 1 relevant not in top 7
    },
    {
        'query_id': 'q3',
        'relevance': [0, 0, 0, 0, 1, 1],     # Hard: relevant items ranked low
        'total_relevant': 3                  # Note: 1 relevant not found
    },
    {
        'query_id': 'q4',
        'relevance': [1, 1, 1, 1, 1],        # Perfect: all relevant at top
        'total_relevant': 5
    }
]

print("Mean Average Precision Computation")
print("=" * 60)

for q in queries:
    ap = average_precision(q['relevance'], q['total_relevant'])
    found = sum(q['relevance'])
    total = q['total_relevant']
    print(f"{q['query_id']}: AP = {ap:.4f} (found {found}/{total} relevant)")

results = mean_average_precision(queries)
print(f"\nMAP = {results['MAP']:.4f}")
print(f"Std(AP) = {results['std_AP']:.4f}")
print(f"AP range: [{results['min_AP']:.4f}, {results['max_AP']:.4f}]")
```

| MAP Range | Interpretation | Quality Assessment |
|---|---|---|
| 0.9 - 1.0 | Exceptional: Most relevant items ranked first | State-of-the-art performance |
| 0.7 - 0.9 | Strong: Good precision across recall levels | Publication-quality results |
| 0.5 - 0.7 | Moderate: Room for improvement | Acceptable for many applications |
| 0.3 - 0.5 | Weak: Significant relevant items ranked poorly | Needs model improvement |
| < 0.3 | Poor: Little better than random | Fundamental issues |
Understanding the mathematical properties of AP and MAP helps predict their behavior and avoid misinterpretation.
Property 1: Bounds
$$0 \leq \text{AP} \leq 1$$
Property 2: Range of AP Given Partial Retrieval
If only $r$ of $|R|$ relevant items appear in the ranking:
$$\text{AP}_{\text{max}} = \frac{r}{|R|}$$ (achieved when the $r$ retrieved relevant items occupy positions 1 through $r$)
The $|R| - r$ missing items contribute 0 to the numerator but remain in the denominator.
Property 3: Sensitivity to Ranking Changes
Swapping adjacent items affects AP only when a relevant and an irrelevant item trade places: moving a relevant item up increases AP, moving it down decreases AP, and swapping two relevant (or two irrelevant) items leaves AP unchanged.
Property 4: Relationship to Break-Even Point
The Break-Even Point (BEP) is the recall level where precision equals recall. On the PR curve, this is where the curve crosses the diagonal. AP is influenced by but not equal to BEP.
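As an illustration, here is a minimal sketch of one way to locate the break-even point on the stepped curve; the function name and the closest-position heuristic are our own simplification:

```python
import numpy as np

def break_even_point(relevance, total_relevant):
    """Return (precision, recall) at the position where |P@k - R@k| is smallest."""
    relevance = np.asarray(relevance)
    cum = np.cumsum(relevance)
    precision = cum / np.arange(1, len(relevance) + 1)
    recall = cum / total_relevant
    k = int(np.argmin(np.abs(precision - recall)))   # position closest to P = R
    return precision[k], recall[k]

relevance = [1, 0, 1, 1, 0, 0, 1, 0]
p, r = break_even_point(relevance, 4)
print(f"BEP near P = {p:.3f}, R = {r:.3f}")   # P = R = 0.750 at rank 4
```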
```python
import numpy as np
from itertools import permutations

def ap_quick(relevance, total_relevant):
    """Quick AP computation."""
    relevance = np.array(relevance)
    if total_relevant == 0:
        return 0.0
    cumsum = np.cumsum(relevance)
    precision = cumsum / np.arange(1, len(relevance) + 1)
    return np.sum(precision * relevance) / total_relevant

# Property 1: Demonstrate bounds
print("Property 1: AP Bounds")
print("-" * 40)

# Perfect ranking: all relevant first
perfect = [1, 1, 1, 1, 0, 0, 0, 0]
print(f"Perfect ranking [1,1,1,1,0,0,0,0]: AP = {ap_quick(perfect, 4):.4f}")

# Worst possible: all relevant last
worst = [0, 0, 0, 0, 1, 1, 1, 1]
print(f"Worst ranking [0,0,0,0,1,1,1,1]: AP = {ap_quick(worst, 4):.4f}")

# Property 2: Partial retrieval bounds
print("\nProperty 2: Partial Retrieval")
print("-" * 40)

# Only 2 of 4 relevant found
partial = [1, 0, 1, 0, 0]  # Found 2 at positions 1 and 3
total_relevant = 4         # But 4 exist
ap_partial = ap_quick(partial, total_relevant)
ap_max_theoretical = 2 / 4  # Best if both at positions 1, 2
print(f"Found 2 of 4 relevant at positions 1,3: AP = {ap_partial:.4f}")
print(f"Max AP with 2 of 4 found (positions 1,2): {ap_quick([1,1,0,0,0], 4):.4f}")
print(f"Theoretical max = r/|R| = {ap_max_theoretical:.4f}")

# Property 3: Sensitivity to swaps
print("\nProperty 3: Sensitivity to Swaps")
print("-" * 40)

original = [1, 0, 1, 0, 1, 0]
print(f"Original [1,0,1,0,1,0]: AP = {ap_quick(original, 3):.4f}")

# Swap positions 1 and 2 (relevant <-> irrelevant)
swapped_rel_irrel = [0, 1, 1, 0, 1, 0]
print(f"Swap pos 1,2 [0,1,1,0,1,0]: AP = {ap_quick(swapped_rel_irrel, 3):.4f}")

# Swap positions 1 and 3 (relevant <-> relevant)
swapped_rel_rel = [1, 0, 1, 0, 1, 0]  # Same as original!
print(f"Swap two relevants: AP unchanged = {ap_quick(swapped_rel_rel, 3):.4f}")

# Swap positions 2 and 4 (irrelevant <-> irrelevant)
swapped_irrel_irrel = [1, 0, 1, 0, 1, 0]  # Same as original!
print(f"Swap two irrelevants: AP unchanged = {ap_quick(swapped_irrel_irrel, 3):.4f}")

# Property 4: Exhaustive search for a short list
print("\nProperty 4: AP Distribution Over All Rankings")
print("-" * 40)

# All rankings of 3 relevant + 2 irrelevant items
# This shows the range of possible AP values
items = [1, 1, 1, 0, 0]  # 3 relevant, 2 irrelevant
aps = set()

for perm in permutations(items):
    ap = ap_quick(list(perm), 3)
    aps.add(round(ap, 4))

aps = sorted(aps)
print("For 3 relevant + 2 irrelevant items:")
print(f"Number of distinct AP values: {len(aps)}")
print(f"Range: [{min(aps):.4f}, {max(aps):.4f}]")
print(f"All values: {aps}")
```

MAP exists within an ecosystem of ranking metrics. Understanding its relationships clarifies when to use each metric.
AP vs. P@k and R@k
| Aspect | P@k / R@k | AP |
|---|---|---|
| Cutoff dependency | Requires choosing k | No cutoff needed |
| Position weighting | All positions equal within k | Positions weighted by precision at that point |
| Single summary | Need to report multiple k | Single value summarizes all |
| Interpretability | Intuitive (fraction at k) | Less intuitive (average of precisions) |
AP vs. NDCG
| Aspect | AP | NDCG |
|---|---|---|
| Relevance model | Binary (relevant or not) | Graded (levels of relevance) |
| Position discounting | Implicit via precision | Explicit logarithmic discount |
| Normalization | By total relevant | By ideal ranking score |
| Use case | Document retrieval | Graded recommendations |
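NDCG is covered in depth on the next page; as a brief contrast, the sketch below (our own helper functions, using one common gain/discount formulation for NDCG) shows how AP collapses graded labels to binary while NDCG uses them directly:

```python
import numpy as np

def average_precision(binary_rel, total_relevant):
    binary_rel = np.asarray(binary_rel)
    p_at_k = np.cumsum(binary_rel) / np.arange(1, len(binary_rel) + 1)
    return float(np.sum(p_at_k * binary_rel) / total_relevant)

def ndcg(graded_rel):
    """NDCG with the common (2^rel - 1) / log2(rank + 1) formulation."""
    graded_rel = np.asarray(graded_rel, dtype=float)
    discounts = np.log2(np.arange(2, len(graded_rel) + 2))
    dcg = np.sum((2 ** graded_rel - 1) / discounts)
    idcg = np.sum((2 ** np.sort(graded_rel)[::-1] - 1) / discounts)
    return float(dcg / idcg) if idcg > 0 else 0.0

graded = [3, 0, 1, 2, 0]                       # graded relevance levels 0-3
binary = [1 if g > 0 else 0 for g in graded]   # collapse to binary for AP
print(f"AP   = {average_precision(binary, sum(binary)):.4f}")   # ignores grade magnitudes
print(f"NDCG = {ndcg(graded):.4f}")                             # rewards high grades early
```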
AP and the F1-Relationship
At each relevant position $k$, the precision $\text{P@}k$ and the corresponding recall $\text{R@}k$ are linked: every relevant item raises recall by $1/|R|$. AP samples precision exactly at these recall-improving positions, so, like F1, it rewards systems that are strong on both precision and recall rather than on one alone.
AP vs. MRR (Mean Reciprocal Rank)
| Aspect | AP | MRR |
|---|---|---|
| What it measures | All relevant items | Only first relevant item |
| Use case | Need all relevant items | Only one answer needed |
| Formula | Average of P@k at relevant positions | 1/rank of first relevant |
| Example domain | Document retrieval | Question answering |
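To make the AP vs. MRR contrast concrete, here is a small sketch (our own helper names) showing that MRR ignores everything after the first relevant item while AP does not:

```python
import numpy as np

def average_precision(relevance, total_relevant):
    relevance = np.asarray(relevance)
    p_at_k = np.cumsum(relevance) / np.arange(1, len(relevance) + 1)
    return float(np.sum(p_at_k * relevance) / total_relevant)

def reciprocal_rank(relevance):
    """1 / rank of the first relevant item (0 if none is retrieved)."""
    for k, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / k
    return 0.0

a = [1, 0, 0, 0, 0, 0]   # single relevant item at rank 1
b = [1, 0, 0, 0, 1, 1]   # first relevant at rank 1, two more buried late
print(reciprocal_rank(a), reciprocal_rank(b))                        # 1.0  1.0 (MRR sees no difference)
print(average_precision(a, 1), round(average_precision(b, 3), 4))    # 1.0  0.6333 (AP penalizes the buried items)
```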
Use MAP when: (1) relevance is binary, (2) all relevant items matter, (3) you want a single summary metric, (4) comparing against IR literature. Use NDCG when relevance is graded. Use MRR when only the first result matters. Use P@k/R@k when you need metrics at specific cutoffs that match your UI.
Implementing MAP correctly in production requires careful handling of edge cases, efficiency considerations, and proper statistical reporting.
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187
import numpy as npfrom typing import List, Set, Optional, Tuplefrom dataclasses import dataclassimport warnings @dataclassclass APResult: """Container for Average Precision result with diagnostics.""" ap: float precision_at_relevant: List[float] recall_at_relevant: List[float] relevant_found: int total_relevant: int ranking_length: int @dataclassclass MAPResult: """Container for MAP result with comprehensive statistics.""" map_score: float ap_values: List[float] num_queries: int std: float ci_lower: float ci_upper: float queries_with_no_relevant: int class MAPEvaluator: """ Production-grade MAP evaluator with comprehensive edge case handling. """ @staticmethod def average_precision( ranked_items: List[str], relevant_items: Set[str], cutoff: Optional[int] = None ) -> APResult: """ Compute Average Precision for a single query. Args: ranked_items: Items in ranked order (list of IDs) relevant_items: Set of relevant item IDs cutoff: Optional maximum rank to consider (None = all) Returns: APResult with AP value and diagnostics """ if len(relevant_items) == 0: warnings.warn("No relevant items for this query") return APResult( ap=0.0, precision_at_relevant=[], recall_at_relevant=[], relevant_found=0, total_relevant=0, ranking_length=len(ranked_items) ) # Apply cutoff if specified if cutoff is not None: ranked_items = ranked_items[:cutoff] # Convert to binary relevance array relevance = np.array([item in relevant_items for item in ranked_items], dtype=int) n = len(relevance) total_relevant = len(relevant_items) if n == 0: return APResult( ap=0.0, precision_at_relevant=[], recall_at_relevant=[], relevant_found=0, total_relevant=total_relevant, ranking_length=0 ) # Compute precision and recall at each position cumsum = np.cumsum(relevance) positions = np.arange(1, n + 1) precision = cumsum / positions recall = cumsum / total_relevant # Extract values at relevant positions relevant_mask = relevance == 1 precision_at_relevant = precision[relevant_mask].tolist() recall_at_relevant = recall[relevant_mask].tolist() # Compute AP ap = np.sum(precision * relevance) / total_relevant return APResult( ap=float(ap), precision_at_relevant=precision_at_relevant, recall_at_relevant=recall_at_relevant, relevant_found=int(cumsum[-1]) if n > 0 else 0, total_relevant=total_relevant, ranking_length=n ) @staticmethod def mean_average_precision( ap_results: List[APResult], confidence_level: float = 0.95 ) -> MAPResult: """ Compute MAP with confidence intervals. 
Args: ap_results: List of APResult objects from individual queries confidence_level: Confidence level for CI (default 95%) Returns: MAPResult with MAP and statistical information """ if len(ap_results) == 0: raise ValueError("Cannot compute MAP with no queries") ap_values = [r.ap for r in ap_results] ap_array = np.array(ap_values) # Count queries with no relevant items no_relevant = sum(1 for r in ap_results if r.total_relevant == 0) # Compute statistics map_score = float(np.mean(ap_array)) std = float(np.std(ap_array, ddof=1)) if len(ap_array) > 1 else 0.0 # Confidence interval using t-distribution n = len(ap_array) if n > 1: from scipy import stats t_value = stats.t.ppf((1 + confidence_level) / 2, n - 1) margin = t_value * std / np.sqrt(n) ci_lower = max(0.0, map_score - margin) ci_upper = min(1.0, map_score + margin) else: ci_lower = ci_upper = map_score return MAPResult( map_score=map_score, ap_values=ap_values, num_queries=n, std=std, ci_lower=ci_lower, ci_upper=ci_upper, queries_with_no_relevant=no_relevant ) # Demo usagedef demo(): evaluator = MAPEvaluator() # Simulate search engine evaluation test_queries = [ { 'ranked': ['d1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8'], 'relevant': {'d1', 'd3', 'd5', 'd7'} # AP should be high }, { 'ranked': ['d2', 'd4', 'd1', 'd6', 'd3', 'd8', 'd5', 'd7'], 'relevant': {'d1', 'd3', 'd5', 'd7'} # Same relevant, worse ranking }, { 'ranked': ['d1', 'd2', 'd3'], 'relevant': {'d1', 'd2', 'd3', 'd4', 'd5'} # Only partial retrieval }, ] ap_results = [] for i, q in enumerate(test_queries): result = evaluator.average_precision(q['ranked'], q['relevant']) print(f"Query {i+1}:") print(f" AP = {result.ap:.4f}") print(f" Found {result.relevant_found}/{result.total_relevant} relevant") print(f" Precision at relevant positions: {[f'{p:.3f}' for p in result.precision_at_relevant]}") ap_results.append(result) # Compute MAP map_result = evaluator.mean_average_precision(ap_results) print(f"\nMAP = {map_result.map_score:.4f} ± {1.96*map_result.std/np.sqrt(map_result.num_queries):.4f}") print(f"95% CI: [{map_result.ci_lower:.4f}, {map_result.ci_upper:.4f}]") if __name__ == "__main__": demo()When comparing two ranking systems, raw MAP differences aren't enough—we need to assess statistical significance.
Why Statistical Testing Matters:
If System A has MAP = 0.412 and System B has MAP = 0.398, is A truly better? The difference (0.014) could reflect a genuine improvement, or it could simply be noise arising from the particular sample of queries evaluated.
Common Statistical Tests for MAP:
- Paired t-test: compares mean AP differences across queries
- Wilcoxon signed-rank test: non-parametric alternative (see the sketch after this list)
- Bootstrap confidence intervals: resampling approach
- Randomization test: permutation-based significance (see the sketch after this list)
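The paired t-test and bootstrap are implemented in the larger code example below. As a complement, here is a hedged sketch of the Wilcoxon signed-rank test (via `scipy.stats.wilcoxon`) and a simple sign-flip randomization test; the helper name and permutation count are our own choices:

```python
import numpy as np
from scipy import stats

def wilcoxon_and_randomization(ap_a, ap_b, n_permutations=10000, seed=0):
    """Non-parametric alternatives to the paired t-test on per-query AP."""
    ap_a, ap_b = np.asarray(ap_a), np.asarray(ap_b)
    diffs = ap_a - ap_b

    # Wilcoxon signed-rank test on paired per-query AP values
    w_stat, w_p = stats.wilcoxon(ap_a, ap_b)

    # Randomization (sign-flip permutation) test on the mean difference
    rng = np.random.default_rng(seed)
    observed = abs(diffs.mean())
    signs = rng.choice([-1, 1], size=(n_permutations, len(diffs)))
    perm_means = np.abs((signs * diffs).mean(axis=1))
    perm_p = (np.sum(perm_means >= observed) + 1) / (n_permutations + 1)

    return {'wilcoxon_p': w_p, 'randomization_p': perm_p}

rng = np.random.default_rng(1)
ap_a = np.clip(rng.normal(0.55, 0.15, 50), 0, 1)
ap_b = np.clip(rng.normal(0.48, 0.18, 50), 0, 1)
print(wilcoxon_and_randomization(ap_a, ap_b))
```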
```python
import numpy as np
from scipy import stats

def compare_maps_paired_ttest(ap_system_a: list, ap_system_b: list) -> dict:
    """
    Compare two systems using a paired t-test on per-query AP.

    Args:
        ap_system_a: List of AP values for system A (one per query)
        ap_system_b: List of AP values for system B (one per query)

    Returns:
        Dictionary with test results
    """
    assert len(ap_system_a) == len(ap_system_b), "Must have same queries"

    ap_a = np.array(ap_system_a)
    ap_b = np.array(ap_system_b)
    differences = ap_a - ap_b

    # Paired t-test
    t_stat, p_value = stats.ttest_rel(ap_a, ap_b)

    # Effect size (Cohen's d for paired samples)
    d = np.mean(differences) / np.std(differences, ddof=1)

    return {
        'map_a': np.mean(ap_a),
        'map_b': np.mean(ap_b),
        'mean_difference': np.mean(differences),
        't_statistic': t_stat,
        'p_value': p_value,
        'cohens_d': d,
        'significant_05': p_value < 0.05,
        'significant_01': p_value < 0.01
    }

def compare_maps_bootstrap(ap_system_a: list, ap_system_b: list,
                           n_bootstrap: int = 10000) -> dict:
    """
    Compare two systems using bootstrap confidence intervals.
    """
    ap_a = np.array(ap_system_a)
    ap_b = np.array(ap_system_b)
    n = len(ap_a)

    # Bootstrap the difference in MAP
    rng = np.random.default_rng(42)
    bootstrap_diffs = []

    for _ in range(n_bootstrap):
        indices = rng.choice(n, size=n, replace=True)
        diff = np.mean(ap_a[indices]) - np.mean(ap_b[indices])
        bootstrap_diffs.append(diff)

    bootstrap_diffs = np.array(bootstrap_diffs)

    # Confidence intervals
    ci_95 = np.percentile(bootstrap_diffs, [2.5, 97.5])
    ci_99 = np.percentile(bootstrap_diffs, [0.5, 99.5])

    # "Significant" if CI doesn't contain 0
    sig_95 = ci_95[0] > 0 or ci_95[1] < 0

    return {
        'map_a': np.mean(ap_a),
        'map_b': np.mean(ap_b),
        'mean_difference': np.mean(ap_a) - np.mean(ap_b),
        'bootstrap_mean_diff': np.mean(bootstrap_diffs),
        'ci_95': tuple(ci_95),
        'ci_99': tuple(ci_99),
        'significant_95': sig_95
    }

# Example: Compare two search systems
np.random.seed(42)
n_queries = 50

# System A: Generally better, with some variance
ap_a = np.clip(np.random.normal(0.55, 0.15, n_queries), 0, 1)

# System B: Slightly worse
ap_b = np.clip(np.random.normal(0.48, 0.18, n_queries), 0, 1)

print("System Comparison: Paired t-test")
print("=" * 50)
ttest_result = compare_maps_paired_ttest(ap_a.tolist(), ap_b.tolist())
for key, value in ttest_result.items():
    if isinstance(value, float):
        print(f"{key}: {value:.4f}")
    else:
        print(f"{key}: {value}")

print("\nSystem Comparison: Bootstrap")
print("=" * 50)
boot_result = compare_maps_bootstrap(ap_a.tolist(), ap_b.tolist())
for key, value in boot_result.items():
    if isinstance(value, float):
        print(f"{key}: {value:.4f}")
    elif isinstance(value, tuple):
        print(f"{key}: ({value[0]:.4f}, {value[1]:.4f})")
    else:
        print(f"{key}: {value}")
```

When comparing multiple systems or multiple metrics, apply a correction for multiple comparisons (e.g., Bonferroni, Holm-Bonferroni, or FDR control). Testing 10 systems pairwise involves 45 comparisons; at α = 0.05 you would expect roughly 2 false positives without correction.
MAP is widely used but often miscomputed or misinterpreted. Critical points to avoid errors, all following from the definitions above:
- Divide by $|R|$ (all relevant items for the query), never by the number of relevant items actually retrieved; the sketch below contrasts the two.
- Decide explicitly how to handle queries with no relevant items (skip them or score them 0) and report which convention you used.
- State whether you report interpolated or non-interpolated AP; modern evaluations typically use the non-interpolated form.
- If the ranking is truncated at a cutoff, unretrieved relevant items still count in the denominator, so MAP at a cutoff and full MAP are not interchangeable.
- Report statistical significance (and correct for multiple comparisons) rather than raw MAP differences alone.
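Here is a minimal sketch of the first pitfall, contrasting the correct denominator with the common bug (the function names are ours):

```python
import numpy as np

def ap_correct(relevance, total_relevant):
    """Correct: denominator is |R|, the total number of relevant items."""
    relevance = np.asarray(relevance)
    p_at_k = np.cumsum(relevance) / np.arange(1, len(relevance) + 1)
    return float(np.sum(p_at_k * relevance) / total_relevant)

def ap_buggy(relevance):
    """Buggy: divides by the relevant items actually retrieved."""
    relevance = np.asarray(relevance)
    p_at_k = np.cumsum(relevance) / np.arange(1, len(relevance) + 1)
    return float(np.sum(p_at_k * relevance) / max(relevance.sum(), 1))

ranking = [1, 0, 1, 0, 0]        # 2 of 4 relevant items retrieved
print(ap_correct(ranking, 4))    # 0.4167 -- penalizes the 2 missing items
print(ap_buggy(ranking))         # 0.8333 -- silently inflated
```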
We have established Mean Average Precision as the gold-standard metric for ranked retrieval evaluation. Let's consolidate our understanding:
Mathematical Summary:
$$\text{AP}(q) = \frac{1}{|R_q|} \sum_{k=1}^{n} \text{P@}k \cdot \mathbb{1}[\text{rel}_k = 1]$$
$$\text{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \text{AP}(q)$$
What's Next:
With MAP established, we're ready to examine Normalized Discounted Cumulative Gain (NDCG)—a metric that extends to graded relevance and uses explicit position discounting, making it the standard for recommendation and graded retrieval evaluation.
You now understand Mean Average Precision as the gold-standard metric for binary-relevance ranking evaluation. This metric elegantly synthesizes precision and recall into a single summary statistic that rewards good rankings at all recall levels.