Precision@k suffers from a fundamental limitation: it requires choosing a specific cutoff $k$, and different choices can lead to contradictory conclusions about which ranking system is superior. A system might excel at P@5 but falter at P@10, leaving practitioners uncertain about true quality.
Mean Average Precision (MAP) resolves this dilemma through an elegant mathematical construction. Rather than evaluating precision at a single arbitrary cutoff, MAP computes precision at every position where a relevant item appears and averages these values. This produces a single summary statistic that simultaneously captures how early relevant items appear and how completely they are retrieved, without committing to any particular cutoff.
MAP has become the de facto standard metric in information retrieval research, serving as the primary evaluation measure in the Text REtrieval Conference (TREC) for decades. Understanding MAP deeply is essential for anyone working in search, recommendation, or document retrieval.
By the end of this page, you will understand MAP from first principles—its definition as an average of Average Precision values, its relationship to the precision-recall curve, its statistical properties, and its practical implementation. You will be able to compute, interpret, and critically evaluate MAP in both research and production contexts.
Before defining MAP, we must first understand Average Precision (AP)—the metric computed for a single query. AP is the heart of MAP; MAP is simply the mean of AP values across queries.
The Core Idea:
Instead of computing precision at one cutoff, AP computes precision at every position where a relevant item is found, then averages these values:
$$\text{AP} = \frac{1}{|R|} \sum_{k=1}^{n} \text{P@}k \cdot \text{rel}_k$$
where:
- $n$ is the length of the ranked list,
- $\text{rel}_k$ is 1 if the item at rank $k$ is relevant and 0 otherwise,
- $\text{P@}k$ is the precision among the top $k$ items, and
- $|R|$ is the total number of relevant items for the query.

The factor $\text{rel}_k$ acts as a filter: we only include precision values at positions where relevant items appear. The division by $|R|$ normalizes the sum.
Expanded Form:
$$\text{AP} = \frac{1}{|R|} \sum_{k : \text{rel}_k = 1} \text{P@}k = \frac{1}{|R|} \sum_{i=1}^{|R|} \text{P@rank}(r_i)$$
where $\text{rank}(r_i)$ is the position of the $i$-th relevant item in the ranked list.
| Rank k | Document | Relevant? | P@k | P@k × rel_k | Running Sum |
|---|---|---|---|---|---|
| 1 | Doc A | ✓ Yes | 1/1 = 1.000 | 1.000 | 1.000 |
| 2 | Doc B | ✗ No | 1/2 = 0.500 | 0.000 | 1.000 |
| 3 | Doc C | ✓ Yes | 2/3 = 0.667 | 0.667 | 1.667 |
| 4 | Doc D | ✗ No | 2/4 = 0.500 | 0.000 | 1.667 |
| 5 | Doc E | ✓ Yes | 3/5 = 0.600 | 0.600 | 2.267 |
| 6 | Doc F | ✗ No | 3/6 = 0.500 | 0.000 | 2.267 |
| 7 | Doc G | ✓ Yes | 4/7 = 0.571 | 0.571 | 2.838 |
| 8 | Doc H | ✗ No | 4/8 = 0.500 | 0.000 | 2.838 |
Computing AP from the Example:
In this example, all $|R| = 4$ relevant items appear in the ranking, at positions 1, 3, 5, and 7, so:
$$\text{AP} = \frac{1.000 + 0.667 + 0.600 + 0.571}{4} = \frac{2.838}{4} = 0.7095$$
Intuition:
AP rewards rankings where relevant items appear early. Compare two rankings of the same two relevant items: placed at positions 1 and 2 they yield $\text{AP} = 1.0$, while placed at positions 3 and 4 they yield $\text{AP} = \frac{1}{2}\left(\frac{1}{3} + \frac{2}{4}\right) \approx 0.417$.
The earlier relevant items appear, the higher the precision values at those positions, and thus the higher AP.
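As a quick check, the minimal sketch below (our own helper, using only NumPy) recomputes AP for the worked example above and for the early-versus-late comparison:

```python
import numpy as np

def average_precision(relevance, total_relevant):
    """AP = mean of P@k over the positions k that hold relevant items."""
    relevance = np.asarray(relevance)
    precision_at_k = np.cumsum(relevance) / np.arange(1, len(relevance) + 1)
    return float(np.sum(precision_at_k * relevance) / total_relevant)

# Worked example from the table above (Docs A-H, |R| = 4)
print(average_precision([1, 0, 1, 0, 1, 0, 1, 0], 4))   # 0.7095

# Early vs. late: same two relevant items, different positions
print(average_precision([1, 1, 0, 0], 2))                # 1.0
print(average_precision([0, 0, 1, 1], 2))                # (1/3 + 2/4) / 2 ≈ 0.4167
```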
If some relevant items don't appear in the ranked list at all (e.g., the list is truncated), AP treats them as having precision 0 at infinite rank. The denominator remains |R| (total relevant), not just the relevant items found. This penalizes systems that fail to retrieve all relevant items.
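For example, with $|R| = 4$ but only two relevant items retrieved, at ranks 1 and 3, $\text{AP} = \frac{1}{4}\left(1 + \frac{2}{3}\right) \approx 0.417$; dividing by the two retrieved items instead would inflate the score to $\approx 0.833$.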
One of the most beautiful properties of Average Precision is its geometric interpretation: AP approximates the area under the precision-recall curve.
The Precision-Recall Curve for Rankings:
As we traverse the ranked list from position 1 to $n$, we can plot recall $\text{R@}k$ on the x-axis against precision $\text{P@}k$ on the y-axis.
This traces out a stepped curve. Importantly, each relevant item moves the curve one step to the right (recall increases by $1/|R|$), while each irrelevant item pulls precision downward at the current recall level.
AP as a Riemann Sum:
The AP formula can be rewritten as:
$$\text{AP} = \sum_{k=1}^{n} \text{P@}k \cdot \Delta\text{R@}k$$
where $\Delta\text{R@}k = \text{R@}k - \text{R@}(k-1)$ is the change in recall at position $k$.
Since $\Delta\text{R@}k = \frac{\text{rel}_k}{|R|}$ (i.e., $1/|R|$ when relevant, 0 otherwise), this becomes:
$$\text{AP} = \sum_{k : \text{rel}_k=1} \text{P@}k \cdot \frac{1}{|R|} = \frac{1}{|R|} \sum_{k : \text{rel}_k=1} \text{P@}k$$
This is exactly the original AP formula—proving that AP computes the area under the stepped precision-recall curve using a right-endpoint Riemann sum.
```python
import numpy as np
import matplotlib.pyplot as plt

def compute_pr_curve_and_ap(relevance: list[int], total_relevant: int):
    """
    Compute precision-recall curve and demonstrate AP as area under curve.

    Args:
        relevance: Binary relevance of ranked items (1=relevant, 0=not)
        total_relevant: Total relevant items |R|

    Returns:
        Dictionary with precision, recall, AP, and area verification
    """
    n = len(relevance)
    relevance = np.array(relevance)

    # Compute precision and recall at each position
    cumsum_rel = np.cumsum(relevance)
    positions = np.arange(1, n + 1)
    precision = cumsum_rel / positions
    recall = cumsum_rel / total_relevant

    # Compute AP: sum of P@k at positions where item is relevant
    ap_terms = precision * relevance
    ap = ap_terms.sum() / total_relevant

    # Compute area under PR curve using rectangles:
    # for each relevant item, add a rectangle of width = delta_recall, height = precision
    delta_recall = relevance / total_relevant
    area_riemann = np.sum(precision * delta_recall)

    # Verify they match
    print(f"AP (formula):   {ap:.6f}")
    print(f"Area (Riemann): {area_riemann:.6f}")
    print(f"Match: {np.isclose(ap, area_riemann)}")

    return {
        'precision': precision,
        'recall': recall,
        'ap': ap,
        'area': area_riemann
    }

def plot_pr_curve_with_area(relevance: list[int], total_relevant: int):
    """Visualize PR curve and show AP as area under curve."""
    results = compute_pr_curve_and_ap(relevance, total_relevant)

    # Prepend (0, 1) point for a complete curve
    recall = np.concatenate([[0], results['recall']])
    precision = np.concatenate([[1], results['precision']])

    fig, ax = plt.subplots(figsize=(10, 6))

    # Plot stepped PR curve
    ax.step(recall, precision, where='post', linewidth=2, color='blue', label='PR Curve')

    # Fill area under curve
    ax.fill_between(recall, precision, step='post', alpha=0.3, color='blue',
                    label=f'AP = {results["ap"]:.4f}')

    # Mark points where relevant items occur
    rel_positions = np.where(np.array(relevance) == 1)[0]
    for pos in rel_positions:
        ax.scatter(results['recall'][pos], results['precision'][pos],
                   color='red', s=100, zorder=5)

    ax.set_xlabel('Recall', fontsize=12)
    ax.set_ylabel('Precision', fontsize=12)
    ax.set_title('Precision-Recall Curve with AP as Area Under Curve', fontsize=14)
    ax.set_xlim([0, 1.05])
    ax.set_ylim([0, 1.05])
    ax.legend(fontsize=11)
    ax.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig('pr_curve_with_ap.png', dpi=150)
    plt.show()

# Example: Demonstrate with a realistic ranking
relevance = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0]  # 6 relevant in top 15
total_relevant = 8  # But 8 relevant exist in total

print("Demonstrating AP as Area Under PR Curve")
print("=" * 50)
results = compute_pr_curve_and_ap(relevance, total_relevant)

# Note: we only found 6 of 8 relevant items.
# The 2 missing items contribute 0 to AP but increase the denominator.
print(f"\nRelevant found in ranking: {sum(relevance)}")
print(f"Total relevant (|R|): {total_relevant}")
print(f"Recall achieved: {results['recall'][-1]:.3f}")
```

Some implementations use 'interpolated AP', where precision at each recall level is replaced by the maximum precision at any recall level ≥ that value. This produces a non-decreasing precision curve. The 11-point interpolated AP (measuring at recall = 0.0, 0.1, 0.2, ..., 1.0) was common historically but is less used today. Modern evaluations typically use non-interpolated AP.
A ranking system handles many queries, each with its own ranked list and set of relevant items. Mean Average Precision (MAP) averages AP across all queries:
$$\text{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \text{AP}(q)$$
where $Q$ is the set of queries and $\text{AP}(q)$ is the average precision for query $q$.
Why Average, Not Sum?
Averaging normalizes for the number of queries, making MAP comparable across different test sets. A system of a given quality should score roughly the same MAP whether it is evaluated on 1,000 queries or on 100 queries of similar difficulty.
Weighting Considerations:
The standard MAP gives equal weight to each query. This means a query with a single relevant item counts exactly as much as a query with hundreds, and a rare query counts as much as one that dominates production traffic.
Some alternatives weight queries by frequency (traffic), by the number of relevant items, or by business importance.
The standard unweighted MAP remains most common in research settings.
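As an illustration only, here is a minimal sketch of a weighted variant; the `weights` argument and the traffic-based weighting are hypothetical, not a standard definition:

```python
import numpy as np

def weighted_map(ap_values, weights=None):
    """Weighted mean of per-query AP; uniform weights reproduce standard MAP."""
    ap_values = np.asarray(ap_values, dtype=float)
    if weights is None:
        weights = np.ones_like(ap_values)          # standard (unweighted) MAP
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * ap_values) / np.sum(weights))

ap = [0.9, 0.4, 0.7]
print(weighted_map(ap))                 # plain MAP = 0.6667
print(weighted_map(ap, [100, 10, 1]))   # weighted by hypothetical query traffic
```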
```python
import numpy as np
from typing import List, Dict, Any

def average_precision(relevance: List[int], total_relevant: int) -> float:
    """
    Compute Average Precision for a single query.

    Args:
        relevance: Binary relevance of ranked items [1, 0, 1, ...]
        total_relevant: Total relevant items |R| for this query

    Returns:
        AP value in [0, 1], or 0 if no relevant items
    """
    if total_relevant == 0:
        return 0.0

    relevance = np.array(relevance)
    n = len(relevance)

    # Cumulative sum of relevant items at each position
    cumsum = np.cumsum(relevance)

    # Precision at each position
    positions = np.arange(1, n + 1)
    precision = cumsum / positions

    # Sum precision at positions where item is relevant
    precision_sum = np.sum(precision * relevance)

    # Divide by total relevant (not just those found)
    return precision_sum / total_relevant

def mean_average_precision(query_results: List[Dict[str, Any]]) -> Dict[str, float]:
    """
    Compute MAP across multiple queries.

    Args:
        query_results: List of dicts with 'relevance' and 'total_relevant'

    Returns:
        Dict with MAP and per-query AP values
    """
    ap_values = []

    for qr in query_results:
        ap = average_precision(qr['relevance'], qr['total_relevant'])
        ap_values.append(ap)

    map_score = np.mean(ap_values)

    return {
        'MAP': map_score,
        'per_query_AP': ap_values,
        'num_queries': len(ap_values),
        'std_AP': np.std(ap_values),
        'min_AP': np.min(ap_values),
        'max_AP': np.max(ap_values)
    }

# Example: Multiple queries with varying difficulty
queries = [
    {
        'query_id': 'q1',
        'relevance': [1, 1, 0, 1, 0],        # Easy: 3 relevant, all found early
        'total_relevant': 3
    },
    {
        'query_id': 'q2',
        'relevance': [0, 0, 1, 0, 1, 0, 1],  # Medium: relevant items scattered
        'total_relevant': 4                  # Note: 1 relevant not in top 7
    },
    {
        'query_id': 'q3',
        'relevance': [0, 0, 0, 0, 1, 1],     # Hard: relevant items ranked low
        'total_relevant': 3                  # Note: 1 relevant not found
    },
    {
        'query_id': 'q4',
        'relevance': [1, 1, 1, 1, 1],        # Perfect: all relevant at top
        'total_relevant': 5
    }
]

print("Mean Average Precision Computation")
print("=" * 60)

for q in queries:
    ap = average_precision(q['relevance'], q['total_relevant'])
    found = sum(q['relevance'])
    total = q['total_relevant']
    print(f"{q['query_id']}: AP = {ap:.4f} (found {found}/{total} relevant)")

results = mean_average_precision(queries)
print(f"\nMAP = {results['MAP']:.4f}")
print(f"Std(AP) = {results['std_AP']:.4f}")
print(f"AP range: [{results['min_AP']:.4f}, {results['max_AP']:.4f}]")
```

| MAP Range | Interpretation | Quality Assessment |
|---|---|---|
| 0.9 - 1.0 | Exceptional: Most relevant items ranked first | State-of-the-art performance |
| 0.7 - 0.9 | Strong: Good precision across recall levels | Publication-quality results |
| 0.5 - 0.7 | Moderate: Room for improvement | Acceptable for many applications |
| 0.3 - 0.5 | Weak: Significant relevant items ranked poorly | Needs model improvement |
| < 0.3 | Poor: Little better than random | Fundamental issues |
Understanding the mathematical properties of AP and MAP helps predict their behavior and avoid misinterpretation.
Property 1: Bounds
$$0 \leq \text{AP} \leq 1$$
Property 2: Range of AP Given Partial Retrieval
If only $r$ of $|R|$ relevant items appear in the ranking:
$$\text{AP}_{\text{max}} = \frac{r}{|R|}$$ (achieved when the $r$ retrieved relevant items occupy positions 1 through $r$)
The $|R| - r$ missing items contribute 0 to the numerator but remain in the denominator.
Property 3: Sensitivity to Ranking Changes
Swapping adjacent items affects AP only when a relevant and an irrelevant item trade places: moving a relevant item up increases AP, moving it down decreases AP, and swapping two relevant (or two irrelevant) items leaves AP unchanged.
Property 4: Relationship to Break-Even Point
The Break-Even Point (BEP) is the recall level where precision equals recall. On the PR curve, this is where the curve crosses the diagonal. AP is influenced by but not equal to BEP.
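As an illustration, here is a minimal sketch of one way to locate the break-even point on the stepped curve; the function name and the closest-position heuristic are our own simplification:

```python
import numpy as np

def break_even_point(relevance, total_relevant):
    """Return (precision, recall) at the position where |P@k - R@k| is smallest."""
    relevance = np.asarray(relevance)
    cum = np.cumsum(relevance)
    precision = cum / np.arange(1, len(relevance) + 1)
    recall = cum / total_relevant
    k = int(np.argmin(np.abs(precision - recall)))   # position closest to P = R
    return precision[k], recall[k]

relevance = [1, 0, 1, 1, 0, 0, 1, 0]
p, r = break_even_point(relevance, 4)
print(f"BEP near P = {p:.3f}, R = {r:.3f}")   # P = R = 0.750 at rank 4
```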
```python
import numpy as np
from itertools import permutations

def ap_quick(relevance, total_relevant):
    """Quick AP computation."""
    relevance = np.array(relevance)
    if total_relevant == 0:
        return 0.0
    cumsum = np.cumsum(relevance)
    precision = cumsum / np.arange(1, len(relevance) + 1)
    return np.sum(precision * relevance) / total_relevant

# Property 1: Demonstrate bounds
print("Property 1: AP Bounds")
print("-" * 40)

# Perfect ranking: all relevant first
perfect = [1, 1, 1, 1, 0, 0, 0, 0]
print(f"Perfect ranking [1,1,1,1,0,0,0,0]: AP = {ap_quick(perfect, 4):.4f}")

# Worst possible: all relevant last
worst = [0, 0, 0, 0, 1, 1, 1, 1]
print(f"Worst ranking [0,0,0,0,1,1,1,1]: AP = {ap_quick(worst, 4):.4f}")

# Property 2: Partial retrieval bounds
print("\nProperty 2: Partial Retrieval")
print("-" * 40)

# Only 2 of 4 relevant found
partial = [1, 0, 1, 0, 0]  # Found 2 at positions 1 and 3
total_relevant = 4         # But 4 exist
ap_partial = ap_quick(partial, total_relevant)
ap_max_theoretical = 2 / 4  # Best if both at positions 1, 2
print(f"Found 2 of 4 relevant at positions 1,3: AP = {ap_partial:.4f}")
print(f"Max AP with 2 of 4 found (positions 1,2): {ap_quick([1,1,0,0,0], 4):.4f}")
print(f"Theoretical max = r/|R| = {ap_max_theoretical:.4f}")

# Property 3: Sensitivity to swaps
print("\nProperty 3: Sensitivity to Swaps")
print("-" * 40)

original = [1, 0, 1, 0, 1, 0]
print(f"Original [1,0,1,0,1,0]: AP = {ap_quick(original, 3):.4f}")

# Swap positions 1 and 2 (relevant <-> irrelevant)
swapped_rel_irrel = [0, 1, 1, 0, 1, 0]
print(f"Swap pos 1,2 [0,1,1,0,1,0]: AP = {ap_quick(swapped_rel_irrel, 3):.4f}")

# Swap positions 1 and 3 (relevant <-> relevant)
swapped_rel_rel = [1, 0, 1, 0, 1, 0]  # Same as original!
print(f"Swap two relevants: AP unchanged = {ap_quick(swapped_rel_rel, 3):.4f}")

# Swap positions 2 and 4 (irrelevant <-> irrelevant)
swapped_irrel_irrel = [1, 0, 1, 0, 1, 0]  # Same as original!
print(f"Swap two irrelevants: AP unchanged = {ap_quick(swapped_irrel_irrel, 3):.4f}")

# Property 4: Exhaustive search for a short list
print("\nProperty 4: AP Distribution Over All Rankings")
print("-" * 40)

# All rankings of 3 relevant + 2 irrelevant items
# This shows the range of possible AP values
items = [1, 1, 1, 0, 0]  # 3 relevant, 2 irrelevant
aps = set()

for perm in permutations(items):
    ap = ap_quick(list(perm), 3)
    aps.add(round(ap, 4))

aps = sorted(aps)
print("For 3 relevant + 2 irrelevant items:")
print(f"Number of distinct AP values: {len(aps)}")
print(f"Range: [{min(aps):.4f}, {max(aps):.4f}]")
print(f"All values: {aps}")
```

MAP exists within an ecosystem of ranking metrics. Understanding its relationships clarifies when to use each metric.
AP vs. P@k and R@k
| Aspect | P@k / R@k | AP |
|---|---|---|
| Cutoff dependency | Requires choosing k | No cutoff needed |
| Position weighting | All positions equal within k | Positions weighted by precision at that point |
| Single summary | Need to report multiple k | Single value summarizes all |
| Interpretability | Intuitive (fraction at k) | Less intuitive (average of precisions) |
AP vs. NDCG
| Aspect | AP | NDCG |
|---|---|---|
| Relevance model | Binary (relevant or not) | Graded (levels of relevance) |
| Position discounting | Implicit via precision | Explicit logarithmic discount |
| Normalization | By total relevant | By ideal ranking score |
| Use case | Document retrieval | Graded recommendations |
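NDCG is covered in depth on the next page; as a brief contrast, the sketch below (our own helper functions, using one common gain/discount formulation for NDCG) shows how AP collapses graded labels to binary while NDCG uses them directly:

```python
import numpy as np

def average_precision(binary_rel, total_relevant):
    binary_rel = np.asarray(binary_rel)
    p_at_k = np.cumsum(binary_rel) / np.arange(1, len(binary_rel) + 1)
    return float(np.sum(p_at_k * binary_rel) / total_relevant)

def ndcg(graded_rel):
    """NDCG with the common (2^rel - 1) / log2(rank + 1) formulation."""
    graded_rel = np.asarray(graded_rel, dtype=float)
    discounts = np.log2(np.arange(2, len(graded_rel) + 2))
    dcg = np.sum((2 ** graded_rel - 1) / discounts)
    idcg = np.sum((2 ** np.sort(graded_rel)[::-1] - 1) / discounts)
    return float(dcg / idcg) if idcg > 0 else 0.0

graded = [3, 0, 1, 2, 0]                       # graded relevance levels 0-3
binary = [1 if g > 0 else 0 for g in graded]   # collapse to binary for AP
print(f"AP   = {average_precision(binary, sum(binary)):.4f}")   # ignores grade magnitudes
print(f"NDCG = {ndcg(graded):.4f}")                             # rewards high grades early
```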
AP and the F1-Relationship
At each relevant position $k$, the precision $\text{P@}k$ and the corresponding recall $\text{R@}k$ are linked: every relevant item raises recall by $1/|R|$. AP samples precision exactly at these recall-improving positions, so, like F1, it rewards systems that are strong on both precision and recall rather than on one alone.
AP vs. MRR (Mean Reciprocal Rank)
| Aspect | AP | MRR |
|---|---|---|
| What it measures | All relevant items | Only first relevant item |
| Use case | Need all relevant items | Only one answer needed |
| Formula | Average of P@k at relevant positions | 1/rank of first relevant |
| Example domain | Document retrieval | Question answering |
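To make the AP vs. MRR contrast concrete, here is a small sketch (our own helper names) showing that MRR ignores everything after the first relevant item while AP does not:

```python
import numpy as np

def average_precision(relevance, total_relevant):
    relevance = np.asarray(relevance)
    p_at_k = np.cumsum(relevance) / np.arange(1, len(relevance) + 1)
    return float(np.sum(p_at_k * relevance) / total_relevant)

def reciprocal_rank(relevance):
    """1 / rank of the first relevant item (0 if none is retrieved)."""
    for k, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / k
    return 0.0

a = [1, 0, 0, 0, 0, 0]   # single relevant item at rank 1
b = [1, 0, 0, 0, 1, 1]   # first relevant at rank 1, two more buried late
print(reciprocal_rank(a), reciprocal_rank(b))                        # 1.0  1.0 (MRR sees no difference)
print(average_precision(a, 1), round(average_precision(b, 3), 4))    # 1.0  0.6333 (AP penalizes the buried items)
```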
Use MAP when: (1) relevance is binary, (2) all relevant items matter, (3) you want a single summary metric, (4) comparing against IR literature. Use NDCG when relevance is graded. Use MRR when only the first result matters. Use P@k/R@k when you need metrics at specific cutoffs that match your UI.
Implementing MAP correctly in production requires careful handling of edge cases, efficiency considerations, and proper statistical reporting.
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187
import numpy as npfrom typing import List, Set, Optional, Tuplefrom dataclasses import dataclassimport warnings @dataclassclass APResult: """Container for Average Precision result with diagnostics.""" ap: float precision_at_relevant: List[float] recall_at_relevant: List[float] relevant_found: int total_relevant: int ranking_length: int @dataclassclass MAPResult: """Container for MAP result with comprehensive statistics.""" map_score: float ap_values: List[float] num_queries: int std: float ci_lower: float ci_upper: float queries_with_no_relevant: int class MAPEvaluator: """ Production-grade MAP evaluator with comprehensive edge case handling. """ @staticmethod def average_precision( ranked_items: List[str], relevant_items: Set[str], cutoff: Optional[int] = None ) -> APResult: """ Compute Average Precision for a single query. Args: ranked_items: Items in ranked order (list of IDs) relevant_items: Set of relevant item IDs cutoff: Optional maximum rank to consider (None = all) Returns: APResult with AP value and diagnostics """ if len(relevant_items) == 0: warnings.warn("No relevant items for this query") return APResult( ap=0.0, precision_at_relevant=[], recall_at_relevant=[], relevant_found=0, total_relevant=0, ranking_length=len(ranked_items) ) # Apply cutoff if specified if cutoff is not None: ranked_items = ranked_items[:cutoff] # Convert to binary relevance array relevance = np.array([item in relevant_items for item in ranked_items], dtype=int) n = len(relevance) total_relevant = len(relevant_items) if n == 0: return APResult( ap=0.0, precision_at_relevant=[], recall_at_relevant=[], relevant_found=0, total_relevant=total_relevant, ranking_length=0 ) # Compute precision and recall at each position cumsum = np.cumsum(relevance) positions = np.arange(1, n + 1) precision = cumsum / positions recall = cumsum / total_relevant # Extract values at relevant positions relevant_mask = relevance == 1 precision_at_relevant = precision[relevant_mask].tolist() recall_at_relevant = recall[relevant_mask].tolist() # Compute AP ap = np.sum(precision * relevance) / total_relevant return APResult( ap=float(ap), precision_at_relevant=precision_at_relevant, recall_at_relevant=recall_at_relevant, relevant_found=int(cumsum[-1]) if n > 0 else 0, total_relevant=total_relevant, ranking_length=n ) @staticmethod def mean_average_precision( ap_results: List[APResult], confidence_level: float = 0.95 ) -> MAPResult: """ Compute MAP with confidence intervals. 
Args: ap_results: List of APResult objects from individual queries confidence_level: Confidence level for CI (default 95%) Returns: MAPResult with MAP and statistical information """ if len(ap_results) == 0: raise ValueError("Cannot compute MAP with no queries") ap_values = [r.ap for r in ap_results] ap_array = np.array(ap_values) # Count queries with no relevant items no_relevant = sum(1 for r in ap_results if r.total_relevant == 0) # Compute statistics map_score = float(np.mean(ap_array)) std = float(np.std(ap_array, ddof=1)) if len(ap_array) > 1 else 0.0 # Confidence interval using t-distribution n = len(ap_array) if n > 1: from scipy import stats t_value = stats.t.ppf((1 + confidence_level) / 2, n - 1) margin = t_value * std / np.sqrt(n) ci_lower = max(0.0, map_score - margin) ci_upper = min(1.0, map_score + margin) else: ci_lower = ci_upper = map_score return MAPResult( map_score=map_score, ap_values=ap_values, num_queries=n, std=std, ci_lower=ci_lower, ci_upper=ci_upper, queries_with_no_relevant=no_relevant ) # Demo usagedef demo(): evaluator = MAPEvaluator() # Simulate search engine evaluation test_queries = [ { 'ranked': ['d1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8'], 'relevant': {'d1', 'd3', 'd5', 'd7'} # AP should be high }, { 'ranked': ['d2', 'd4', 'd1', 'd6', 'd3', 'd8', 'd5', 'd7'], 'relevant': {'d1', 'd3', 'd5', 'd7'} # Same relevant, worse ranking }, { 'ranked': ['d1', 'd2', 'd3'], 'relevant': {'d1', 'd2', 'd3', 'd4', 'd5'} # Only partial retrieval }, ] ap_results = [] for i, q in enumerate(test_queries): result = evaluator.average_precision(q['ranked'], q['relevant']) print(f"Query {i+1}:") print(f" AP = {result.ap:.4f}") print(f" Found {result.relevant_found}/{result.total_relevant} relevant") print(f" Precision at relevant positions: {[f'{p:.3f}' for p in result.precision_at_relevant]}") ap_results.append(result) # Compute MAP map_result = evaluator.mean_average_precision(ap_results) print(f"\nMAP = {map_result.map_score:.4f} ± {1.96*map_result.std/np.sqrt(map_result.num_queries):.4f}") print(f"95% CI: [{map_result.ci_lower:.4f}, {map_result.ci_upper:.4f}]") if __name__ == "__main__": demo()When comparing two ranking systems, raw MAP differences aren't enough—we need to assess statistical significance.
Why Statistical Testing Matters:
If System A has MAP = 0.412 and System B has MAP = 0.398, is A truly better? The difference (0.014) could reflect a genuine improvement, or it could simply be noise arising from the particular sample of queries evaluated.
Common Statistical Tests for MAP:
- Paired t-test: compares mean AP differences across queries
- Wilcoxon signed-rank test: non-parametric alternative (see the sketch after this list)
- Bootstrap confidence intervals: resampling approach
- Randomization test: permutation-based significance (see the sketch after this list)
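The paired t-test and bootstrap are implemented in the larger code example below. As a complement, here is a hedged sketch of the Wilcoxon signed-rank test (via `scipy.stats.wilcoxon`) and a simple sign-flip randomization test; the helper name and permutation count are our own choices:

```python
import numpy as np
from scipy import stats

def wilcoxon_and_randomization(ap_a, ap_b, n_permutations=10000, seed=0):
    """Non-parametric alternatives to the paired t-test on per-query AP."""
    ap_a, ap_b = np.asarray(ap_a), np.asarray(ap_b)
    diffs = ap_a - ap_b

    # Wilcoxon signed-rank test on paired per-query AP values
    w_stat, w_p = stats.wilcoxon(ap_a, ap_b)

    # Randomization (sign-flip permutation) test on the mean difference
    rng = np.random.default_rng(seed)
    observed = abs(diffs.mean())
    signs = rng.choice([-1, 1], size=(n_permutations, len(diffs)))
    perm_means = np.abs((signs * diffs).mean(axis=1))
    perm_p = (np.sum(perm_means >= observed) + 1) / (n_permutations + 1)

    return {'wilcoxon_p': w_p, 'randomization_p': perm_p}

rng = np.random.default_rng(1)
ap_a = np.clip(rng.normal(0.55, 0.15, 50), 0, 1)
ap_b = np.clip(rng.normal(0.48, 0.18, 50), 0, 1)
print(wilcoxon_and_randomization(ap_a, ap_b))
```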
```python
import numpy as np
from scipy import stats

def compare_maps_paired_ttest(ap_system_a: list, ap_system_b: list) -> dict:
    """
    Compare two systems using a paired t-test on per-query AP.

    Args:
        ap_system_a: List of AP values for system A (one per query)
        ap_system_b: List of AP values for system B (one per query)

    Returns:
        Dictionary with test results
    """
    assert len(ap_system_a) == len(ap_system_b), "Must have same queries"

    ap_a = np.array(ap_system_a)
    ap_b = np.array(ap_system_b)
    differences = ap_a - ap_b

    # Paired t-test
    t_stat, p_value = stats.ttest_rel(ap_a, ap_b)

    # Effect size (Cohen's d for paired samples)
    d = np.mean(differences) / np.std(differences, ddof=1)

    return {
        'map_a': np.mean(ap_a),
        'map_b': np.mean(ap_b),
        'mean_difference': np.mean(differences),
        't_statistic': t_stat,
        'p_value': p_value,
        'cohens_d': d,
        'significant_05': p_value < 0.05,
        'significant_01': p_value < 0.01
    }

def compare_maps_bootstrap(ap_system_a: list, ap_system_b: list,
                           n_bootstrap: int = 10000) -> dict:
    """
    Compare two systems using bootstrap confidence intervals.
    """
    ap_a = np.array(ap_system_a)
    ap_b = np.array(ap_system_b)
    n = len(ap_a)

    # Bootstrap the difference in MAP
    rng = np.random.default_rng(42)
    bootstrap_diffs = []

    for _ in range(n_bootstrap):
        indices = rng.choice(n, size=n, replace=True)
        diff = np.mean(ap_a[indices]) - np.mean(ap_b[indices])
        bootstrap_diffs.append(diff)

    bootstrap_diffs = np.array(bootstrap_diffs)

    # Confidence intervals
    ci_95 = np.percentile(bootstrap_diffs, [2.5, 97.5])
    ci_99 = np.percentile(bootstrap_diffs, [0.5, 99.5])

    # "Significant" if CI doesn't contain 0
    sig_95 = ci_95[0] > 0 or ci_95[1] < 0

    return {
        'map_a': np.mean(ap_a),
        'map_b': np.mean(ap_b),
        'mean_difference': np.mean(ap_a) - np.mean(ap_b),
        'bootstrap_mean_diff': np.mean(bootstrap_diffs),
        'ci_95': tuple(ci_95),
        'ci_99': tuple(ci_99),
        'significant_95': sig_95
    }

# Example: Compare two search systems
np.random.seed(42)
n_queries = 50

# System A: Generally better, with some variance
ap_a = np.clip(np.random.normal(0.55, 0.15, n_queries), 0, 1)

# System B: Slightly worse
ap_b = np.clip(np.random.normal(0.48, 0.18, n_queries), 0, 1)

print("System Comparison: Paired t-test")
print("=" * 50)
ttest_result = compare_maps_paired_ttest(ap_a.tolist(), ap_b.tolist())
for key, value in ttest_result.items():
    if isinstance(value, float):
        print(f"{key}: {value:.4f}")
    else:
        print(f"{key}: {value}")

print("\nSystem Comparison: Bootstrap")
print("=" * 50)
boot_result = compare_maps_bootstrap(ap_a.tolist(), ap_b.tolist())
for key, value in boot_result.items():
    if isinstance(value, float):
        print(f"{key}: {value:.4f}")
    elif isinstance(value, tuple):
        print(f"{key}: ({value[0]:.4f}, {value[1]:.4f})")
    else:
        print(f"{key}: {value}")
```

When comparing multiple systems or multiple metrics, apply a correction for multiple comparisons (e.g., Bonferroni, Holm-Bonferroni, or FDR control). Testing 10 systems pairwise involves 45 comparisons; at α = 0.05 you would expect roughly 2 false positives without correction.
MAP is widely used but often miscomputed or misinterpreted. Critical points to avoid errors, all following from the definitions above:
- Divide by $|R|$ (all relevant items for the query), never by the number of relevant items actually retrieved; the sketch below contrasts the two.
- Decide explicitly how to handle queries with no relevant items (skip them or score them 0) and report which convention you used.
- State whether you report interpolated or non-interpolated AP; modern evaluations typically use the non-interpolated form.
- If the ranking is truncated at a cutoff, unretrieved relevant items still count in the denominator, so MAP at a cutoff and full MAP are not interchangeable.
- Report statistical significance (and correct for multiple comparisons) rather than raw MAP differences alone.
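Here is a minimal sketch of the first pitfall, contrasting the correct denominator with the common bug (the function names are ours):

```python
import numpy as np

def ap_correct(relevance, total_relevant):
    """Correct: denominator is |R|, the total number of relevant items."""
    relevance = np.asarray(relevance)
    p_at_k = np.cumsum(relevance) / np.arange(1, len(relevance) + 1)
    return float(np.sum(p_at_k * relevance) / total_relevant)

def ap_buggy(relevance):
    """Buggy: divides by the relevant items actually retrieved."""
    relevance = np.asarray(relevance)
    p_at_k = np.cumsum(relevance) / np.arange(1, len(relevance) + 1)
    return float(np.sum(p_at_k * relevance) / max(relevance.sum(), 1))

ranking = [1, 0, 1, 0, 0]        # 2 of 4 relevant items retrieved
print(ap_correct(ranking, 4))    # 0.4167 -- penalizes the 2 missing items
print(ap_buggy(ranking))         # 0.8333 -- silently inflated
```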
We have established Mean Average Precision as the gold-standard metric for ranked retrieval evaluation. Let's consolidate our understanding:
Mathematical Summary:
$$\text{AP}(q) = \frac{1}{|R_q|} \sum_{k=1}^{n} \text{P@}k \cdot \mathbb{1}[\text{rel}_k = 1]$$
$$\text{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \text{AP}(q)$$
What's Next:
With MAP established, we're ready to examine Normalized Discounted Cumulative Gain (NDCG)—a metric that extends to graded relevance and uses explicit position discounting, making it the standard for recommendation and graded retrieval evaluation.
You now understand Mean Average Precision as the gold-standard metric for binary-relevance ranking evaluation. This metric elegantly synthesizes precision and recall into a single summary statistic that rewards good rankings at all recall levels.