Not all search tasks are created equal. Sometimes users want to explore multiple relevant documents—browsing research papers, comparing products, or discovering new music. For these tasks, metrics like MAP and NDCG that evaluate entire ranked lists are appropriate.
But many search tasks have a fundamentally different character: the user seeks one specific answer, and finding it anywhere in the results—not just at position 1—constitutes success, albeit with varying degrees of satisfaction. Consider question answering ("Who wrote Romeo and Juliet?"), navigational search for a specific site, known-item retrieval in a library catalog, or entity lookup in a knowledge graph: one correct result is all the user needs.
For these tasks, Mean Reciprocal Rank (MRR) is the natural metric. It measures one thing elegantly: How far down the list must the user go to find the first relevant result?
By the end of this page, you will understand MRR from first principles—its definition, computation, statistical properties, and appropriate use cases. You will learn when MRR is the right choice, how it relates to other ranking metrics, and how to implement and interpret it correctly.
Before addressing Mean Reciprocal Rank, we define Reciprocal Rank (RR) for a single query.
Definition:
Let $\text{rank}_q$ denote the position (1-indexed) of the first relevant item in the ranked list for query $q$. The Reciprocal Rank is:
$$\text{RR}(q) = \frac{1}{\text{rank}_q}$$
If no relevant item appears in the list, RR is typically defined as 0.
Properties:
| Rank of First Relevant | Reciprocal Rank | User Experience |
|---|---|---|
| 1 | 1.000 | Perfect — answer at top |
| 2 | 0.500 | Good — answer visible without scrolling |
| 3 | 0.333 | Acceptable — requires some scanning |
| 5 | 0.200 | Marginal — requires scrolling |
| 10 | 0.100 | Poor — bottom of first page (typical 10-result page) |
| 20 | 0.050 | Very poor — second page |
| Not found | 0.000 | Failure — answer not in results |
The Intuition Behind Reciprocal:
Why $1/\text{rank}$ specifically? This choice has elegant properties:
Search-cost interpretation: If the user examines results sequentially and stops at the first relevant one, the number of items examined is exactly $\text{rank}_q$, so RR is the reciprocal of the user's search cost.
Discounting: Like NDCG, RR applies a position discount, but with $1/r$ rather than $1/\log(r)$. This is a steeper discount—position 2 is worth half of position 1, not 63%.
Parsimony: The reciprocal is the simplest monotonically decreasing function of rank that equals 1 at rank 1 and stays within $(0, 1]$.
Example Calculation:
For a ranked list: [irrelevant, irrelevant, relevant, irrelevant, relevant]
The first relevant item is at position 3, so: $$\text{RR} = \frac{1}{3} \approx 0.333$$
Note that the relevant item at position 5 doesn't affect RR—only the first matters.
```python
import numpy as np
from typing import Optional, List, Set, Any

def reciprocal_rank(relevance: List[int]) -> float:
    """
    Compute Reciprocal Rank for a single query.

    Args:
        relevance: Binary relevance in ranked order (1=relevant, 0=not)

    Returns:
        RR value in [0, 1], or 0 if no relevant items
    """
    for i, rel in enumerate(relevance):
        if rel == 1:
            return 1.0 / (i + 1)  # i is 0-indexed, rank is 1-indexed
    return 0.0

def reciprocal_rank_from_items(
    ranked_items: List[Any],
    relevant_items: Set[Any]
) -> float:
    """
    Compute RR given ranked item IDs and relevant item set.

    Args:
        ranked_items: Item IDs in ranked order
        relevant_items: Set of relevant item IDs

    Returns:
        RR value
    """
    for i, item in enumerate(ranked_items):
        if item in relevant_items:
            return 1.0 / (i + 1)
    return 0.0

# Examples
print("Reciprocal Rank Examples")
print("=" * 50)

# Case 1: First result is relevant
rr1 = reciprocal_rank([1, 0, 0, 1, 0])
print(f"[1, 0, 0, 1, 0] → RR = {rr1:.4f} (first relevant at rank 1)")

# Case 2: Third result is first relevant
rr2 = reciprocal_rank([0, 0, 1, 0, 1])
print(f"[0, 0, 1, 0, 1] → RR = {rr2:.4f} (first relevant at rank 3)")

# Case 3: Relevant at position 10
rr3 = reciprocal_rank([0]*9 + [1])
print(f"[0]*9 + [1] → RR = {rr3:.4f} (first relevant at rank 10)")

# Case 4: No relevant items
rr4 = reciprocal_rank([0, 0, 0, 0, 0])
print(f"[0, 0, 0, 0, 0] → RR = {rr4:.4f} (no relevant items)")

# Using item IDs
print("\nWith Item IDs:")
ranked = ['doc_a', 'doc_b', 'doc_c', 'doc_d', 'doc_e']
relevant = {'doc_c', 'doc_e'}
rr5 = reciprocal_rank_from_items(ranked, relevant)
print(f"Ranked: {ranked}")
print(f"Relevant: {relevant}")
print(f"RR = {rr5:.4f} (doc_c found at rank 3)")
```

Mean Reciprocal Rank (MRR) is simply the arithmetic mean of Reciprocal Ranks across all queries:
$$\text{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \text{RR}(q) = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\text{rank}_q}$$
where $Q$ is the set of queries and $\text{rank}_q$ is the rank of the first relevant result for query $q$.
Properties of MRR:
```python
import numpy as np
from typing import List

def mean_reciprocal_rank(query_relevances: List[List[int]]) -> dict:
    """
    Compute MRR across multiple queries.

    Args:
        query_relevances: List of relevance lists (one per query)

    Returns:
        Dict with MRR and diagnostics
    """
    rr_values = []
    ranks = []

    for relevance in query_relevances:
        rr = 0.0
        rank = float('inf')
        for i, rel in enumerate(relevance):
            if rel == 1:
                rank = i + 1
                rr = 1.0 / rank
                break
        rr_values.append(rr)
        ranks.append(rank)

    return {
        'MRR': np.mean(rr_values),
        'per_query_RR': rr_values,
        'per_query_rank': ranks,
        'queries_without_relevant': sum(1 for r in ranks if r == float('inf')),
        'num_queries': len(query_relevances)
    }

# Example: Evaluate a QA system
queries = [
    # Query 1: Perfect - answer at position 1
    [1, 0, 0, 0, 0],
    # Query 2: Answer at position 3
    [0, 0, 1, 0, 1],
    # Query 3: Answer at position 2
    [0, 1, 1, 0, 0],
    # Query 4: Answer at position 5
    [0, 0, 0, 0, 1],
    # Query 5: No answer found
    [0, 0, 0, 0, 0],
]

result = mean_reciprocal_rank(queries)

print("MRR Computation Example")
print("=" * 60)
print(f"Number of queries: {result['num_queries']}")
print(f"Queries without relevant: {result['queries_without_relevant']}")
print()

print(f"{'Query':<10} {'First Relevant Rank':<22} {'RR':<10}")
print("-" * 45)
for i, (rr, rank) in enumerate(zip(result['per_query_RR'], result['per_query_rank'])):
    rank_str = str(rank) if rank != float('inf') else "Not found"
    print(f"Query {i+1:<4} {rank_str:<22} {rr:.4f}")

print("-" * 45)
print(f"MRR = {result['MRR']:.4f}")

# Breakdown of MRR calculation
print(f"\nMRR = ({' + '.join([f'{rr:.4f}' for rr in result['per_query_RR']])}) / {result['num_queries']}")
print(f"    = {sum(result['per_query_RR']):.4f} / {result['num_queries']}")
print(f"    = {result['MRR']:.4f}")
```

MRR is a simple arithmetic mean, giving equal weight to each query. In practice, some queries may be more important (higher frequency, higher business value). Weighted variants exist (discussed later on this page), but standard MRR remains the most common choice in research.
Understanding when to use MRR versus alternatives is crucial for correct evaluation.
MRR vs. MAP:
| Aspect | MRR | MAP |
|---|---|---|
| Focus | First relevant item only | All relevant items |
| Assumption | One answer suffices | Multiple answers valuable |
| Sensitivity | Rank of first relevant | Ranks of all relevant |
| Use case | QA, navigational search | Document retrieval, research |
MRR vs. NDCG:
| Aspect | MRR | NDCG |
|---|---|---|
| Relevance model | Binary (is there a relevant item?) | Graded (how relevant?) |
| What counts | Only first relevant | All items, weighted by position |
| Position discount | $1/\text{rank}$ (steep) | $1/\log(\text{rank})$ (gentle) |
| Use case | Single-answer tasks | Ranked browsing tasks |
MRR vs. Success@k (Hit@k):
Success@k (also called Hit Rate@k) is a binary metric: 1 if any relevant item appears in top-k, 0 otherwise.
| Aspect | MRR | Success@k |
|---|---|---|
| Position sensitivity | Yes (reciprocal) | No (binary at k) |
| Information content | Rich (continuous) | Sparse (0 or 1) |
| Statistical efficiency | Higher | Lower |
| Use case | Most single-answer tasks | Very rough evaluation |
```python
import numpy as np

def compare_metrics(relevance: list) -> dict:
    """Compare MRR with other metrics for a single query."""
    rel = np.array(relevance)
    n = len(rel)

    # MRR (from first relevant)
    mrr = 0.0
    for i, r in enumerate(relevance):
        if r > 0:
            mrr = 1.0 / (i + 1)
            break

    # Success@k for various k
    success = {k: 1 if np.sum(rel[:k]) > 0 else 0 for k in [1, 3, 5, 10]}

    # MAP (treating any rel > 0 as relevant)
    binary = (rel > 0).astype(float)
    total_rel = np.sum(binary)
    if total_rel > 0:
        cumsum = np.cumsum(binary)
        precisions = cumsum / np.arange(1, n + 1)
        ap = np.sum(precisions * binary) / total_rel
    else:
        ap = 0.0

    # NDCG (with binary relevance)
    positions = np.arange(1, n + 1)
    discounts = 1.0 / np.log2(positions + 1)
    gains = binary  # Using binary for fair comparison
    dcg = np.sum(gains * discounts)
    ideal_gains = np.sort(gains)[::-1]
    idcg = np.sum(ideal_gains * discounts)
    ndcg = dcg / idcg if idcg > 0 else 0.0

    return {
        'MRR': mrr,
        'Success@1': success[1],
        'Success@3': success[3],
        'Success@5': success[5],
        'AP': ap,
        'NDCG': ndcg
    }

# Different scenarios
scenarios = [
    {
        'name': 'Perfect: answer at rank 1',
        'relevance': [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
    },
    {
        'name': 'Good: answer at rank 2',
        'relevance': [0, 1, 0, 1, 0, 0, 0, 0, 0, 0]
    },
    {
        'name': 'Marginal: answer at rank 5',
        'relevance': [0, 0, 0, 0, 1, 1, 0, 0, 0, 0]
    },
    {
        'name': 'Many relevant, first at rank 3',
        'relevance': [0, 0, 1, 1, 1, 1, 1, 0, 0, 0]
    },
    {
        'name': 'One relevant at rank 10',
        'relevance': [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
    }
]

print("Comparison of Ranking Metrics Across Scenarios")
print("=" * 75)

for scenario in scenarios:
    metrics = compare_metrics(scenario['relevance'])
    print(f"\n{scenario['name']}")
    print(f"Relevance: {scenario['relevance']}")
    print(f"{'Metric':<15} {'Value':<10}")
    print("-" * 25)
    for metric, value in metrics.items():
        print(f"{metric:<15} {value:.4f}")

print("\n" + "=" * 75)
print("Key Observations:")
print("- MRR only cares about the first relevant item's position")
print("- Success@k is binary and loses position information within k")
print("- AP considers all relevant items and their positions")
print("- NDCG uses log discount (gentler than MRR's 1/rank)")
```

Use MRR when: (1) users need exactly one answer (QA, entity lookup); (2) finding the first relevant result constitutes success; (3) you want position sensitivity, but only for the first result. Avoid MRR when: (1) users benefit from multiple relevant items; (2) relevance is graded; (3) you need to evaluate recall or coverage.
Understanding MRR's statistical properties helps with interpretation and significance testing.
Expected Value Under Random Ranking:
If there are $R$ relevant items among $N$ total items, and items are ranked uniformly at random, what is the expected rank of the first relevant item?
For the first relevant item's position, we have:
$$\mathbb{E}[\text{rank}_1] = \frac{N + 1}{R + 1}$$
For sparse relevance ($R \ll N$), treating each position as relevant independently with probability $R/N$ gives a rough approximation for the expected RR under random ranking:

$$\mathbb{E}[\text{RR}] \approx \frac{R}{N} \ln\frac{N}{R}$$

In the special case $R = 1$, the exact value is $\mathbb{E}[\text{RR}] = H_N / N$, where $H_N = \sum_{i=1}^{N} 1/i$ is the $N$-th harmonic number. Either way, the random-ranking baseline is very small, which matters when interpreting absolute MRR values.
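As a sanity check, a small Monte Carlo simulation can be compared against these formulas. This is a sketch: the `simulate_random_rr` helper and the specific $(N, R)$ pairs below are illustrative choices, not values from the text.

```python
import numpy as np

def simulate_random_rr(N: int, R: int, n_trials: int = 20_000, seed: int = 0) -> float:
    """Estimate E[RR] when R relevant items are placed uniformly at random among N items."""
    rng = np.random.default_rng(seed)
    # Rank of the first relevant item = minimum of R distinct positions drawn from 1..N
    first_ranks = np.array([
        rng.choice(N, size=R, replace=False).min() + 1 for _ in range(n_trials)
    ])
    return float(np.mean(1.0 / first_ranks))

for N, R in [(100, 1), (100, 5), (1000, 10)]:
    sim = simulate_random_rr(N, R)
    approx = (R / N) * np.log(N / R)                   # sparse-relevance approximation
    exact_r1 = np.sum(1.0 / np.arange(1, N + 1)) / N   # exact value, but only when R == 1
    line = f"N={N:5d} R={R:2d}  simulated E[RR]={sim:.4f}  (R/N)ln(N/R)={approx:.4f}"
    if R == 1:
        line += f"  H_N/N={exact_r1:.4f}"
    print(line)
```

The approximation is crude (it ignores that relevant items occupy distinct positions), but it captures the order of magnitude of the random baseline.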
Variance:
RR has high variance because it's determined by a single position. A system might have RR = 1 for one query (rank 1) and RR = 0.1 for another (rank 10). This makes MRR estimates noisy with small query sets.
Confidence Intervals:
Given the skewed distribution of RR values (clustered near 1 or near 0), bootstrap confidence intervals are preferred over parametric approaches:
```python
import numpy as np
from scipy import stats

def mrr_with_bootstrap_ci(
    rr_values: list,
    n_bootstrap: int = 10000,
    confidence_level: float = 0.95
) -> dict:
    """
    Compute MRR with bootstrap confidence interval.

    Args:
        rr_values: List of RR values (one per query)
        n_bootstrap: Number of bootstrap samples
        confidence_level: Confidence level for CI

    Returns:
        Dict with MRR, CI, and statistics
    """
    rr = np.array(rr_values)
    n = len(rr)

    # Point estimate
    mrr = np.mean(rr)

    # Bootstrap
    rng = np.random.default_rng(42)
    bootstrap_mrrs = []
    for _ in range(n_bootstrap):
        sample_indices = rng.choice(n, size=n, replace=True)
        bootstrap_mrr = np.mean(rr[sample_indices])
        bootstrap_mrrs.append(bootstrap_mrr)

    bootstrap_mrrs = np.array(bootstrap_mrrs)

    # Confidence interval (percentile method)
    alpha = 1 - confidence_level
    ci_lower = np.percentile(bootstrap_mrrs, 100 * alpha / 2)
    ci_upper = np.percentile(bootstrap_mrrs, 100 * (1 - alpha / 2))

    # Standard error
    se = np.std(rr, ddof=1) / np.sqrt(n)

    return {
        'MRR': mrr,
        'SE': se,
        'CI_lower': ci_lower,
        'CI_upper': ci_upper,
        'Bootstrap_std': np.std(bootstrap_mrrs),
        'n_queries': n
    }

def compare_mrr_systems(rr_a: list, rr_b: list) -> dict:
    """
    Compare two systems using paired bootstrap test.

    Tests whether system A has significantly different MRR from system B.
    """
    rr_a = np.array(rr_a)
    rr_b = np.array(rr_b)
    assert len(rr_a) == len(rr_b), "Must have same queries"
    n = len(rr_a)

    # Observed difference
    observed_diff = np.mean(rr_a) - np.mean(rr_b)

    # Paired bootstrap
    rng = np.random.default_rng(42)
    n_bootstrap = 10000
    bootstrap_diffs = []
    for _ in range(n_bootstrap):
        indices = rng.choice(n, size=n, replace=True)
        diff = np.mean(rr_a[indices]) - np.mean(rr_b[indices])
        bootstrap_diffs.append(diff)

    bootstrap_diffs = np.array(bootstrap_diffs)

    # CI for difference
    ci_lower = np.percentile(bootstrap_diffs, 2.5)
    ci_upper = np.percentile(bootstrap_diffs, 97.5)

    # Significant if CI doesn't contain 0
    significant = ci_lower > 0 or ci_upper < 0

    return {
        'MRR_A': np.mean(rr_a),
        'MRR_B': np.mean(rr_b),
        'Difference': observed_diff,
        'CI_difference': (ci_lower, ci_upper),
        'Significant_at_95': significant
    }

# Example: Evaluate QA system performance
np.random.seed(42)

# Simulate RR values for 100 queries
n_queries = 100
rr_values = []
for _ in range(n_queries):
    # 60% of queries have answer in top 3
    # 30% have answer in positions 4-10
    # 10% have no answer found
    p = np.random.random()
    if p < 0.6:
        rank = np.random.choice([1, 2, 3], p=[0.5, 0.3, 0.2])
    elif p < 0.9:
        rank = np.random.choice(range(4, 11))
    else:
        rank = float('inf')
    rr = 1/rank if rank != float('inf') else 0
    rr_values.append(rr)

result = mrr_with_bootstrap_ci(rr_values)

print("MRR with Bootstrap Confidence Interval")
print("=" * 50)
print(f"Number of queries: {result['n_queries']}")
print(f"MRR: {result['MRR']:.4f}")
print(f"Standard Error: {result['SE']:.4f}")
print(f"95% CI: [{result['CI_lower']:.4f}, {result['CI_upper']:.4f}]")
print(f"Bootstrap Std: {result['Bootstrap_std']:.4f}")

# Compare two systems
print("\nComparing Two QA Systems")
print("=" * 50)
# System A is slightly better
rr_system_a = rr_values
# System B is same with some noise
rr_system_b = [max(0, rr - 0.05 + 0.1 * np.random.random()) for rr in rr_values]

comparison = compare_mrr_systems(rr_system_a, rr_system_b)
print(f"System A MRR: {comparison['MRR_A']:.4f}")
print(f"System B MRR: {comparison['MRR_B']:.4f}")
print(f"Difference (A - B): {comparison['Difference']:.4f}")
print(f"95% CI for difference: ({comparison['CI_difference'][0]:.4f}, {comparison['CI_difference'][1]:.4f})")
print(f"Significant at 95%: {comparison['Significant_at_95']}")
```

Several variants of MRR address specific needs:
1. MRR@k (Truncated MRR):
Only consider the first k positions; if the first relevant item is after position k, treat RR as 0:
$$\text{RR@}k(q) = \begin{cases} 1/\text{rank}_q & \text{if } \text{rank}_q \leq k \\ 0 & \text{otherwise} \end{cases}$$
This is useful when users won't scroll past position k (e.g., MRR@10 for web search).
2. Weighted MRR:
Weight queries by importance (frequency, revenue, etc.):
$$\text{Weighted MRR} = \frac{\sum_q w_q \cdot \text{RR}(q)}{\sum_q w_q}$$
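A minimal sketch of the weighted variant, assuming per-query weights are available; the `weighted_mrr` helper and the RR values and traffic counts below are hypothetical illustrations.

```python
import numpy as np

def weighted_mrr(rr_values, weights):
    """Weighted MRR: each query's RR contributes in proportion to its weight."""
    rr = np.asarray(rr_values, dtype=float)
    w = np.asarray(weights, dtype=float)
    return float(np.sum(w * rr) / np.sum(w))

# Hypothetical example: weight each query by how often it is issued
rr_values = [1.0, 0.5, 0.0, 0.25]   # per-query reciprocal ranks
frequencies = [1000, 200, 50, 10]   # query traffic counts used as weights

print(f"Unweighted MRR: {np.mean(rr_values):.4f}")   # 0.4375
print(f"Weighted MRR:   {weighted_mrr(rr_values, frequencies):.4f}")  # ~0.875
```

Because head queries dominate the weights, a system that performs well on frequent queries scores much higher under the weighted variant than under standard MRR.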
3. Generalized Reciprocal Rank:
Replace $1/\text{rank}$ with a custom discount function $f(\text{rank})$:
$$\text{GRR}(q) = f(\text{rank}_q)$$
Common choices include the logarithmic discount $1/\log_2(r+1)$, the square-root discount $1/\sqrt{r}$, and exponential decay $e^{-(r-1)/\tau}$ (these are the discounts compared in the code below).
4. ERR (Expected Reciprocal Rank):
A cascade model where the probability of stopping depends on satisfaction at each position:
$$\text{ERR} = \sum_{r=1}^{n} \frac{1}{r} \, R_r \prod_{i=1}^{r-1} (1 - R_i)$$
where $R_i$ is the probability of satisfaction at position i (derived from graded relevance). ERR is sometimes confused with MRR but models different user behavior.
```python
import numpy as np
from typing import Callable, List

def mrr_at_k(rr_values: List[float], ranks: List[int], k: int) -> float:
    """
    Compute MRR@k: only count RR if first relevant is within top-k.

    Args:
        rr_values: Original RR values
        ranks: Rank of first relevant item for each query
        k: Cutoff position

    Returns:
        MRR@k value
    """
    truncated_rr = [
        rr if (r <= k and r != float('inf')) else 0.0
        for rr, r in zip(rr_values, ranks)
    ]
    return np.mean(truncated_rr)

def generalized_rr(rank: int, discount_fn: Callable[[int], float]) -> float:
    """
    Compute generalized reciprocal rank with custom discount.

    Args:
        rank: Rank of first relevant item (inf if none found)
        discount_fn: Function mapping rank to score

    Returns:
        Generalized RR value
    """
    if rank == float('inf'):
        return 0.0
    return discount_fn(rank)

def expected_reciprocal_rank(relevance: List[float], max_rel: float = 4) -> float:
    """
    Compute ERR (Expected Reciprocal Rank).

    Models cascade where user stops when satisfied.

    Args:
        relevance: Graded relevance scores in ranked order
        max_rel: Maximum relevance grade (for normalization)

    Returns:
        ERR value
    """
    # Convert relevance to satisfaction probability
    # R_i = (2^rel_i - 1) / 2^max_rel
    R = [(2**rel - 1) / (2**max_rel) for rel in relevance]

    err = 0.0
    cascade_prob = 1.0  # Probability of reaching this position

    for i, r_i in enumerate(R):
        position = i + 1
        # Contribution: (1/rank) * P(reach position) * P(satisfied here)
        err += (1 / position) * cascade_prob * r_i
        # Update cascade probability
        cascade_prob *= (1 - r_i)

    return err

# Examples
print("MRR Variants Demonstration")
print("=" * 60)

# Sample data: ranks of first relevant item for 10 queries
ranks = [1, 3, 2, 15, 5, 1, 8, float('inf'), 2, 6]
rr_values = [1/r if r != float('inf') else 0.0 for r in ranks]

print("Query ranks of first relevant:", ranks)
print("RR values:", [f"{rr:.3f}" for rr in rr_values])
print()

# Standard MRR
mrr_all = np.mean(rr_values)
print(f"Standard MRR: {mrr_all:.4f}")

# MRR@k for various k
for k in [3, 5, 10]:
    mrr_k = mrr_at_k(rr_values, ranks, k)
    print(f"MRR@{k}: {mrr_k:.4f}")

# Generalized RR with different discounts
print("\nGeneralized RR with Different Discounts (for rank=3):")
print("-" * 50)
rank = 3

discounts = {
    'Standard (1/r)': lambda r: 1/r,
    'Log (1/log2(r+1))': lambda r: 1/np.log2(r+1),
    'Sqrt (1/sqrt(r))': lambda r: 1/np.sqrt(r),
    'Exponential (e^(-(r-1)/3))': lambda r: np.exp(-(r-1)/3)
}

for name, fn in discounts.items():
    grr = generalized_rr(rank, fn)
    print(f"{name:<30}: {grr:.4f}")

# ERR example
print("\nERR (Expected Reciprocal Rank):")
print("-" * 50)
graded_relevance = [3, 1, 4, 0, 2]  # Graded on 0-4 scale
print(f"Graded relevance: {graded_relevance}")
err = expected_reciprocal_rank(graded_relevance, max_rel=4)
print(f"ERR: {err:.4f}")
```

MRR is the standard metric in several important domains:
| Domain | What 'Relevant' Means | Typical MRR Values | Notes |
|---|---|---|---|
| Question Answering | Contains correct answer | 0.3 - 0.6 (extractive QA) | Common in retrieval-based QA; MS MARCO passage ranking reports MRR@10 |
| Entity Linking | Correct entity identified | 0.4 - 0.8 | Knowledge graph linking |
| Navigational Search | User's intended page | 0.7 - 0.9 | Brand/site queries |
| Code Completion | User accepts suggestion | 0.3 - 0.5 | IDE autocomplete |
| Known-Item Retrieval | The specific document sought | 0.5 - 0.8 | Library search |
| Slot Filling | Correct value extracted | 0.4 - 0.7 | Information extraction |
Case Study: Question Answering Systems
In reading comprehension QA (e.g., SQuAD), the task is to find a text span that answers a question. The system returns candidate spans ranked by confidence, and MRR measures how quickly the correct span appears in that ranked list.
Case Study: Knowledge Graph Completion
In link prediction for knowledge graphs, given (subject, relation, ?), the system ranks candidate entities, and MRR measures how well the true entity is ranked among the full entity vocabulary. Reported values often sit in the 0.2-0.4 range; these seemingly low numbers reflect the difficulty of ranking among thousands of entities.
When MRR Values Seem Low:
MRR values in the 0.2-0.4 range are common in difficult tasks. Context matters:
MRR = 0.25 might seem poor, but if you're ranking among 10,000 candidates, it means the correct answer is typically in the top 4 positions—vastly better than random (rank ~5000). Always contextualize MRR against the baseline and candidate pool size.
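To make that comparison concrete, here is a back-of-the-envelope sketch. It assumes a single relevant item per query; the pool size of 10,000 and the MRR of 0.25 are the illustrative numbers from the paragraph above, not results from any system.

```python
import numpy as np

N = 10_000            # candidate pool size (illustrative)
observed_mrr = 0.25   # hypothetical system MRR

# Random baseline with one relevant item: E[rank] = (N+1)/2, E[RR] = H_N / N
random_expected_rank = (N + 1) / 2
random_mrr = np.sum(1.0 / np.arange(1, N + 1)) / N

print(f"Random baseline: expected rank ≈ {random_expected_rank:.0f}, MRR ≈ {random_mrr:.5f}")
print(f"Observed MRR = {observed_mrr} corresponds to a 'typical' first-relevant rank of ~{1 / observed_mrr:.0f}")
print(f"Improvement over the random-ranking MRR: ~{observed_mrr / random_mrr:.0f}x")
```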
```python
import numpy as np
from typing import List, Set, Any, Optional, Dict
from dataclasses import dataclass

@dataclass
class RRResult:
    """Container for Reciprocal Rank result."""
    rr: float
    rank_of_first: Optional[int]  # None if no relevant found
    found: bool

@dataclass
class MRRResult:
    """Container for MRR computation result."""
    mrr: float
    num_queries: int
    queries_with_result: int
    per_query_rr: List[float]
    per_query_rank: List[Optional[int]]
    ci_lower: float
    ci_upper: float

class MRREvaluator:
    """
    Production-grade MRR evaluator with comprehensive features.
    """

    def __init__(self, cutoff_k: Optional[int] = None):
        """
        Initialize MRR evaluator.

        Args:
            cutoff_k: Maximum rank to consider (None = unlimited)
        """
        self.cutoff_k = cutoff_k

    def reciprocal_rank(
        self,
        ranked_items: List[Any],
        relevant_items: Set[Any]
    ) -> RRResult:
        """
        Compute RR for a single query.

        Args:
            ranked_items: Items in ranked order
            relevant_items: Set of relevant items

        Returns:
            RRResult with RR value and diagnostics
        """
        max_pos = len(ranked_items)
        if self.cutoff_k is not None:
            max_pos = min(max_pos, self.cutoff_k)

        for i in range(max_pos):
            if ranked_items[i] in relevant_items:
                rank = i + 1
                return RRResult(
                    rr=1.0 / rank,
                    rank_of_first=rank,
                    found=True
                )

        return RRResult(rr=0.0, rank_of_first=None, found=False)

    def mean_reciprocal_rank(
        self,
        queries: List[Dict[str, Any]],
        n_bootstrap: int = 10000
    ) -> MRRResult:
        """
        Compute MRR across queries with bootstrap CI.

        Args:
            queries: List of dicts with 'ranked_items' and 'relevant_items'
            n_bootstrap: Bootstrap samples for CI

        Returns:
            MRRResult with MRR, CI, and per-query values
        """
        rr_results = [
            self.reciprocal_rank(q['ranked_items'], q['relevant_items'])
            for q in queries
        ]

        rr_values = np.array([r.rr for r in rr_results])
        ranks = [r.rank_of_first for r in rr_results]
        n = len(rr_values)

        mrr = np.mean(rr_values)
        queries_found = sum(1 for r in rr_results if r.found)

        # Bootstrap CI
        rng = np.random.default_rng(42)
        bootstrap_mrrs = []
        for _ in range(n_bootstrap):
            indices = rng.choice(n, size=n, replace=True)
            bootstrap_mrrs.append(np.mean(rr_values[indices]))

        ci_lower = np.percentile(bootstrap_mrrs, 2.5)
        ci_upper = np.percentile(bootstrap_mrrs, 97.5)

        return MRRResult(
            mrr=float(mrr),
            num_queries=n,
            queries_with_result=queries_found,
            per_query_rr=rr_values.tolist(),
            per_query_rank=ranks,
            ci_lower=float(ci_lower),
            ci_upper=float(ci_upper)
        )

# Demo: Evaluate a QA system
def demo():
    evaluator = MRREvaluator(cutoff_k=10)  # Only consider top 10

    # Simulated QA results
    queries = [
        {
            'question': 'What is the capital of France?',
            'ranked_items': ['Paris', 'Lyon', 'Marseille', 'Nice', 'Bordeaux'],
            'relevant_items': {'Paris'}
        },
        {
            'question': 'Who wrote Romeo and Juliet?',
            'ranked_items': ['Marlowe', 'Shakespeare', 'Jonson', 'Bacon', 'Oxford'],
            'relevant_items': {'Shakespeare'}
        },
        {
            'question': 'What year did WW2 end?',
            'ranked_items': ['1944', '1946', '1943', '1945', '1947'],
            'relevant_items': {'1945'}
        },
        {
            'question': 'Capital of Liechtenstein?',  # Harder question
            'ranked_items': ['Bern', 'Vienna', 'Zurich', 'Munich', 'Vaduz'],
            'relevant_items': {'Vaduz'}
        },
        {
            'question': 'Obscure trivia?',
            'ranked_items': ['wrong1', 'wrong2', 'wrong3', 'wrong4', 'wrong5'],
            'relevant_items': {'correct_answer'}  # Not in list
        }
    ]

    result = evaluator.mean_reciprocal_rank(queries)

    print("QA System Evaluation with MRR")
    print("=" * 60)
    print(f"{'Query':<25} {'Rank':<10} {'RR':<10}")
    print("-" * 45)
    for i, (q, rr, rank) in enumerate(zip(
        queries, result.per_query_rr, result.per_query_rank
    )):
        q_short = q['question'][:22] + '...' if len(q['question']) > 25 else q['question']
        rank_str = str(rank) if rank else 'Not found'
        print(f"{q_short:<25} {rank_str:<10} {rr:.4f}")

    print("-" * 45)
    print(f"MRR: {result.mrr:.4f}")
    print(f"95% CI: [{result.ci_lower:.4f}, {result.ci_upper:.4f}]")
    print(f"Queries with answer found: {result.queries_with_result}/{result.num_queries}")

if __name__ == "__main__":
    demo()
```

We have established MRR as the natural metric for single-answer retrieval tasks. Let's consolidate our understanding:
Mathematical Summary:
$$\text{RR}(q) = \frac{1}{\text{rank}_q}$$ where $\text{rank}_q$ is the position of the first relevant item.
$$\text{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \text{RR}(q)$$
What's Next:
With MRR established for single-answer retrieval, we turn to partial AUC (pAUC)—a metric that focuses evaluation on specific regions of the ROC curve, useful when only a portion of the operating range is practically relevant.
You now understand Mean Reciprocal Rank as the standard metric for single-answer retrieval tasks. MRR elegantly captures the quality of the first relevant result's position, which is precisely what matters in question answering, navigational search, and similar applications.