Not all search tasks are created equal. Sometimes users want to explore multiple relevant documents—browsing research papers, comparing products, or discovering new music. For these tasks, metrics like MAP and NDCG that evaluate entire ranked lists are appropriate.
But many search tasks have a fundamentally different character: the user seeks one specific answer, and finding it anywhere in the results—not just at position 1—constitutes success, albeit with varying degrees of satisfaction. Consider question answering ("Who wrote Romeo and Juliet?"), navigational search for a specific site, known-item retrieval in a library catalog, or entity lookup in a knowledge graph: one correct result is all the user needs.
For these tasks, Mean Reciprocal Rank (MRR) is the natural metric. It measures one thing elegantly: How far down the list must the user go to find the first relevant result?
By the end of this page, you will understand MRR from first principles—its definition, computation, statistical properties, and appropriate use cases. You will learn when MRR is the right choice, how it relates to other ranking metrics, and how to implement and interpret it correctly.
Before addressing Mean Reciprocal Rank, we define Reciprocal Rank (RR) for a single query.
Definition:
Let $\text{rank}_q$ denote the position (1-indexed) of the first relevant item in the ranked list for query $q$. The Reciprocal Rank is:
$$\text{RR}(q) = \frac{1}{\text{rank}_q}$$
If no relevant item appears in the list, RR is typically defined as 0.
Properties:
| Rank of First Relevant | Reciprocal Rank | User Experience |
|---|---|---|
| 1 | 1.000 | Perfect — answer at top |
| 2 | 0.500 | Good — answer visible without scrolling |
| 3 | 0.333 | Acceptable — requires some scanning |
| 5 | 0.200 | Marginal — requires scrolling |
| 10 | 0.100 | Poor — bottom of first page (typical 10-result page) |
| 20 | 0.050 | Very poor — second page |
| Not found | 0.000 | Failure — answer not in results |
The Intuition Behind Reciprocal:
Why $1/\text{rank}$ specifically? This choice has elegant properties:
Search-cost interpretation: If the user examines results sequentially and stops at the first relevant one, the number of items examined is exactly $\text{rank}_q$, so RR is the reciprocal of the user's search cost.
Discounting: Like NDCG, RR applies a position discount, but with $1/r$ rather than $1/\log(r)$. This is a steeper discount—position 2 is worth half of position 1, not 63%.
Parsimony: The reciprocal is the simplest monotonically decreasing function of rank that equals 1 at rank 1 and stays within $(0, 1]$.
Example Calculation:
For a ranked list: [irrelevant, irrelevant, relevant, irrelevant, relevant]
The first relevant item is at position 3, so: $$\text{RR} = \frac{1}{3} \approx 0.333$$
Note that the relevant item at position 5 doesn't affect RR—only the first matters.
```python
import numpy as np
from typing import Optional, List, Set, Any

def reciprocal_rank(relevance: List[int]) -> float:
    """
    Compute Reciprocal Rank for a single query.

    Args:
        relevance: Binary relevance in ranked order (1=relevant, 0=not)

    Returns:
        RR value in [0, 1], or 0 if no relevant items
    """
    for i, rel in enumerate(relevance):
        if rel == 1:
            return 1.0 / (i + 1)  # i is 0-indexed, rank is 1-indexed
    return 0.0

def reciprocal_rank_from_items(
    ranked_items: List[Any],
    relevant_items: Set[Any]
) -> float:
    """
    Compute RR given ranked item IDs and relevant item set.

    Args:
        ranked_items: Item IDs in ranked order
        relevant_items: Set of relevant item IDs

    Returns:
        RR value
    """
    for i, item in enumerate(ranked_items):
        if item in relevant_items:
            return 1.0 / (i + 1)
    return 0.0

# Examples
print("Reciprocal Rank Examples")
print("=" * 50)

# Case 1: First result is relevant
rr1 = reciprocal_rank([1, 0, 0, 1, 0])
print(f"[1, 0, 0, 1, 0] → RR = {rr1:.4f} (first relevant at rank 1)")

# Case 2: Third result is first relevant
rr2 = reciprocal_rank([0, 0, 1, 0, 1])
print(f"[0, 0, 1, 0, 1] → RR = {rr2:.4f} (first relevant at rank 3)")

# Case 3: Relevant at position 10
rr3 = reciprocal_rank([0]*9 + [1])
print(f"[0]*9 + [1] → RR = {rr3:.4f} (first relevant at rank 10)")

# Case 4: No relevant items
rr4 = reciprocal_rank([0, 0, 0, 0, 0])
print(f"[0, 0, 0, 0, 0] → RR = {rr4:.4f} (no relevant items)")

# Using item IDs
print("\nWith Item IDs:")
ranked = ['doc_a', 'doc_b', 'doc_c', 'doc_d', 'doc_e']
relevant = {'doc_c', 'doc_e'}
rr5 = reciprocal_rank_from_items(ranked, relevant)
print(f"Ranked: {ranked}")
print(f"Relevant: {relevant}")
print(f"RR = {rr5:.4f} (doc_c found at rank 3)")
```

Mean Reciprocal Rank (MRR) is simply the arithmetic mean of Reciprocal Ranks across all queries:
$$\text{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \text{RR}(q) = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\text{rank}_q}$$
where $Q$ is the set of queries and $\text{rank}_q$ is the rank of the first relevant result for query $q$.
Properties of MRR:
```python
import numpy as np
from typing import List

def mean_reciprocal_rank(query_relevances: List[List[int]]) -> dict:
    """
    Compute MRR across multiple queries.

    Args:
        query_relevances: List of relevance lists (one per query)

    Returns:
        Dict with MRR and diagnostics
    """
    rr_values = []
    ranks = []

    for relevance in query_relevances:
        rr = 0.0
        rank = float('inf')
        for i, rel in enumerate(relevance):
            if rel == 1:
                rank = i + 1
                rr = 1.0 / rank
                break
        rr_values.append(rr)
        ranks.append(rank)

    return {
        'MRR': np.mean(rr_values),
        'per_query_RR': rr_values,
        'per_query_rank': ranks,
        'queries_without_relevant': sum(1 for r in ranks if r == float('inf')),
        'num_queries': len(query_relevances)
    }

# Example: Evaluate a QA system
queries = [
    # Query 1: Perfect - answer at position 1
    [1, 0, 0, 0, 0],
    # Query 2: Answer at position 3
    [0, 0, 1, 0, 1],
    # Query 3: Answer at position 2
    [0, 1, 1, 0, 0],
    # Query 4: Answer at position 5
    [0, 0, 0, 0, 1],
    # Query 5: No answer found
    [0, 0, 0, 0, 0],
]

result = mean_reciprocal_rank(queries)

print("MRR Computation Example")
print("=" * 60)
print(f"Number of queries: {result['num_queries']}")
print(f"Queries without relevant: {result['queries_without_relevant']}")
print()

print(f"{'Query':<10} {'First Relevant Rank':<22} {'RR':<10}")
print("-" * 45)
for i, (rr, rank) in enumerate(zip(result['per_query_RR'], result['per_query_rank'])):
    rank_str = str(rank) if rank != float('inf') else "Not found"
    print(f"Query {i+1:<4} {rank_str:<22} {rr:.4f}")

print("-" * 45)
print(f"MRR = {result['MRR']:.4f}")

# Breakdown of MRR calculation
print(f"\nMRR = ({' + '.join([f'{rr:.4f}' for rr in result['per_query_RR']])}) / {result['num_queries']}")
print(f"    = {sum(result['per_query_RR']):.4f} / {result['num_queries']}")
print(f"    = {result['MRR']:.4f}")
```

MRR is a simple arithmetic mean, giving equal weight to each query. In practice, some queries may be more important (higher frequency, higher business value). Weighted variants exist (discussed later on this page), but standard MRR remains the most common choice in research.
Understanding when to use MRR versus alternatives is crucial for correct evaluation.
MRR vs. MAP:
| Aspect | MRR | MAP |
|---|---|---|
| Focus | First relevant item only | All relevant items |
| Assumption | One answer suffices | Multiple answers valuable |
| Sensitivity | Rank of first relevant | Ranks of all relevant |
| Use case | QA, navigational search | Document retrieval, research |
MRR vs. NDCG:
| Aspect | MRR | NDCG |
|---|---|---|
| Relevance model | Binary (is there a relevant item?) | Graded (how relevant?) |
| What counts | Only first relevant | All items, weighted by position |
| Position discount | $1/\text{rank}$ (steep) | $1/\log(\text{rank})$ (gentle) |
| Use case | Single-answer tasks | Ranked browsing tasks |
MRR vs. Success@k (Hit@k):
Success@k (also called Hit Rate@k) is a binary metric: 1 if any relevant item appears in top-k, 0 otherwise.
| Aspect | MRR | Success@k |
|---|---|---|
| Position sensitivity | Yes (reciprocal) | No (binary at k) |
| Information content | Rich (continuous) | Sparse (0 or 1) |
| Statistical efficiency | Higher | Lower |
| Use case | Most single-answer tasks | Very rough evaluation |
```python
import numpy as np

def compare_metrics(relevance: list) -> dict:
    """Compare MRR with other metrics for a single query."""
    rel = np.array(relevance)
    n = len(rel)

    # MRR (from first relevant)
    mrr = 0.0
    for i, r in enumerate(relevance):
        if r > 0:
            mrr = 1.0 / (i + 1)
            break

    # Success@k for various k
    success = {k: 1 if np.sum(rel[:k]) > 0 else 0 for k in [1, 3, 5, 10]}

    # MAP (treating any rel > 0 as relevant)
    binary = (rel > 0).astype(float)
    total_rel = np.sum(binary)
    if total_rel > 0:
        cumsum = np.cumsum(binary)
        precisions = cumsum / np.arange(1, n + 1)
        ap = np.sum(precisions * binary) / total_rel
    else:
        ap = 0.0

    # NDCG (with binary relevance)
    positions = np.arange(1, n + 1)
    discounts = 1.0 / np.log2(positions + 1)
    gains = binary  # Using binary for fair comparison
    dcg = np.sum(gains * discounts)
    ideal_gains = np.sort(gains)[::-1]
    idcg = np.sum(ideal_gains * discounts)
    ndcg = dcg / idcg if idcg > 0 else 0.0

    return {
        'MRR': mrr,
        'Success@1': success[1],
        'Success@3': success[3],
        'Success@5': success[5],
        'AP': ap,
        'NDCG': ndcg
    }

# Different scenarios
scenarios = [
    {
        'name': 'Perfect: answer at rank 1',
        'relevance': [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
    },
    {
        'name': 'Good: answer at rank 2',
        'relevance': [0, 1, 0, 1, 0, 0, 0, 0, 0, 0]
    },
    {
        'name': 'Marginal: answer at rank 5',
        'relevance': [0, 0, 0, 0, 1, 1, 0, 0, 0, 0]
    },
    {
        'name': 'Many relevant, first at rank 3',
        'relevance': [0, 0, 1, 1, 1, 1, 1, 0, 0, 0]
    },
    {
        'name': 'One relevant at rank 10',
        'relevance': [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
    }
]

print("Comparison of Ranking Metrics Across Scenarios")
print("=" * 75)

for scenario in scenarios:
    metrics = compare_metrics(scenario['relevance'])
    print(f"\n{scenario['name']}")
    print(f"Relevance: {scenario['relevance']}")
    print(f"{'Metric':<15} {'Value':<10}")
    print("-" * 25)
    for metric, value in metrics.items():
        print(f"{metric:<15} {value:.4f}")

print("\n" + "=" * 75)
print("Key Observations:")
print("- MRR only cares about the first relevant item's position")
print("- Success@k is binary and loses position information within k")
print("- AP considers all relevant items and their positions")
print("- NDCG uses log discount (gentler than MRR's 1/rank)")
```

Use MRR when: (1) users need exactly one answer (QA, entity lookup); (2) finding the first relevant result constitutes success; (3) you want position sensitivity, but only for the first result. Avoid MRR when: (1) users benefit from multiple relevant items; (2) relevance is graded; (3) you need to evaluate recall or coverage.
Understanding MRR's statistical properties helps with interpretation and significance testing.
Expected Value Under Random Ranking:
If there are $R$ relevant items among $N$ total items, and items are ranked uniformly at random, what is the expected rank of the first relevant item?
For the first relevant item's position, we have:
$$\mathbb{E}[\text{rank}_1] = \frac{N + 1}{R + 1}$$
For sparse relevance ($R \ll N$), treating each position as relevant independently with probability $R/N$ gives a rough approximation for the expected RR under random ranking:

$$\mathbb{E}[\text{RR}] \approx \frac{R}{N} \ln\frac{N}{R}$$

In the special case $R = 1$, the exact value is $\mathbb{E}[\text{RR}] = H_N / N$, where $H_N = \sum_{i=1}^{N} 1/i$ is the $N$-th harmonic number. Either way, the random-ranking baseline is very small, which matters when interpreting absolute MRR values.
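As a sanity check, a small Monte Carlo simulation can be compared against these formulas. This is a sketch: the `simulate_random_rr` helper and the specific $(N, R)$ pairs below are illustrative choices, not values from the text.

```python
import numpy as np

def simulate_random_rr(N: int, R: int, n_trials: int = 20_000, seed: int = 0) -> float:
    """Estimate E[RR] when R relevant items are placed uniformly at random among N items."""
    rng = np.random.default_rng(seed)
    # Rank of the first relevant item = minimum of R distinct positions drawn from 1..N
    first_ranks = np.array([
        rng.choice(N, size=R, replace=False).min() + 1 for _ in range(n_trials)
    ])
    return float(np.mean(1.0 / first_ranks))

for N, R in [(100, 1), (100, 5), (1000, 10)]:
    sim = simulate_random_rr(N, R)
    approx = (R / N) * np.log(N / R)                   # sparse-relevance approximation
    exact_r1 = np.sum(1.0 / np.arange(1, N + 1)) / N   # exact value, but only when R == 1
    line = f"N={N:5d} R={R:2d}  simulated E[RR]={sim:.4f}  (R/N)ln(N/R)={approx:.4f}"
    if R == 1:
        line += f"  H_N/N={exact_r1:.4f}"
    print(line)
```

The approximation is crude (it ignores that relevant items occupy distinct positions), but it captures the order of magnitude of the random baseline.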
Variance:
RR has high variance because it's determined by a single position. A system might have RR = 1 for one query (rank 1) and RR = 0.1 for another (rank 10). This makes MRR estimates noisy with small query sets.
Confidence Intervals:
Given the skewed distribution of RR values (clustered near 1 or near 0), bootstrap confidence intervals are preferred over parametric approaches:
```python
import numpy as np
from scipy import stats

def mrr_with_bootstrap_ci(
    rr_values: list,
    n_bootstrap: int = 10000,
    confidence_level: float = 0.95
) -> dict:
    """
    Compute MRR with bootstrap confidence interval.

    Args:
        rr_values: List of RR values (one per query)
        n_bootstrap: Number of bootstrap samples
        confidence_level: Confidence level for CI

    Returns:
        Dict with MRR, CI, and statistics
    """
    rr = np.array(rr_values)
    n = len(rr)

    # Point estimate
    mrr = np.mean(rr)

    # Bootstrap
    rng = np.random.default_rng(42)
    bootstrap_mrrs = []
    for _ in range(n_bootstrap):
        sample_indices = rng.choice(n, size=n, replace=True)
        bootstrap_mrr = np.mean(rr[sample_indices])
        bootstrap_mrrs.append(bootstrap_mrr)

    bootstrap_mrrs = np.array(bootstrap_mrrs)

    # Confidence interval (percentile method)
    alpha = 1 - confidence_level
    ci_lower = np.percentile(bootstrap_mrrs, 100 * alpha / 2)
    ci_upper = np.percentile(bootstrap_mrrs, 100 * (1 - alpha / 2))

    # Standard error
    se = np.std(rr, ddof=1) / np.sqrt(n)

    return {
        'MRR': mrr,
        'SE': se,
        'CI_lower': ci_lower,
        'CI_upper': ci_upper,
        'Bootstrap_std': np.std(bootstrap_mrrs),
        'n_queries': n
    }

def compare_mrr_systems(rr_a: list, rr_b: list) -> dict:
    """
    Compare two systems using paired bootstrap test.

    Tests whether system A has significantly different MRR from system B.
    """
    rr_a = np.array(rr_a)
    rr_b = np.array(rr_b)
    assert len(rr_a) == len(rr_b), "Must have same queries"
    n = len(rr_a)

    # Observed difference
    observed_diff = np.mean(rr_a) - np.mean(rr_b)

    # Paired bootstrap
    rng = np.random.default_rng(42)
    n_bootstrap = 10000
    bootstrap_diffs = []
    for _ in range(n_bootstrap):
        indices = rng.choice(n, size=n, replace=True)
        diff = np.mean(rr_a[indices]) - np.mean(rr_b[indices])
        bootstrap_diffs.append(diff)

    bootstrap_diffs = np.array(bootstrap_diffs)

    # CI for difference
    ci_lower = np.percentile(bootstrap_diffs, 2.5)
    ci_upper = np.percentile(bootstrap_diffs, 97.5)

    # Significant if CI doesn't contain 0
    significant = ci_lower > 0 or ci_upper < 0

    return {
        'MRR_A': np.mean(rr_a),
        'MRR_B': np.mean(rr_b),
        'Difference': observed_diff,
        'CI_difference': (ci_lower, ci_upper),
        'Significant_at_95': significant
    }

# Example: Evaluate QA system performance
np.random.seed(42)

# Simulate RR values for 100 queries
n_queries = 100
rr_values = []
for _ in range(n_queries):
    # 60% of queries have answer in top 3
    # 30% have answer in positions 4-10
    # 10% have no answer found
    p = np.random.random()
    if p < 0.6:
        rank = np.random.choice([1, 2, 3], p=[0.5, 0.3, 0.2])
    elif p < 0.9:
        rank = np.random.choice(range(4, 11))
    else:
        rank = float('inf')
    rr = 1/rank if rank != float('inf') else 0
    rr_values.append(rr)

result = mrr_with_bootstrap_ci(rr_values)

print("MRR with Bootstrap Confidence Interval")
print("=" * 50)
print(f"Number of queries: {result['n_queries']}")
print(f"MRR: {result['MRR']:.4f}")
print(f"Standard Error: {result['SE']:.4f}")
print(f"95% CI: [{result['CI_lower']:.4f}, {result['CI_upper']:.4f}]")
print(f"Bootstrap Std: {result['Bootstrap_std']:.4f}")

# Compare two systems
print("\nComparing Two QA Systems")
print("=" * 50)
# System A is slightly better
rr_system_a = rr_values
# System B is same with some noise
rr_system_b = [max(0, rr - 0.05 + 0.1 * np.random.random()) for rr in rr_values]

comparison = compare_mrr_systems(rr_system_a, rr_system_b)
print(f"System A MRR: {comparison['MRR_A']:.4f}")
print(f"System B MRR: {comparison['MRR_B']:.4f}")
print(f"Difference (A - B): {comparison['Difference']:.4f}")
print(f"95% CI for difference: ({comparison['CI_difference'][0]:.4f}, {comparison['CI_difference'][1]:.4f})")
print(f"Significant at 95%: {comparison['Significant_at_95']}")
```

Several variants of MRR address specific needs:
1. MRR@k (Truncated MRR):
Only consider the first k positions; if the first relevant item is after position k, treat RR as 0:
$$\text{RR@}k(q) = \begin{cases} 1/\text{rank}_q & \text{if } \text{rank}_q \leq k \\ 0 & \text{otherwise} \end{cases}$$
This is useful when users won't scroll past position k (e.g., MRR@10 for web search).
2. Weighted MRR:
Weight queries by importance (frequency, revenue, etc.):
$$\text{Weighted MRR} = \frac{\sum_q w_q \cdot \text{RR}(q)}{\sum_q w_q}$$
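A minimal sketch of the weighted variant, assuming per-query weights are available; the `weighted_mrr` helper and the RR values and traffic counts below are hypothetical illustrations.

```python
import numpy as np

def weighted_mrr(rr_values, weights):
    """Weighted MRR: each query's RR contributes in proportion to its weight."""
    rr = np.asarray(rr_values, dtype=float)
    w = np.asarray(weights, dtype=float)
    return float(np.sum(w * rr) / np.sum(w))

# Hypothetical example: weight each query by how often it is issued
rr_values = [1.0, 0.5, 0.0, 0.25]   # per-query reciprocal ranks
frequencies = [1000, 200, 50, 10]   # query traffic counts used as weights

print(f"Unweighted MRR: {np.mean(rr_values):.4f}")   # 0.4375
print(f"Weighted MRR:   {weighted_mrr(rr_values, frequencies):.4f}")  # ~0.875
```

Because head queries dominate the weights, a system that performs well on frequent queries scores much higher under the weighted variant than under standard MRR.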
3. Generalized Reciprocal Rank:
Replace $1/\text{rank}$ with a custom discount function $f(\text{rank})$:
$$\text{GRR}(q) = f(\text{rank}_q)$$
Common choices include the logarithmic discount $1/\log_2(r+1)$, the square-root discount $1/\sqrt{r}$, and exponential decay $e^{-(r-1)/\tau}$ (these are the discounts compared in the code below).
4. ERR (Expected Reciprocal Rank):
A cascade model where the probability of stopping depends on satisfaction at each position:
$$\text{ERR} = \sum_{r=1}^{n} \frac{1}{r} \, R_r \prod_{i=1}^{r-1} (1 - R_i)$$
where $R_i$ is the probability of satisfaction at position i (derived from graded relevance). ERR is sometimes confused with MRR but models different user behavior.
```python
import numpy as np
from typing import Callable, List

def mrr_at_k(rr_values: List[float], ranks: List[int], k: int) -> float:
    """
    Compute MRR@k: only count RR if first relevant is within top-k.

    Args:
        rr_values: Original RR values
        ranks: Rank of first relevant item for each query
        k: Cutoff position

    Returns:
        MRR@k value
    """
    truncated_rr = [
        rr if (r <= k and r != float('inf')) else 0.0
        for rr, r in zip(rr_values, ranks)
    ]
    return np.mean(truncated_rr)

def generalized_rr(rank: int, discount_fn: Callable[[int], float]) -> float:
    """
    Compute generalized reciprocal rank with custom discount.

    Args:
        rank: Rank of first relevant item (inf if none found)
        discount_fn: Function mapping rank to score

    Returns:
        Generalized RR value
    """
    if rank == float('inf'):
        return 0.0
    return discount_fn(rank)

def expected_reciprocal_rank(relevance: List[float], max_rel: float = 4) -> float:
    """
    Compute ERR (Expected Reciprocal Rank).

    Models cascade where user stops when satisfied.

    Args:
        relevance: Graded relevance scores in ranked order
        max_rel: Maximum relevance grade (for normalization)

    Returns:
        ERR value
    """
    # Convert relevance to satisfaction probability
    # R_i = (2^rel_i - 1) / 2^max_rel
    R = [(2**rel - 1) / (2**max_rel) for rel in relevance]

    err = 0.0
    cascade_prob = 1.0  # Probability of reaching this position

    for i, r_i in enumerate(R):
        position = i + 1
        # Contribution: (1/rank) * P(reach position) * P(satisfied here)
        err += (1 / position) * cascade_prob * r_i
        # Update cascade probability
        cascade_prob *= (1 - r_i)

    return err

# Examples
print("MRR Variants Demonstration")
print("=" * 60)

# Sample data: ranks of first relevant item for 10 queries
ranks = [1, 3, 2, 15, 5, 1, 8, float('inf'), 2, 6]
rr_values = [1/r if r != float('inf') else 0.0 for r in ranks]

print("Query ranks of first relevant:", ranks)
print("RR values:", [f"{rr:.3f}" for rr in rr_values])
print()

# Standard MRR
mrr_all = np.mean(rr_values)
print(f"Standard MRR: {mrr_all:.4f}")

# MRR@k for various k
for k in [3, 5, 10]:
    mrr_k = mrr_at_k(rr_values, ranks, k)
    print(f"MRR@{k}: {mrr_k:.4f}")

# Generalized RR with different discounts
print("\nGeneralized RR with Different Discounts (for rank=3):")
print("-" * 50)
rank = 3

discounts = {
    'Standard (1/r)': lambda r: 1/r,
    'Log (1/log2(r+1))': lambda r: 1/np.log2(r+1),
    'Sqrt (1/sqrt(r))': lambda r: 1/np.sqrt(r),
    'Exponential (e^(-(r-1)/3))': lambda r: np.exp(-(r-1)/3)
}

for name, fn in discounts.items():
    grr = generalized_rr(rank, fn)
    print(f"{name:<30}: {grr:.4f}")

# ERR example
print("\nERR (Expected Reciprocal Rank):")
print("-" * 50)
graded_relevance = [3, 1, 4, 0, 2]  # Graded on 0-4 scale
print(f"Graded relevance: {graded_relevance}")
err = expected_reciprocal_rank(graded_relevance, max_rel=4)
print(f"ERR: {err:.4f}")
```

MRR is the standard metric in several important domains:
| Domain | What 'Relevant' Means | Typical MRR Values | Notes |
|---|---|---|---|
| Question Answering | Contains correct answer | 0.3 - 0.6 (extractive QA) | Common in retrieval-based QA; MS MARCO passage ranking reports MRR@10 |
| Entity Linking | Correct entity identified | 0.4 - 0.8 | Knowledge graph linking |
| Navigational Search | User's intended page | 0.7 - 0.9 | Brand/site queries |
| Code Completion | User accepts suggestion | 0.3 - 0.5 | IDE autocomplete |
| Known-Item Retrieval | The specific document sought | 0.5 - 0.8 | Library search |
| Slot Filling | Correct value extracted | 0.4 - 0.7 | Information extraction |
Case Study: Question Answering Systems
In reading comprehension QA (e.g., SQuAD), the task is to find a text span that answers a question. The system returns candidate spans ranked by confidence, and MRR measures how quickly the correct span appears in that ranked list.
Case Study: Knowledge Graph Completion
In link prediction for knowledge graphs, given (subject, relation, ?), the system ranks candidate entities, and MRR measures how well the true entity is ranked among the full entity vocabulary. Reported values often sit in the 0.2-0.4 range; these seemingly low numbers reflect the difficulty of ranking among thousands of entities.
When MRR Values Seem Low:
MRR values in the 0.2-0.4 range are common in difficult tasks. Context matters:
MRR = 0.25 might seem poor, but if you're ranking among 10,000 candidates, it means the correct answer is typically in the top 4 positions—vastly better than random (rank ~5000). Always contextualize MRR against the baseline and candidate pool size.
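To make that comparison concrete, here is a back-of-the-envelope sketch. It assumes a single relevant item per query; the pool size of 10,000 and the MRR of 0.25 are the illustrative numbers from the paragraph above, not results from any system.

```python
import numpy as np

N = 10_000            # candidate pool size (illustrative)
observed_mrr = 0.25   # hypothetical system MRR

# Random baseline with one relevant item: E[rank] = (N+1)/2, E[RR] = H_N / N
random_expected_rank = (N + 1) / 2
random_mrr = np.sum(1.0 / np.arange(1, N + 1)) / N

print(f"Random baseline: expected rank ≈ {random_expected_rank:.0f}, MRR ≈ {random_mrr:.5f}")
print(f"Observed MRR = {observed_mrr} corresponds to a 'typical' first-relevant rank of ~{1 / observed_mrr:.0f}")
print(f"Improvement over the random-ranking MRR: ~{observed_mrr / random_mrr:.0f}x")
```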
```python
import numpy as np
from typing import List, Set, Any, Optional, Dict
from dataclasses import dataclass

@dataclass
class RRResult:
    """Container for Reciprocal Rank result."""
    rr: float
    rank_of_first: Optional[int]  # None if no relevant found
    found: bool

@dataclass
class MRRResult:
    """Container for MRR computation result."""
    mrr: float
    num_queries: int
    queries_with_result: int
    per_query_rr: List[float]
    per_query_rank: List[Optional[int]]
    ci_lower: float
    ci_upper: float

class MRREvaluator:
    """
    Production-grade MRR evaluator with comprehensive features.
    """

    def __init__(self, cutoff_k: Optional[int] = None):
        """
        Initialize MRR evaluator.

        Args:
            cutoff_k: Maximum rank to consider (None = unlimited)
        """
        self.cutoff_k = cutoff_k

    def reciprocal_rank(
        self,
        ranked_items: List[Any],
        relevant_items: Set[Any]
    ) -> RRResult:
        """
        Compute RR for a single query.

        Args:
            ranked_items: Items in ranked order
            relevant_items: Set of relevant items

        Returns:
            RRResult with RR value and diagnostics
        """
        max_pos = len(ranked_items)
        if self.cutoff_k is not None:
            max_pos = min(max_pos, self.cutoff_k)

        for i in range(max_pos):
            if ranked_items[i] in relevant_items:
                rank = i + 1
                return RRResult(
                    rr=1.0 / rank,
                    rank_of_first=rank,
                    found=True
                )

        return RRResult(rr=0.0, rank_of_first=None, found=False)

    def mean_reciprocal_rank(
        self,
        queries: List[Dict[str, Any]],
        n_bootstrap: int = 10000
    ) -> MRRResult:
        """
        Compute MRR across queries with bootstrap CI.

        Args:
            queries: List of dicts with 'ranked_items' and 'relevant_items'
            n_bootstrap: Bootstrap samples for CI

        Returns:
            MRRResult with MRR, CI, and per-query values
        """
        rr_results = [
            self.reciprocal_rank(q['ranked_items'], q['relevant_items'])
            for q in queries
        ]

        rr_values = np.array([r.rr for r in rr_results])
        ranks = [r.rank_of_first for r in rr_results]
        n = len(rr_values)

        mrr = np.mean(rr_values)
        queries_found = sum(1 for r in rr_results if r.found)

        # Bootstrap CI
        rng = np.random.default_rng(42)
        bootstrap_mrrs = []
        for _ in range(n_bootstrap):
            indices = rng.choice(n, size=n, replace=True)
            bootstrap_mrrs.append(np.mean(rr_values[indices]))

        ci_lower = np.percentile(bootstrap_mrrs, 2.5)
        ci_upper = np.percentile(bootstrap_mrrs, 97.5)

        return MRRResult(
            mrr=float(mrr),
            num_queries=n,
            queries_with_result=queries_found,
            per_query_rr=rr_values.tolist(),
            per_query_rank=ranks,
            ci_lower=float(ci_lower),
            ci_upper=float(ci_upper)
        )

# Demo: Evaluate a QA system
def demo():
    evaluator = MRREvaluator(cutoff_k=10)  # Only consider top 10

    # Simulated QA results
    queries = [
        {
            'question': 'What is the capital of France?',
            'ranked_items': ['Paris', 'Lyon', 'Marseille', 'Nice', 'Bordeaux'],
            'relevant_items': {'Paris'}
        },
        {
            'question': 'Who wrote Romeo and Juliet?',
            'ranked_items': ['Marlowe', 'Shakespeare', 'Jonson', 'Bacon', 'Oxford'],
            'relevant_items': {'Shakespeare'}
        },
        {
            'question': 'What year did WW2 end?',
            'ranked_items': ['1944', '1946', '1943', '1945', '1947'],
            'relevant_items': {'1945'}
        },
        {
            'question': 'Capital of Liechtenstein?',  # Harder question
            'ranked_items': ['Bern', 'Vienna', 'Zurich', 'Munich', 'Vaduz'],
            'relevant_items': {'Vaduz'}
        },
        {
            'question': 'Obscure trivia?',
            'ranked_items': ['wrong1', 'wrong2', 'wrong3', 'wrong4', 'wrong5'],
            'relevant_items': {'correct_answer'}  # Not in list
        }
    ]

    result = evaluator.mean_reciprocal_rank(queries)

    print("QA System Evaluation with MRR")
    print("=" * 60)
    print(f"{'Query':<25} {'Rank':<10} {'RR':<10}")
    print("-" * 45)
    for i, (q, rr, rank) in enumerate(zip(
        queries, result.per_query_rr, result.per_query_rank
    )):
        q_short = q['question'][:22] + '...' if len(q['question']) > 25 else q['question']
        rank_str = str(rank) if rank else 'Not found'
        print(f"{q_short:<25} {rank_str:<10} {rr:.4f}")

    print("-" * 45)
    print(f"MRR: {result.mrr:.4f}")
    print(f"95% CI: [{result.ci_lower:.4f}, {result.ci_upper:.4f}]")
    print(f"Queries with answer found: {result.queries_with_result}/{result.num_queries}")

if __name__ == "__main__":
    demo()
```

We have established MRR as the natural metric for single-answer retrieval tasks. Let's consolidate our understanding:
Mathematical Summary:
$$\text{RR}(q) = \frac{1}{\text{rank}_q}$$ where $\text{rank}_q$ is the position of the first relevant item.
$$\text{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \text{RR}(q)$$
What's Next:
With MRR established for single-answer retrieval, we turn to partial AUC (pAUC)—a metric that focuses evaluation on specific regions of the ROC curve, useful when only a portion of the operating range is practically relevant.
You now understand Mean Reciprocal Rank as the standard metric for single-answer retrieval tasks. MRR elegantly captures the quality of the first relevant result's position, which is precisely what matters in question answering, navigational search, and similar applications.