When you type a query into a search engine, the system doesn't present you with a simple yes/no answer. Instead, it returns a ranked list of results—documents ordered by predicted relevance. When a streaming service recommends movies, it doesn't merely classify each film as 'relevant' or 'not relevant'; it produces an ordered sequence where position reflects confidence and importance.
This shift from classification to ranking fundamentally changes how we must evaluate model performance. In classification, a prediction is either correct or incorrect. In ranking, correctness is position-dependent: a relevant document at position 1 is more valuable than the same document at position 100. The user who finds their answer in the first result has a qualitatively different experience from one who must scroll through pages of irrelevant content.
Precision@k and Recall@k are the foundational metrics that bridge classification thinking with ranking reality. They answer the questions every search and recommendation system must address: Of the top-k items we show the user, how many are actually relevant? And of all the relevant items that exist, how many appear in our top-k?
By the end of this page, you will understand Precision@k and Recall@k from first principles—their definitions, computational procedures, statistical properties, and practical applications. You will be able to implement these metrics, interpret their values correctly, and understand their relationships to both classification metrics and more sophisticated ranking measures.
Before defining metrics, we must precisely formulate the ranking problem. Consider a query $q$ (which may be explicit, like a search string, or implicit, like a user's context) and a corpus of $N$ items $\mathcal{D} = \{d_1, d_2, \ldots, d_N\}$. Each item has a true relevance to the query, denoted $\text{rel}(d_i, q) \in \{0, 1\}$ for binary relevance (relevant or not) or $\text{rel}(d_i, q) \in \{0, 1, 2, \ldots, L\}$ for graded relevance.
The Ranking Task:
A ranking model produces a permutation $\pi$ of items, where $\pi(i)$ denotes the item placed at rank $i$. Equivalently, we can describe this as a ranked list $[d_{\pi(1)}, d_{\pi(2)}, \ldots, d_{\pi(N)}]$ where the item at position 1 is ranked highest.
Key Notation:
| Symbol | Meaning |
|---|---|
| $k$ | The cut-off rank (we evaluate only the top-$k$ items) |
| $R$ | The set of all truly relevant items for query $q$ |
| $\text{rel}_i$ | Relevance of the item at rank $i$ (shorthand for $\text{rel}(\pi(i), q)$) |
| $R_k$ | Relevant items in the top-$k$: $\{\pi(i) : i \leq k \text{ and } \text{rel}_i = 1\}$ |
The ranking problem differs fundamentally from classification in that the output is an ordered list rather than a single label, the cost of an error depends on the position at which it occurs, and users typically inspect only a short prefix of that list.
User studies consistently show that click probability drops dramatically with position. In web search, position 1 receives 30-40% of clicks, position 2 receives about 15%, and positions beyond 10 receive negligible attention. A metric that ignores position fails to capture the user's actual experience.
Precision@k measures the proportion of relevant items among the top-$k$ ranked items. It answers the question: Of the items we showed the user, how many were actually relevant?
Formal Definition:
$$\text{Precision@}k = \frac{|R_k|}{k} = \frac{\text{Number of relevant items in top-}k}{k}$$
Equivalently, using indicator notation:
$$\text{Precision@}k = \frac{1}{k} \sum_{i=1}^{k} \mathbb{1}[\text{rel}_i = 1]$$
where $\mathbb{1}[\cdot]$ is the indicator function that equals 1 if the condition is true and 0 otherwise.
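To make the definition concrete, here is a minimal sketch of P@k computed directly from a binary relevance list; the list and values are illustrative, not taken from the worked example below.

```python
def precision_at_k(relevance: list[int], k: int) -> float:
    """P@k for a binary relevance list given in rank order (index 0 = rank 1)."""
    if k < 1:
        raise ValueError("k must be >= 1")
    # Count relevant items in the top-k and divide by k (not by the list length)
    return sum(relevance[:k]) / k

# Illustrative ranked list: 1 = relevant, 0 = not relevant
print(precision_at_k([1, 0, 1, 1, 0], k=5))  # 3 relevant in top-5 -> 0.6
```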
Properties:

- P@k always lies in $[0, 1]$: 0 when no top-$k$ item is relevant, 1 when all are.
- The denominator is always $k$, regardless of how many relevant items exist in the corpus.
- P@k is not monotonic in $k$: it can rise or fall as the cutoff grows.

Worked Example: Consider the top-10 results returned for a query, with binary relevance judgments shown below.
| Rank | Document | Relevant? | Cumulative Relevant | P@k |
|---|---|---|---|---|
| 1 | ML Tutorial | ✓ Yes | 1 | 1/1 = 1.000 |
| 2 | Python Basics | ✗ No | 1 | 1/2 = 0.500 |
| 3 | Deep Learning Guide | ✓ Yes | 2 | 2/3 = 0.667 |
| 4 | Neural Networks | ✓ Yes | 3 | 3/4 = 0.750 |
| 5 | Cooking Recipes | ✗ No | 3 | 3/5 = 0.600 |
| 6 | Data Science Intro | ✓ Yes | 4 | 4/6 = 0.667 |
| 7 | Random Blog Post | ✗ No | 4 | 4/7 = 0.571 |
| 8 | ML Research Paper | ✓ Yes | 5 | 5/8 = 0.625 |
| 9 | Weather Forecast | ✗ No | 5 | 5/9 = 0.556 |
| 10 | Algorithm Book | ✓ Yes | 6 | 6/10 = 0.600 |
Interpreting the Example:
Notice that P@k fluctuates as k increases: it starts at 1.000 because the first result is relevant, drops to 0.500 after the irrelevant item at rank 2, climbs back to 0.750 by rank 4, and ends at 0.600 at rank 10.
This non-monotonic behavior is characteristic of P@k. Unlike cumulative metrics, inserting an irrelevant item at any position reduces the precision at that cutoff and beyond (until enough relevant items accumulate).
A crucial property of P@k is that the denominator is always k—the cutoff you chose. This means P@k is independent of how many relevant items exist in the corpus. Whether there are 5 or 5000 relevant documents, P@10 still divides by 10. This property has important implications for comparing systems across queries with different numbers of relevant items.
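A concrete consequence, shown here as a small worked example: when fewer than $k$ relevant items exist, P@k cannot reach 1 even for a perfect ranking.

$$|R| = 3 \;\Rightarrow\; \max_{\pi}\,\text{Precision@}10 = \frac{3}{10} = 0.3$$

A low P@10 on such a query therefore reflects corpus scarcity, not necessarily a ranking failure.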
Recall@k measures the proportion of all relevant items that appear in the top-$k$ ranked items. It answers the question: Of all the relevant items that exist, how many did we surface in our top-k?
Formal Definition:
$$\text{Recall@}k = \frac{|R_k|}{|R|} = \frac{\text{Number of relevant items in top-}k}{\text{Total number of relevant items}}$$
Equivalently:
$$\text{Recall@}k = \frac{\sum_{i=1}^{k} \mathbb{1}[\text{rel}_i = 1]}{|R|}$$
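A matching sketch for R@k; the extra argument is the corpus-wide relevant count $|R|$, which the ranked list alone cannot provide. The numbers are illustrative and mirror the table below.

```python
def recall_at_k(relevance: list[int], k: int, total_relevant: int) -> float:
    """R@k for a binary relevance list in rank order; total_relevant is |R|."""
    if total_relevant <= 0:
        raise ValueError("R@k is undefined when |R| = 0")
    return sum(relevance[:k]) / total_relevant

# Illustrative: 3 relevant items found in the top-5, out of |R| = 8 in the corpus
print(recall_at_k([1, 0, 1, 1, 0], k=5, total_relevant=8))  # 3/8 = 0.375
```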
Properties:

- R@k always lies in $[0, 1]$ and equals 1 only when every relevant item appears in the top-$k$.
- R@k is monotonically non-decreasing in $k$.
- R@k is undefined when $|R| = 0$ (see the note below).

Worked Example: The same ranked list as before, now evaluated against a corpus known to contain $|R| = 8$ relevant documents (only 6 of which appear in the top 10).
| Rank | Document | Relevant? | Cumulative Relevant | R@k (\|R\| = 8) |
|---|---|---|---|---|
| 1 | ML Tutorial | ✓ Yes | 1 | 1/8 = 0.125 |
| 2 | Python Basics | ✗ No | 1 | 1/8 = 0.125 |
| 3 | Deep Learning Guide | ✓ Yes | 2 | 2/8 = 0.250 |
| 4 | Neural Networks | ✓ Yes | 3 | 3/8 = 0.375 |
| 5 | Cooking Recipes | ✗ No | 3 | 3/8 = 0.375 |
| 6 | Data Science Intro | ✓ Yes | 4 | 4/8 = 0.500 |
| 7 | Random Blog Post | ✗ No | 4 | 4/8 = 0.500 |
| 8 | ML Research Paper | ✓ Yes | 5 | 5/8 = 0.625 |
| 9 | Weather Forecast | ✗ No | 5 | 5/8 = 0.625 |
| 10 | Algorithm Book | ✓ Yes | 6 | 6/8 = 0.750 |
Interpreting the Example:
Unlike P@k, R@k is monotonically non-decreasing: it climbs from 0.125 at rank 1 to 0.750 at rank 10, staying flat across the irrelevant results at ranks 2, 5, 7, and 9. Even at k=10, recall is only 0.750 because two of the eight relevant documents lie beyond the top 10.
Recall can never decrease as k increases because we can only accumulate more relevant items (or stay flat if the new position is irrelevant).
The Coverage Interpretation:
Recall@k captures coverage—how much of the relevant content have we exposed to the user? In applications where users need comprehensive results (e.g., legal document discovery, prior art search in patents), high recall is critical. Missing even one relevant document could have serious consequences.
When a query has no relevant items (|R| = 0), Recall@k is undefined (division by zero). This edge case requires explicit handling: some systems report NaN, others report 0, and others exclude such queries from evaluation. Always document your handling of this case.
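One way to make that choice explicit in code, as a sketch; the policy names are illustrative, not a standard API.

```python
import math

def recall_at_k_safe(relevance: list[int], k: int, total_relevant: int,
                     empty_policy: str = "nan") -> float | None:
    """Recall@k with an explicit policy for queries that have no relevant items."""
    if total_relevant == 0:
        if empty_policy == "nan":
            return math.nan   # propagate "undefined" into downstream averages
        if empty_policy == "zero":
            return 0.0        # treat the query as a complete miss
        return None           # signal the caller to exclude this query
    return sum(relevance[:k]) / total_relevant
```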
Precision@k and Recall@k are intimately related through their shared numerator—the count of relevant items in the top-k. This creates a fundamental relationship:
$$\text{Precision@}k \times k = \text{Recall@}k \times |R|$$
Both sides equal $|R_k|$, the number of relevant items found.
The Precision-Recall Tradeoff:
As $k$ increases, Precision@k tends to fall (each additional position is more likely to be irrelevant once the best candidates are exhausted), while Recall@k can only rise or stay flat.

This creates the classic precision-recall tradeoff: a small $k$ favors precision at the cost of coverage, while a large $k$ improves recall but dilutes the result list with irrelevant items.
Mathematical Relationship:
We can derive one from the other:
$$\text{Recall@}k = \frac{k \cdot \text{Precision@}k}{|R|}$$
$$\text{Precision@}k = \frac{|R| \cdot \text{Recall@}k}{k}$$
```python
import numpy as np
import matplotlib.pyplot as plt

def compute_precision_recall_at_k(relevance_list: list[int], total_relevant: int):
    """
    Compute P@k and R@k for all values of k.

    Args:
        relevance_list: Binary list where 1 = relevant item at that rank
        total_relevant: Total number of relevant items in corpus |R|

    Returns:
        Dict with precision@k and recall@k arrays
    """
    relevance = np.array(relevance_list)
    n = len(relevance)

    # Cumulative relevant count at each position
    cumulative_relevant = np.cumsum(relevance)

    # P@k = cumulative_relevant / k
    k_values = np.arange(1, n + 1)
    precision_at_k = cumulative_relevant / k_values

    # R@k = cumulative_relevant / |R|
    recall_at_k = cumulative_relevant / total_relevant if total_relevant > 0 else np.zeros(n)

    return {
        'k': k_values,
        'precision': precision_at_k,
        'recall': recall_at_k,
        'cumulative_relevant': cumulative_relevant
    }

# Example: Ranked list with relevance judgments
# 1 = relevant, 0 = not relevant
ranked_relevance = [1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1]
total_relevant = 10  # Total relevant items in corpus

results = compute_precision_recall_at_k(ranked_relevance, total_relevant)

# Display results
print("Precision@k and Recall@k Analysis")
print("=" * 50)
print(f"Total items ranked: {len(ranked_relevance)}")
print(f"Total relevant in corpus: {total_relevant}")
print(f"Relevant in ranked list: {sum(ranked_relevance)}")
print()

print(f"{'k':>3} | {'Rel@k':>5} | {'P@k':>6} | {'R@k':>6}")
print("-" * 30)
for k in [1, 3, 5, 10, 15, 20]:
    if k <= len(ranked_relevance):
        print(f"{k:>3} | {int(results['cumulative_relevant'][k-1]):>5} | "
              f"{results['precision'][k-1]:>6.3f} | {results['recall'][k-1]:>6.3f}")

# Verify the relationship: P@k * k = R@k * |R|
print("\nVerifying P@k × k = R@k × |R|:")
for k in [5, 10]:
    lhs = results['precision'][k-1] * k
    rhs = results['recall'][k-1] * total_relevant
    print(f"k={k}: P@k × k = {lhs:.2f}, R@k × |R| = {rhs:.2f}")
```

Visualizing the Tradeoff:
Plotting P@k against R@k as k varies produces a precision-recall curve for the ranking. Unlike classification PR curves (which vary with threshold), ranking PR curves vary with cutoff $k$. The curve typically starts at high precision and low recall for small $k$, then moves toward higher recall and lower precision as $k$ grows, with recall eventually reaching 1.0 once every relevant item has been retrieved. A minimal plotting sketch follows.
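This sketch assumes the `results` dictionary from the code block above is in scope and that matplotlib is available.

```python
import matplotlib.pyplot as plt

# Assumes `results` from compute_precision_recall_at_k(...) above
plt.figure(figsize=(5, 4))
plt.plot(results['recall'], results['precision'], marker='o')
for k in (1, 5, 10, 20):
    plt.annotate(f"k={k}", (results['recall'][k - 1], results['precision'][k - 1]))
plt.xlabel("Recall@k")
plt.ylabel("Precision@k")
plt.title("Ranking PR curve (each point is a cutoff k)")
plt.grid(True)
plt.show()
```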
Implementing P@k and R@k correctly requires attention to several practical considerations. Let's build a robust implementation from first principles.
Input Representations:
Ranking metrics can be computed from various input formats: a binary relevance list already in rank order, a ranked list of item IDs paired with a set of relevant IDs, or raw model scores that must first be sorted into a ranking.
Algorithm Complexity:
| Operation | Time Complexity | Space Complexity |
|---|---|---|
| P@k from relevance list | O(k) | O(1) |
| R@k from relevance list | O(k) | O(1) |
| All P@k, R@k for k=1..n | O(n) | O(n) |
| Building from item IDs | O(k) with a hash set of relevant items | O(\|R\|) |
```python
import numpy as np
from typing import List, Set, Optional, Dict, Any

class RankingMetrics:
    """
    Production-grade implementation of ranking metrics.

    Handles edge cases, supports multiple input formats,
    and provides comprehensive metric computation.
    """

    def __init__(self, relevant_items: Set[Any], ranked_items: List[Any]):
        """
        Initialize with ground truth and ranked predictions.

        Args:
            relevant_items: Set of item IDs that are relevant
            ranked_items: List of item IDs in ranked order (position 0 = rank 1)
        """
        self.relevant_items = set(relevant_items)
        self.ranked_items = ranked_items
        self.n_relevant = len(self.relevant_items)
        self.n_ranked = len(ranked_items)

        # Precompute binary relevance array for efficiency
        self._relevance_array = np.array([
            1 if item in self.relevant_items else 0
            for item in ranked_items
        ])

        # Precompute cumulative relevance
        self._cumulative_relevant = np.cumsum(self._relevance_array)

    def precision_at_k(self, k: int) -> float:
        """
        Compute Precision@k.

        Args:
            k: Cutoff rank (1-indexed conceptually, but we handle internally)

        Returns:
            Precision@k value in [0, 1]

        Raises:
            ValueError: If k < 1
        """
        if k < 1:
            raise ValueError(f"k must be >= 1, got {k}")

        # Handle k larger than our ranked list
        effective_k = min(k, self.n_ranked)
        if effective_k == 0:
            return 0.0

        relevant_in_top_k = self._cumulative_relevant[effective_k - 1]
        return float(relevant_in_top_k) / k  # Note: divide by k, not effective_k

    def recall_at_k(self, k: int) -> Optional[float]:
        """
        Compute Recall@k.

        Args:
            k: Cutoff rank

        Returns:
            Recall@k value in [0, 1], or None if no relevant items exist
        """
        if k < 1:
            raise ValueError(f"k must be >= 1, got {k}")

        if self.n_relevant == 0:
            return None  # Undefined when no relevant items

        effective_k = min(k, self.n_ranked)
        if effective_k == 0:
            return 0.0

        relevant_in_top_k = self._cumulative_relevant[effective_k - 1]
        return float(relevant_in_top_k) / self.n_relevant

    def precision_recall_at_all_k(self) -> Dict[str, np.ndarray]:
        """
        Compute P@k and R@k for all k from 1 to n_ranked.

        Returns:
            Dictionary with 'k', 'precision', 'recall', 'f1' arrays
        """
        if self.n_ranked == 0:
            return {
                'k': np.array([]),
                'precision': np.array([]),
                'recall': np.array([]),
                'f1': np.array([])
            }

        k_values = np.arange(1, self.n_ranked + 1)
        precision = self._cumulative_relevant / k_values

        if self.n_relevant > 0:
            recall = self._cumulative_relevant / self.n_relevant
        else:
            recall = np.zeros(self.n_ranked)

        # F1@k = harmonic mean of P@k and R@k
        with np.errstate(divide='ignore', invalid='ignore'):
            f1 = 2 * precision * recall / (precision + recall)
            f1 = np.nan_to_num(f1)  # Handle 0/0 cases

        return {
            'k': k_values,
            'precision': precision,
            'recall': recall,
            'f1': f1
        }

    def r_precision(self) -> float:
        """
        Compute R-Precision: Precision at k=|R|.

        A single-number summary that balances precision and recall.
        """
        if self.n_relevant == 0:
            return 0.0
        return self.precision_at_k(self.n_relevant)

# Example usage with realistic data
if __name__ == "__main__":
    # Scenario: Product search results
    # User searched for "wireless headphones"

    relevant_products = {
        'sony_wh1000xm4', 'bose_qc45', 'apple_airpods_max',
        'sennheiser_momentum', 'jabra_elite_85h', 'audio_technica_m50x'
    }  # 6 relevant products

    # Model's ranked output (top 15 results)
    search_results = [
        'sony_wh1000xm4',        # Rank 1: Relevant ✓
        'cheap_wired_earbuds',   # Rank 2: Not relevant
        'bose_qc45',             # Rank 3: Relevant ✓
        'phone_case',            # Rank 4: Not relevant
        'apple_airpods_max',     # Rank 5: Relevant ✓
        'usb_cable',             # Rank 6: Not relevant
        'jabra_elite_85h',       # Rank 7: Relevant ✓
        'random_speaker',        # Rank 8: Not relevant
        'bluetooth_adapter',     # Rank 9: Not relevant
        'sennheiser_momentum',   # Rank 10: Relevant ✓
        'gaming_headset',        # Rank 11: Not relevant
        'microphone',            # Rank 12: Not relevant
        'audio_technica_m50x',   # Rank 13: Relevant ✓
        'laptop_stand',          # Rank 14: Not relevant
        'mouse_pad',             # Rank 15: Not relevant
    ]

    metrics = RankingMetrics(relevant_products, search_results)

    print("Product Search Ranking Evaluation")
    print("=" * 50)
    print(f"Total relevant products: {len(relevant_products)}")
    print(f"Products in ranked list: {len(search_results)}")
    print()

    # Report key cutoffs
    for k in [1, 3, 5, 10]:
        p_at_k = metrics.precision_at_k(k)
        r_at_k = metrics.recall_at_k(k)
        print(f"P@{k:2d} = {p_at_k:.3f}, R@{k:2d} = {r_at_k:.3f}")

    print(f"\nR-Precision (P@{len(relevant_products)}): {metrics.r_precision():.3f}")
```

Understanding the statistical properties of P@k and R@k is essential for proper interpretation and comparison.
Expected Values Under Random Ranking:
If items are ranked uniformly at random, what are the expected values of P@k and R@k?
Let $\rho = |R|/N$ be the base rate—the proportion of relevant items in the corpus.
$$\mathbb{E}[\text{P@}k] = \rho = \frac{|R|}{N}$$

$$\mathbb{E}[\text{R@}k] = \frac{k}{N}$$

Both hold exactly for a uniformly random permutation, since the expected number of relevant items among any $k$ positions is $k|R|/N$.
A good ranker should significantly outperform these baselines. If your P@10 equals the base rate ρ, your model is no better than random.
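These baselines are easy to verify with a small simulation; the parameters below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
N, n_relevant, k, n_trials = 1000, 50, 10, 20_000   # base rate rho = 0.05
corpus_relevance = np.array([1] * n_relevant + [0] * (N - n_relevant))

p_scores, r_scores = np.empty(n_trials), np.empty(n_trials)
for t in range(n_trials):
    ranking = rng.permutation(corpus_relevance)   # uniformly random ranking
    hits = ranking[:k].sum()
    p_scores[t] = hits / k
    r_scores[t] = hits / n_relevant

print(f"Simulated E[P@{k}] ≈ {p_scores.mean():.4f}   (rho = {n_relevant / N:.4f})")
print(f"Simulated E[R@{k}] ≈ {r_scores.mean():.4f}   (k/N = {k / N:.4f})")
```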
Variance and Confidence:
Under random ranking, treating each of the top-$k$ positions as an independent draw with success probability $\rho$ (a binomial approximation; the exact distribution is hypergeometric), the variance of P@k is:
$$\text{Var}[\text{P@}k] = \frac{\rho(1-\rho)}{k}$$
This has important implications: estimates at small $k$ are noisy (a single query's P@1 is either 0 or 1), the standard deviation shrinks only as $1/\sqrt{k}$, and reliable comparisons therefore require averaging over many queries rather than trusting a single query's score.
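As a quick worked example of the variance formula (numbers chosen for illustration), with $\rho = 0.5$ and $k = 10$:

$$\text{Var}[\text{P@}10] \approx \frac{0.5 \times 0.5}{10} = 0.025, \qquad \sigma \approx 0.16$$

so a single query's P@10 routinely swings by roughly ±0.16; averaging over $|Q|$ queries shrinks the standard error of the mean by a further factor of $\sqrt{|Q|}$.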
| P@k Range | Interpretation | Typical Action |
|---|---|---|
| 0.9 - 1.0 | Excellent: Almost all shown items are relevant | Possibly too conservative; check recall |
| 0.7 - 0.9 | Good: Strong majority of results are relevant | Solid performance for most applications |
| 0.5 - 0.7 | Moderate: About half the results are useful | Room for improvement; check feature quality |
| 0.3 - 0.5 | Weak: More noise than signal | Significant model improvements needed |
| < 0.3 | Poor: Mostly irrelevant results | Fundamental issues with ranking model |
These interpretations assume balanced relevance. If only 1% of items are relevant (|R|/N = 0.01), then P@10 = 0.3 means your model is 30× better than random—which is excellent! Always contextualize P@k against the base rate and the specific application requirements.
Averaging Across Queries:
In practice, we evaluate a ranking system across many queries and report average metrics:
$$\text{Mean P@}k = \frac{1}{|Q|} \sum_{q \in Q} \text{P@}k(q)$$
$$\text{Mean R@}k = \frac{1}{|Q|} \sum_{q \in Q} \text{R@}k(q)$$
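A minimal sketch of this macro-average; the query IDs and relevance lists below are hypothetical.

```python
import numpy as np

def mean_precision_at_k(per_query_relevance: dict[str, list[int]], k: int) -> float:
    """Macro-average P@k: compute P@k for each query, then average over queries."""
    per_query_scores = [sum(rels[:k]) / k for rels in per_query_relevance.values()]
    return float(np.mean(per_query_scores))

# Hypothetical binary relevance lists, one per query, in rank order
judgments = {
    "q1": [1, 1, 0, 0, 1],
    "q2": [0, 1, 0, 0, 0],
    "q3": [1, 0, 1, 1, 0],
}
print(f"Mean P@5 = {mean_precision_at_k(judgments, k=5):.3f}")  # (0.6 + 0.2 + 0.6) / 3
```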
Statistical Significance:
When comparing two ranking systems, use appropriate statistical tests on per-query scores: a paired t-test on the per-query differences in P@k, the non-parametric Wilcoxon signed-rank test when normality is doubtful, or bootstrap resampling over queries.
A typical threshold is p < 0.05 for claiming one system is significantly better.
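A sketch of such a comparison using SciPy; the per-query P@10 scores for the two systems are made up for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical per-query P@10 scores, paired by query
p10_system_a = np.array([0.6, 0.4, 0.7, 0.5, 0.8, 0.3, 0.6, 0.5, 0.4, 0.7])
p10_system_b = np.array([0.5, 0.4, 0.6, 0.5, 0.6, 0.2, 0.5, 0.4, 0.3, 0.6])

t_stat, p_ttest = stats.ttest_rel(p10_system_a, p10_system_b)
w_stat, p_wilcoxon = stats.wilcoxon(p10_system_a, p10_system_b)

print(f"Paired t-test:        t = {t_stat:.2f}, p = {p_ttest:.4f}")
print(f"Wilcoxon signed-rank: W = {w_stat:.1f}, p = {p_wilcoxon:.4f}")
```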
R-Precision is a special case of Precision@k where $k$ is set to $|R|$, the total number of relevant items for the query:
$$\text{R-Precision} = \text{Precision@}|R| = \frac{|R_{|R|}|}{|R|}$$
Why R-Precision is Elegant:
At $k = |R|$, something mathematically beautiful happens:
$$\text{R-Precision} = \text{Precision@}|R| = \text{Recall@}|R|$$
Proof: At $k = |R|$, the two denominators coincide: $\text{Precision@}|R| = |R_{|R|}| / |R|$ and $\text{Recall@}|R| = |R_{|R|}| / |R|$.
Both reduce to the same quantity!
Interpretation:
R-Precision asks: "If you could only show |R| items (the exact number that are relevant), how many of those would be correct?" This is a natural performance measure because the cutoff adapts to each query's number of relevant items, no arbitrary $k$ has to be chosen, and a perfect ranking always scores 1.0.
```python
import numpy as np

def r_precision(ranked_relevance: list[int], total_relevant: int) -> float:
    """
    Compute R-Precision: Precision at k = |R|.

    Args:
        ranked_relevance: Binary relevance of items in ranked order
        total_relevant: Total number of relevant items |R|

    Returns:
        R-Precision value
    """
    if total_relevant == 0:
        return 0.0

    # Look at top |R| positions
    k = total_relevant
    top_k = ranked_relevance[:k] if k <= len(ranked_relevance) else ranked_relevance

    # Count relevant in top k
    relevant_found = sum(top_k)

    return relevant_found / total_relevant

def compare_with_precision_recall(ranked_relevance: list[int], total_relevant: int):
    """
    Demonstrate that R-Precision = P@|R| = R@|R|
    """
    k = total_relevant
    relevant_in_top_k = sum(ranked_relevance[:k]) if k <= len(ranked_relevance) else sum(ranked_relevance)

    precision_at_R = relevant_in_top_k / k if k > 0 else 0
    recall_at_R = relevant_in_top_k / total_relevant if total_relevant > 0 else 0
    r_prec = r_precision(ranked_relevance, total_relevant)

    print(f"k = |R| = {k}")
    print(f"Relevant items in top-{k}: {relevant_in_top_k}")
    print(f"P@{k} = {relevant_in_top_k}/{k} = {precision_at_R:.4f}")
    print(f"R@{k} = {relevant_in_top_k}/{total_relevant} = {recall_at_R:.4f}")
    print(f"R-Precision = {r_prec:.4f}")
    print(f"P@|R| == R@|R|: {abs(precision_at_R - recall_at_R) < 1e-10}")

# Example 1: Perfect ranking (all relevant items at top)
print("Example 1: Perfect Ranking")
print("-" * 40)
# 5 relevant items, ranked at positions 1-5
ranked = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
compare_with_precision_recall(ranked, total_relevant=5)

print()

# Example 2: Imperfect ranking
print("Example 2: Imperfect Ranking")
print("-" * 40)
# 5 relevant items scattered in top 10
ranked = [1, 0, 1, 0, 1, 0, 0, 1, 0, 1]
compare_with_precision_recall(ranked, total_relevant=5)

print()

# Example 3: Worst case (all relevant after position |R|)
print("Example 3: Worst Ranking")
print("-" * 40)
# 3 relevant items, all ranked after position 3
ranked = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0]
compare_with_precision_recall(ranked, total_relevant=3)
```

R-Precision is particularly useful when you want a single-number summary that doesn't require choosing an arbitrary k. It's commonly used in TREC evaluations and when comparing systems that must adapt to queries with varying numbers of relevant documents.
P@k and R@k manifest differently across application domains. Understanding these differences is crucial for selecting appropriate values of k and interpreting results correctly.
| Domain | Typical k Values | Primary Focus | Key Considerations |
|---|---|---|---|
| Web Search | k=1,3,5,10 | P@k (user sees few results) | Position 1 dominates; P@1 often most important |
| E-commerce Recommendations | k=5,10,20 | Balance P@k and R@k | Revenue depends on clicks; explore-exploit tradeoff |
| Document Retrieval (Legal) | k=100,1000 | R@k (completeness vital) | Missing a document could lose a case |
| Medical Literature | k=20,50,100 | High R@k with acceptable P@k | Must not miss relevant studies; false positives cost time |
| Social Media Feeds | k=10,20,50 | P@k (engagement focus) | User attention is limited; stale content penalized |
| Question Answering | k=1,3,5 | P@k, especially P@1 | User expects answer in first result |
Web Search: The Position 1 Obsession
In web search, P@1 (also called Hit Rate@1) is often the most critical metric. User behavior studies show that the top position attracts the large majority of clicks (on the order of 30-40%, as noted earlier), that many users never scroll past the first few results, and that a failure at position 1 frequently leads to query reformulation or abandonment.
For Google and Bing, improving P@1 by even 0.1% translates to millions of better experiences daily.
Recommendation Systems: The Coverage Challenge
In recommendations, there's tension between precision (showing items the user is very likely to engage with right now) and recall or coverage (surfacing the full breadth of relevant items, including novel and long-tail content the user has not yet seen).
Overly precision-focused systems create "filter bubbles" where users never discover new content. Systems often track R@50 or R@100 to ensure diversity.
Legal/Medical: The "Can't Miss" Requirement
In high-stakes domains, recall often trumps precision: in legal e-discovery, a single missed document can lose a case; in systematic medical literature reviews, a missed study can bias clinical conclusions.
These domains often require R@k ≥ 0.95 for large k, accepting low precision as the cost of completeness.
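A small sketch of how such a requirement can be monitored: find the smallest cutoff at which a ranking reaches a target recall. The function name and data are illustrative.

```python
def smallest_k_for_recall(relevance: list[int], total_relevant: int,
                          target_recall: float = 0.95) -> int | None:
    """Smallest k with Recall@k >= target, or None if the ranked list never gets there."""
    found = 0
    for rank, rel in enumerate(relevance, start=1):
        found += rel
        if total_relevant > 0 and found / total_relevant >= target_recall:
            return rank
    return None

# Illustrative ranked list with |R| = 4 relevant items in the corpus
ranked = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1]
print(smallest_k_for_recall(ranked, total_relevant=4))  # -> 12
```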
While P@k and R@k are foundational, they have important limitations that motivate more sophisticated metrics.
Limitation 1: Position Blindness Within Top-k
P@k treats all positions within the top-k equally. Whether the 3 relevant items are at positions {1,2,3} or {8,9,10}, P@10 = 0.3 in both cases. But the user experience differs dramatically! This limitation motivates position-weighted metrics like NDCG and MRR.
Limitation 2: Binary Relevance Only
P@k and R@k assume binary relevance (relevant or not). In reality, relevance is often graded: a document may be perfectly relevant, highly relevant, marginally relevant, or not relevant at all, and binary metrics collapse these distinctions.
Graded relevance requires metrics like NDCG (covered in a later page).
Limitation 3: Arbitrary k Selection
The choice of k is somewhat arbitrary. Different k values can lead to different conclusions about which system is better. This motivates reporting several cutoffs (e.g., P@1, P@5, P@10), examining the full precision-recall curve, or using cutoff-free summaries such as R-Precision and Mean Average Precision.
Limitation 4: No Penalty for Specific Error Types
P@k doesn't distinguish between different kinds of mistakes: an off-topic but harmless result counts the same as an offensive or misleading one, and a missed critical document counts the same as a missed near-duplicate.
These distinctions require more expressive metrics.
The limitations of P@k and R@k have spawned dozens of alternative ranking metrics. Subsequent pages will cover Mean Average Precision (MAP), NDCG, MRR, and partial AUC—each addressing specific limitations. P@k and R@k remain foundational because they're intuitive, easy to compute, and often sufficiently informative for many applications.
```python
# Demonstrating limitations of P@k

# Example: Same P@k, very different user experience
ranking_a = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # Relevant at positions 1,2,3
ranking_b = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]  # Relevant at positions 8,9,10

# Both have identical P@10
p_at_10_a = sum(ranking_a) / 10  # = 0.3
p_at_10_b = sum(ranking_b) / 10  # = 0.3

print("Limitation 1: Position Blindness")
print(f"Ranking A (relevant at top): P@10 = {p_at_10_a}")
print(f"Ranking B (relevant at bottom): P@10 = {p_at_10_b}")
print("Same P@10, but Ranking A is clearly better for users!")
print()

# Example: Inconsistent winner at different k
ranking_x = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
ranking_y = [0, 0, 1, 1, 1, 0, 0, 0, 0, 0]

print("Limitation 3: k-Dependent Conclusions")
for k in [2, 3, 5, 10]:
    p_x = sum(ranking_x[:k]) / k
    p_y = sum(ranking_y[:k]) / k
    winner = "X" if p_x > p_y else ("Y" if p_y > p_x else "Tie")
    print(f"P@{k}: X={p_x:.2f}, Y={p_y:.2f} → Winner: {winner}")
print("Different k values give different winners!")
```

We have established Precision@k and Recall@k as the foundational metrics for evaluating ranked retrieval systems. Let's consolidate our understanding:
Mathematical Summary:
| Metric | Formula | Range | Monotonic in k? |
|---|---|---|---|
| P@k | $\frac{\sum_{i=1}^{k} \text{rel}_i}{k}$ | [0, 1] | No |
| R@k | $\frac{\sum_{i=1}^{k} \text{rel}_i}{\lvert R \rvert}$ | [0, 1] | Yes (non-decreasing) |
| R-Precision | $\frac{\sum_{i=1}^{\lvert R \rvert} \text{rel}_i}{\lvert R \rvert}$ | [0, 1] | N/A (single value at $k = \lvert R \rvert$) |
What's Next:
With P@k and R@k established, we're ready to examine Mean Average Precision (MAP)—a metric that elegantly combines precision at all relevant positions into a single summary statistic, addressing the arbitrary k-selection limitation.
You now understand Precision@k and Recall@k as the foundational metrics for ranking evaluation. This knowledge is prerequisite to understanding more sophisticated metrics like MAP, NDCG, and MRR—each of which builds upon and addresses limitations of these fundamental measures.