The metrics we've studied so far—Precision@k, Recall@k, and MAP—treat relevance as a binary property: an item is either relevant or not. But in practice, relevance is rarely black and white.
Consider rating a movie recommendation. Binary metrics cannot distinguish between placing a 5-star movie versus a 3-star movie at position 1, yet user satisfaction differs dramatically. Similarly, in web search, a perfect answer (one that directly contains what the user sought) differs qualitatively from a partially relevant page (one that contains only some useful information).
Normalized Discounted Cumulative Gain (NDCG) addresses this limitation with two key innovations: graded gains that reward highly relevant items more than marginally relevant ones, and a position discount that weights items near the top of the list more heavily.
NDCG has become the standard metric for recommendation systems, graded search evaluation, and any ranking task where relevance admits degrees.
By the end of this page, you will understand NDCG from first principles—from Cumulative Gain through Discounted Cumulative Gain to the normalized form. You will master the mathematical foundations, implementation details, variants, and practical considerations that make NDCG the industry standard for graded ranking evaluation.
We begin with the simplest aggregation of graded relevance: Cumulative Gain (CG).
Definition:
Given a ranked list of items with relevance scores $\text{rel}_1, \text{rel}_2, \ldots, \text{rel}_n$, the Cumulative Gain at position $k$ is simply the sum of relevance scores up to that position:
$$\text{CG@}k = \sum_{i=1}^{k} \text{rel}_i$$
Properties: CG@k is non-negative and non-decreasing in $k$; it uses the graded relevance values directly; and it is completely order-independent within the top $k$, since any permutation of those items yields the same sum.
The Critical Limitation:
The order-independence is precisely CG's weakness. Consider two rankings:
| Rank | Ranking A (relevance) | Ranking B (relevance) |
|---|---|---|
| 1 | 3 | 0 |
| 2 | 2 | 0 |
| 3 | 0 | 2 |
| 4 | 0 | 3 |
Both have CG@4 = 5, but Ranking A is clearly superior for users who see fewer results!
```python
import numpy as np

def cumulative_gain(relevance_scores: list, k: int = None) -> np.ndarray:
    """
    Compute Cumulative Gain at each position.

    Args:
        relevance_scores: Graded relevance scores in ranked order
        k: Maximum position to consider (None = all)

    Returns:
        Array of CG values at each position
    """
    rel = np.array(relevance_scores)
    if k is not None:
        rel = rel[:k]
    return np.cumsum(rel)

# Example: CG fails to distinguish ranking quality
ranking_a = [3, 2, 0, 0]  # Best items first
ranking_b = [0, 0, 2, 3]  # Best items last

cg_a = cumulative_gain(ranking_a)
cg_b = cumulative_gain(ranking_b)

print("Cumulative Gain Comparison")
print("=" * 45)
print(f"{'Position':<10} {'CG (Ranking A)':<18} {'CG (Ranking B)':<18}")
print("-" * 45)
for i in range(4):
    print(f"{i+1:<10} {cg_a[i]:<18} {cg_b[i]:<18}")

print(f"\nFinal CG@4: A = {cg_a[-1]}, B = {cg_b[-1]}")
print("Despite identical final CG, Ranking A is clearly superior!")
print("CG cannot capture the value of placing good items early.")
```

CG is rarely used as a standalone metric. Its purpose is pedagogical—it establishes the baseline of summing relevance, upon which we build the position-discounting mechanism of DCG.
Discounted Cumulative Gain (DCG) introduces a position-based discount that reduces the contribution of items at lower ranks. The intuition: users are less likely to see, examine, or benefit from items further down the list.
Two Common Formulations:
Formulation 1 (Original, Järvelin & Kekäläinen 2002):
$$\text{DCG@}k = \text{rel}_1 + \sum_{i=2}^{k} \frac{\text{rel}_i}{\log_2(i)}$$
The first position is not discounted; subsequent positions are discounted by $\log_2(i)$.
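A minimal sketch of how Formulation 1 can be computed (note that because $\log_2 2 = 1$, position 2 is also effectively undiscounted):

```python
import numpy as np

def dcg_original(relevance_scores: list, k: int = None) -> float:
    """DCG in the original Järvelin & Kekäläinen form:
    rel_1 + sum_{i>=2} rel_i / log2(i), with raw (linear) gains."""
    rel = np.array(relevance_scores, dtype=float)
    if k is not None:
        rel = rel[:k]
    if len(rel) == 0:
        return 0.0
    positions = np.arange(1, len(rel) + 1)
    discounts = np.ones(len(rel))
    discounts[1:] = 1.0 / np.log2(positions[1:])  # positions 2, 3, ...; log2(2) = 1
    return float(np.sum(rel * discounts))

print(f"Original-form DCG@4 for [3, 2, 0, 0]: {dcg_original([3, 2, 0, 0]):.4f}")  # 3 + 2 = 5.0
```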
Formulation 2 (Industry Standard, stronger emphasis on top positions):
$$\text{DCG@}k = \sum_{i=1}^{k} \frac{2^{\text{rel}_i} - 1}{\log_2(i + 1)}$$
This formulation applies the discount $1/\log_2(i+1)$ from position 1 onward (where the factor is exactly 1) and replaces the raw relevance with the exponential gain $2^{\text{rel}_i} - 1$, which amplifies the contribution of highly relevant items.
The Exponential Gain (2^rel - 1):
With the exponential numerator, a relevance of 0 yields a gain of 0, relevance 1 yields 1, relevance 2 yields 3, relevance 3 yields 7, and relevance 4 yields 15.
This creates a non-linear relationship where highly relevant items contribute disproportionately more than marginally relevant ones.
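To make "disproportionately more" concrete, compare how much a higher-grade item outweighs a grade-1 item under each gain:

```python
# Ratio of a grade-g item's gain to a grade-1 item's gain, linear vs. exponential
for g in [1, 2, 3, 4]:
    linear_ratio = g / 1
    exp_ratio = (2**g - 1) / (2**1 - 1)
    print(f"grade {g} vs grade 1:  linear {linear_ratio:.0f}x,  exponential {exp_ratio:.0f}x")
```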
| Position i | log₂(i+1) | Discount Factor 1/log₂(i+1) |
|---|---|---|
| 1 | 1.00 | 1.000 |
| 2 | 1.58 | 0.631 |
| 3 | 2.00 | 0.500 |
| 4 | 2.32 | 0.431 |
| 5 | 2.58 | 0.387 |
| 10 | 3.46 | 0.289 |
| 20 | 4.39 | 0.228 |
| 50 | 5.67 | 0.176 |
| 100 | 6.66 | 0.150 |
Interpreting the Discount:
The logarithmic discount means that position 2 contributes about 63% as much as position 1, position 5 about 39%, position 10 about 29%, and even position 100 still about 15%.
This models the empirical observation that user attention drops rapidly with position, but not instantly. The logarithm provides a "soft" decay—gentler than exponential but still significant.
```python
import numpy as np

def dcg_at_k(relevance_scores: list, k: int = None,
             use_exponential_gain: bool = True) -> float:
    """
    Compute Discounted Cumulative Gain at position k.

    Args:
        relevance_scores: Graded relevance scores in ranked order
        k: Cutoff position (None = use all)
        use_exponential_gain: If True, use (2^rel - 1); if False, use rel

    Returns:
        DCG@k value
    """
    rel = np.array(relevance_scores, dtype=float)
    if k is not None:
        rel = rel[:k]
    n = len(rel)
    if n == 0:
        return 0.0

    # Position discount: 1 / log2(i + 1) for i = 1, 2, ..., n
    positions = np.arange(1, n + 1)
    discounts = 1.0 / np.log2(positions + 1)

    # Gain: either exponential or linear
    if use_exponential_gain:
        gains = (2 ** rel) - 1
    else:
        gains = rel

    return np.sum(gains * discounts)

def dcg_at_all_k(relevance_scores: list, use_exponential_gain: bool = True) -> np.ndarray:
    """Compute DCG at all cutoffs from 1 to n."""
    rel = np.array(relevance_scores, dtype=float)
    n = len(rel)
    positions = np.arange(1, n + 1)
    discounts = 1.0 / np.log2(positions + 1)
    if use_exponential_gain:
        gains = (2 ** rel) - 1
    else:
        gains = rel
    return np.cumsum(gains * discounts)

# Compare the two rankings from before
ranking_a = [3, 2, 0, 0]  # Best items first
ranking_b = [0, 0, 2, 3]  # Best items last

dcg_a = dcg_at_all_k(ranking_a)
dcg_b = dcg_at_all_k(ranking_b)

print("DCG Comparison (with exponential gain)")
print("=" * 55)
print(f"{'Position':<10} {'DCG (Ranking A)':<20} {'DCG (Ranking B)':<20}")
print("-" * 55)
for i in range(4):
    print(f"{i+1:<10} {dcg_a[i]:<20.4f} {dcg_b[i]:<20.4f}")

print(f"\nDCG@4: Ranking A = {dcg_a[-1]:.4f}, Ranking B = {dcg_b[-1]:.4f}")
print(f"Improvement of A over B: {100*(dcg_a[-1] - dcg_b[-1])/dcg_b[-1]:.1f}%")
print("\nNow DCG successfully captures that Ranking A is superior!")

# Show the contribution breakdown
print("\n" + "=" * 60)
print("Contribution Breakdown for Ranking A (rel = [3, 2, 0, 0]):")
print("-" * 60)
for i, rel in enumerate(ranking_a):
    pos = i + 1
    discount = 1 / np.log2(pos + 1)
    gain = (2 ** rel) - 1
    contribution = gain * discount
    print(f"Position {pos}: gain = 2^{rel}-1 = {gain}, "
          f"discount = 1/log₂({pos}+1) = {discount:.4f}, "
          f"contribution = {contribution:.4f}")
```

DCG's absolute values depend on the number and levels of relevant items, making it hard to compare across queries. A query with five highly relevant items will have higher DCG than a query with two marginally relevant items, regardless of ranking quality.
Normalization via the Ideal Ranking:
NDCG normalizes DCG by the Ideal DCG (IDCG)—the DCG achieved by the perfect ranking that places all items in decreasing order of relevance:
$$\text{NDCG@}k = \frac{\text{DCG@}k}{\text{IDCG@}k}$$
where IDCG@k is the DCG@k of the ideal ranking (items sorted by relevance, descending).
Properties of NDCG: it is bounded in $[0, 1]$; it equals 1 exactly when items appear in non-increasing order of relevance (relative to the chosen ideal ranking); and, thanks to the normalization, it can be averaged across queries with different numbers and grades of relevant items.
Edge Case: IDCG = 0
If no items are relevant (all relevance = 0), IDCG@k = 0, making NDCG undefined. Common conventions: define NDCG as 0 (treat the query as a failure), define it as 1 (the system could not have done better), or exclude such queries from the average entirely.
Choose based on your application semantics and document the choice.
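A minimal sketch of making the convention explicit; the `zero_idcg_value` parameter is illustrative, not a standard API:

```python
import numpy as np

def ndcg_with_zero_idcg_convention(relevance: list, k: int = None,
                                   zero_idcg_value: float = 0.0) -> float:
    """NDCG@k that returns a configurable value when IDCG is zero.

    zero_idcg_value: 0.0 treats the query as a failure, 1.0 treats it as
    trivially perfect, float('nan') lets np.nanmean drop it from averages.
    """
    rel = np.array(relevance if k is None else relevance[:k], dtype=float)
    if len(rel) == 0:
        return zero_idcg_value
    discounts = 1.0 / np.log2(np.arange(2, len(rel) + 2))
    gains = 2 ** rel - 1
    idcg = np.sum(np.sort(gains)[::-1] * discounts)
    if idcg == 0:
        return zero_idcg_value
    return float(np.sum(gains * discounts) / idcg)

no_relevant = [0, 0, 0, 0]
print(ndcg_with_zero_idcg_convention(no_relevant, zero_idcg_value=0.0))           # 0.0
print(ndcg_with_zero_idcg_convention(no_relevant, zero_idcg_value=float('nan')))  # nan
```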
```python
import numpy as np
from typing import Optional

def ndcg_at_k(relevance_scores: list, k: int = None,
              use_exponential_gain: bool = True) -> float:
    """
    Compute NDCG@k (Normalized Discounted Cumulative Gain).

    Args:
        relevance_scores: Graded relevance in ranked order
        k: Cutoff position (None = use all)
        use_exponential_gain: Use (2^rel - 1) if True, else rel

    Returns:
        NDCG@k value in [0, 1]
    """
    rel = np.array(relevance_scores, dtype=float)
    if k is not None:
        rel = rel[:k]
    n = len(rel)
    if n == 0:
        return 0.0

    # Compute DCG of the given ranking
    positions = np.arange(1, n + 1)
    discounts = 1.0 / np.log2(positions + 1)
    if use_exponential_gain:
        gains = (2 ** rel) - 1
    else:
        gains = rel
    dcg = np.sum(gains * discounts)

    # Compute IDCG (ideal ranking: sort gains descending)
    ideal_gains = np.sort(gains)[::-1]
    idcg = np.sum(ideal_gains * discounts)

    # Handle IDCG = 0
    if idcg == 0:
        return 0.0  # No relevant items

    return dcg / idcg

def ndcg_at_all_k(relevance_scores: list, use_exponential_gain: bool = True) -> np.ndarray:
    """Compute NDCG at all cutoffs from 1 to n."""
    n = len(relevance_scores)
    return np.array([ndcg_at_k(relevance_scores, k=i+1,
                               use_exponential_gain=use_exponential_gain)
                     for i in range(n)])

# Example comparisons
print("NDCG Demonstration")
print("=" * 60)

# Perfect ranking
perfect = [3, 3, 2, 1, 0, 0]
ndcg_perfect = ndcg_at_k(perfect)
print(f"Perfect ranking [3,3,2,1,0,0]: NDCG = {ndcg_perfect:.4f}")

# Good but imperfect
good = [3, 2, 3, 1, 0, 0]  # Swapped positions 2 and 3
ndcg_good = ndcg_at_k(good)
print(f"Slightly off [3,2,3,1,0,0]: NDCG = {ndcg_good:.4f}")

# Poor ranking
poor = [0, 0, 1, 2, 3, 3]  # Reversed
ndcg_poor = ndcg_at_k(poor)
print(f"Reversed [0,0,1,2,3,3]: NDCG = {ndcg_poor:.4f}")

# NDCG at different cutoffs
print("\nNDCG at Different Cutoffs:")
print("-" * 40)
relevance = [3, 0, 2, 0, 1, 3, 0, 0, 2, 1]
ndcgs = ndcg_at_all_k(relevance)
for k in [1, 3, 5, 10]:
    print(f"NDCG@{k:2d} = {ndcgs[k-1]:.4f}")

# Comparing two systems
print("\nComparing Two Recommendation Systems:")
print("-" * 50)
# User's true ratings for 8 items
true_ratings = [5, 4, 3, 2, 1, 3, 0, 4]  # Indices 0-7

# System A's ranking: item indices in order shown
system_a_order = [0, 7, 1, 3, 5, 2, 4, 6]  # Shows item 0 first, then 7, etc.
system_a_rels = [true_ratings[i] for i in system_a_order]

# System B's ranking
system_b_order = [2, 5, 4, 6, 0, 1, 7, 3]
system_b_rels = [true_ratings[i] for i in system_b_order]

print(f"System A relevances: {system_a_rels}")
print(f"System B relevances: {system_b_rels}")
print(f"NDCG@5 - System A: {ndcg_at_k(system_a_rels, k=5):.4f}, "
      f"System B: {ndcg_at_k(system_b_rels, k=5):.4f}")
print(f"NDCG@8 - System A: {ndcg_at_k(system_a_rels, k=8):.4f}, "
      f"System B: {ndcg_at_k(system_b_rels, k=8):.4f}")
```

Normalization serves two purposes: (1) It bounds NDCG to [0,1], making values interpretable regardless of the relevance scale or number of relevant items. (2) It enables fair comparison across queries—a query with 10 highly relevant items and a query with 2 marginally relevant items can be aggregated into a single mean NDCG.
Computing IDCG@k correctly requires care. There are two common approaches, yielding slightly different semantics:
Approach 1: IDCG from the full corpus (Common in academia)
Sort all relevant items in the corpus by relevance, take the top-k, compute DCG. This is the maximum possible DCG@k if we had perfect knowledge and could select and rank any items.
Approach 2: IDCG from the returned list (Common in industry)
Sort only the items in the returned ranking by relevance, compute DCG@k. This measures how well you ranked within your own result set.
Key Difference:
Suppose you return 10 items but there are 100 relevant items in the corpus. Approach 2 can give NDCG = 1 even if you missed highly relevant items, as long as you correctly ordered what you returned; Approach 1 compares against the 10 best items in the entire corpus, so it penalizes missing relevant items.
Recommendation:
For evaluating retrieval/recommendation systems, Approach 1 is more rigorous as it penalizes poor recall. For evaluating pure ranking (where the item set is fixed), Approach 2 is appropriate.
```python
import numpy as np

def ndcg_approach_1(relevance_returned: list, all_relevances: list, k: int) -> float:
    """
    NDCG with IDCG computed from the full corpus.
    Penalizes missing highly relevant items.
    """
    rel = np.array(relevance_returned[:k], dtype=float)
    all_rel = np.array(all_relevances, dtype=float)
    n = len(rel)
    positions = np.arange(1, n + 1)
    discounts = 1.0 / np.log2(positions + 1)
    gains = (2 ** rel) - 1
    dcg = np.sum(gains * discounts)

    # IDCG from best k items in corpus
    ideal_gains = np.sort((2 ** all_rel) - 1)[::-1][:k]
    ideal_positions = np.arange(1, len(ideal_gains) + 1)
    ideal_discounts = 1.0 / np.log2(ideal_positions + 1)
    idcg = np.sum(ideal_gains * ideal_discounts)

    return dcg / idcg if idcg > 0 else 0.0

def ndcg_approach_2(relevance_returned: list, k: int) -> float:
    """
    NDCG with IDCG computed from the returned list only.
    Measures ranking quality of returned items.
    """
    rel = np.array(relevance_returned[:k], dtype=float)
    n = len(rel)
    positions = np.arange(1, n + 1)
    discounts = 1.0 / np.log2(positions + 1)
    gains = (2 ** rel) - 1
    dcg = np.sum(gains * discounts)

    # IDCG from returned list only
    ideal_gains = np.sort(gains)[::-1]
    idcg = np.sum(ideal_gains * discounts)

    return dcg / idcg if idcg > 0 else 0.0

# Scenario: System returns 5 items, but corpus has more relevant items
# True corpus relevances (indices 0-9)
corpus_relevances = [5, 5, 4, 3, 3, 2, 2, 1, 1, 0]  # Some highly relevant

# System returns items at indices [7, 2, 9, 5, 4] (missed items 0, 1)
returned_indices = [7, 2, 9, 5, 4]
returned_relevances = [corpus_relevances[i] for i in returned_indices]
# returned_relevances = [1, 4, 0, 2, 3]

k = 5
print("IDCG Approach Comparison")
print("=" * 60)
print(f"Corpus has items with relevances: {corpus_relevances}")
print(f"System returned items with relevances: {returned_relevances}")
print(f"(Missed the two highest-relevance items with rel=5)")
print()

ndcg_1 = ndcg_approach_1(returned_relevances, corpus_relevances, k)
ndcg_2 = ndcg_approach_2(returned_relevances, k)

print(f"Approach 1 (IDCG from corpus):   NDCG@{k} = {ndcg_1:.4f}")
print(f"Approach 2 (IDCG from returned): NDCG@{k} = {ndcg_2:.4f}")
print()
print("Approach 1 penalizes missing the highly relevant items.")
print("Approach 2 only measures how well we ranked what we returned.")
```

NDCG values are not comparable if computed with different IDCG approaches. Always specify whether IDCG is computed from the full corpus or only the returned items. When using libraries like scikit-learn, check the documentation to understand which approach is implemented.
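For reference, a minimal example with scikit-learn's `ndcg_score` (assuming scikit-learn is installed). It takes true relevances and predicted scores per query rather than a pre-sorted relevance list; check the documentation of your version for its exact gain and normalization conventions before comparing its numbers with the implementations above:

```python
import numpy as np
from sklearn.metrics import ndcg_score  # requires scikit-learn

# One query: true graded relevance of four candidate items, and the model's scores.
y_true = np.asarray([[3, 2, 0, 1]])           # relevance judgments
y_score = np.asarray([[0.9, 0.7, 0.8, 0.1]])  # model would show items in order 0, 2, 1, 3

print(f"scikit-learn ndcg_score@3 = {ndcg_score(y_true, y_score, k=3):.4f}")
```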
NDCG relates to other ranking metrics in important ways:
NDCG vs. MAP:
| Aspect | NDCG | MAP |
|---|---|---|
| Relevance | Graded (0, 1, 2, ...) | Binary (relevant or not) |
| Position weighting | Explicit log discount | Implicit via precision at relevant positions |
| Normalization | By ideal ranking | By total relevant items |
| When to use | Graded relevance available | Only binary judgments |
NDCG with Binary Relevance:
When relevance is binary (0 or 1), the exponential and linear gains coincide ($2^1 - 1 = 1$ and $2^0 - 1 = 0$), so NDCG reduces to a log-discounted, normalized count of relevant items. It then behaves similarly to, but not identically to, MAP: both reward placing relevant items early, but they weight positions differently and need not agree on which of two rankings is better, as the sketch below illustrates.
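A small sketch, using helper functions that mirror the definitions above, showing that identical binary judgments still yield different values under the two metrics:

```python
import numpy as np

def ndcg_binary(relevance):
    """NDCG over the full list; with rel in {0, 1}, 2^rel - 1 equals rel."""
    rel = np.array(relevance, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, len(rel) + 2))
    dcg = np.sum(rel * discounts)
    idcg = np.sum(np.sort(rel)[::-1] * discounts)
    return dcg / idcg if idcg > 0 else 0.0

def average_precision(relevance):
    """AP over the full list, assuming all relevant items appear in it."""
    rel = np.array(relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    precisions = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    return np.sum(precisions * rel) / rel.sum()

ranking = [1, 0, 0, 1, 0, 1]
print(f"Binary ranking {ranking}: NDCG = {ndcg_binary(ranking):.4f}, "
      f"AP = {average_precision(ranking):.4f}")
# The two metrics weight positions differently, so their values need not agree.
```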
NDCG vs. Precision@k:
| Aspect | NDCG@k | Precision@k |
|---|---|---|
| Position within top-k | Matters (log discount) | Doesn't matter (all equal) |
| Relevance grades | Utilized | Collapsed to binary |
| Use case | Fine-grained ranking evaluation | Simple binary evaluation |
Connection to Expected Utility:
NDCG can be interpreted as a normalized expected utility under a user model where the probability of examining position $i$ is proportional to $1/\log_2(i+1)$ and the utility of an examined item is its gain: DCG is then proportional to the expected utility of the ranking, and dividing by IDCG expresses it as a fraction of the best achievable utility.
```python
import numpy as np

def all_metrics_comparison(relevance: list, k: int):
    """Compare NDCG, MAP, and P@k for a ranking."""
    rel = np.array(relevance[:k], dtype=float)
    n = len(rel)

    # NDCG (graded)
    positions = np.arange(1, n + 1)
    discounts = 1.0 / np.log2(positions + 1)
    gains = (2 ** rel) - 1
    dcg = np.sum(gains * discounts)
    ideal_gains = np.sort(gains)[::-1]
    idcg = np.sum(ideal_gains * discounts)
    ndcg = dcg / idcg if idcg > 0 else 0.0

    # Binary conversion for P@k and MAP
    binary_rel = (rel > 0).astype(float)
    total_relevant = np.sum(binary_rel)

    # P@k
    precision_at_k = np.mean(binary_rel)

    # AP (assuming all relevant items are in the list)
    if total_relevant > 0:
        cumsum = np.cumsum(binary_rel)
        precisions = cumsum / np.arange(1, n + 1)
        ap = np.sum(precisions * binary_rel) / total_relevant
    else:
        ap = 0.0

    return {
        'NDCG': ndcg,
        'P@k': precision_at_k,
        'AP': ap
    }

# Compare rankings with same binary relevance but different grades
print("Metric Comparison: Same Binary Relevance, Different Grades")
print("=" * 70)

# Ranking A: Highly relevant items in relevant positions
ranking_a = [5, 4, 0, 0, 3]  # Binary: [1, 1, 0, 0, 1]

# Ranking B: Marginally relevant items in same positions
ranking_b = [1, 1, 0, 0, 1]  # Binary: [1, 1, 0, 0, 1]

metrics_a = all_metrics_comparison(ranking_a, k=5)
metrics_b = all_metrics_comparison(ranking_b, k=5)

print(f"Ranking A (high grades): {ranking_a}")
print(f"Ranking B (low grades): {ranking_b}")
print(f"Both have same binary pattern: [1, 1, 0, 0, 1]")
print()

print(f"{'Metric':<12} {'Ranking A':<15} {'Ranking B':<15} {'Difference'}")
print("-" * 55)
for metric in ['NDCG', 'P@k', 'AP']:
    a_val = metrics_a[metric]
    b_val = metrics_b[metric]
    print(f"{metric:<12} {a_val:<15.4f} {b_val:<15.4f} {a_val - b_val:.4f}")

print()
print("NDCG distinguishes between high and low grades.")
print("P@k and AP (binary metrics) cannot see the difference!")
```

Use NDCG when you have graded relevance judgments (star ratings, annotation levels). Use MAP when you only have binary relevance. If you have graded relevance but want to compare with systems evaluated on binary, report both—you can always binarize grades for MAP computation.
Several NDCG variants address specific needs:
1. Alternative Gain Functions:
Beyond linear and exponential, custom gain functions can model domain-specific utility: for instance, a squared gain $g(r) = r^2$ that emphasizes high grades more than linear gain but less than exponential gain, or a gain proportional to an item's estimated revenue or expected watch time.
2. Alternative Discount Functions:
The logarithmic discount can be replaced: a steeper $1/i$ discount models users who rarely look past the first few results, while a gentler $1/\ln(i + e)$ discount suits interfaces where users routinely scan deep into the list.
3. ERR (Expected Reciprocal Rank):
A cascade-model variant where users stop after finding a satisfying result. Lower-ranked items only contribute if higher-ranked items didn't satisfy the user:
$$\text{ERR} = \sum_{r=1}^{n} \frac{1}{r} \prod_{i=1}^{r-1}(1 - R_i) \cdot R_r$$
where $R_i$ is the probability of satisfaction at position $i$.
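A minimal sketch of ERR. The mapping from grade to satisfaction probability, $R_i = (2^{\text{rel}_i} - 1)/2^{\text{rel}_{\max}}$, is one common choice and is an assumption here rather than part of the formula above:

```python
import numpy as np

def expected_reciprocal_rank(relevance: list, max_grade: float = None) -> float:
    """ERR under a cascade model: the user scans down and stops once satisfied."""
    rel = np.array(relevance, dtype=float)
    if max_grade is None:
        max_grade = rel.max() if len(rel) else 0.0
    if max_grade <= 0:
        return 0.0
    # Assumed grade-to-probability mapping: R_i = (2^rel_i - 1) / 2^max_grade
    probs = (2 ** rel - 1) / (2 ** max_grade)
    err = 0.0
    prob_reach = 1.0  # probability the user is still unsatisfied at this position
    for r, p in enumerate(probs, start=1):
        err += prob_reach * p / r
        prob_reach *= (1 - p)
    return err

print(f"ERR for [3, 2, 0, 1]: {expected_reciprocal_rank([3, 2, 0, 1], max_grade=3):.4f}")
print(f"ERR for [0, 1, 2, 3]: {expected_reciprocal_rank([0, 1, 2, 3], max_grade=3):.4f}")
```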
4. Intent-Aware NDCG:
For queries with multiple intents, compute NDCG per intent and aggregate:
$$\text{IA-NDCG} = \sum_{\text{intent } j} P(j) \cdot \text{NDCG}_j$$
where $P(j)$ is the probability of intent $j$.
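A short sketch of IA-NDCG, reusing the per-intent NDCG defined earlier; the intent probabilities and per-intent judgments below are hypothetical:

```python
import numpy as np

def ndcg_at_k(relevance, k=None):
    """NDCG@k with exponential gain and log2 discount, as defined earlier."""
    rel = np.array(relevance if k is None else relevance[:k], dtype=float)
    if len(rel) == 0:
        return 0.0
    discounts = 1.0 / np.log2(np.arange(2, len(rel) + 2))
    gains = 2 ** rel - 1
    idcg = np.sum(np.sort(gains)[::-1] * discounts)
    return float(np.sum(gains * discounts) / idcg) if idcg > 0 else 0.0

def intent_aware_ndcg(intent_probs, per_intent_relevance, k=None):
    """IA-NDCG: probability-weighted sum of per-intent NDCG values."""
    return sum(p * ndcg_at_k(rel, k) for p, rel in zip(intent_probs, per_intent_relevance))

# Hypothetical query with two intents: 70% of users mean intent 1, 30% intent 2.
# Each row gives the relevance of the same ranked items under one intent.
intent_probs = [0.7, 0.3]
per_intent_relevance = [
    [3, 2, 0, 1, 0],  # judgments under intent 1
    [0, 1, 3, 0, 2],  # judgments under intent 2
]
print(f"IA-NDCG@5 = {intent_aware_ndcg(intent_probs, per_intent_relevance, k=5):.4f}")
```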
```python
import numpy as np
from typing import Callable

def generalized_ndcg(
        relevance: list,
        k: int,
        gain_fn: Callable[[float], float] = lambda r: 2**r - 1,
        discount_fn: Callable[[int], float] = lambda i: 1/np.log2(i+1)) -> float:
    """
    Generalized NDCG with customizable gain and discount functions.

    Args:
        relevance: Relevance scores in ranked order
        k: Cutoff position
        gain_fn: Function mapping relevance to gain
        discount_fn: Function mapping position to discount

    Returns:
        Generalized NDCG value
    """
    rel = np.array(relevance[:k], dtype=float)
    n = len(rel)
    if n == 0:
        return 0.0

    # Compute gains and discounts
    gains = np.array([gain_fn(r) for r in rel])
    discounts = np.array([discount_fn(i+1) for i in range(n)])
    dcg = np.sum(gains * discounts)

    # Ideal ranking
    ideal_gains = np.sort(gains)[::-1]
    idcg = np.sum(ideal_gains * discounts)

    return dcg / idcg if idcg > 0 else 0.0

# Compare different gain functions
relevance = [3, 1, 2, 0, 2, 1]

print("Impact of Gain Function on NDCG")
print("=" * 50)
print(f"Relevance scores: {relevance}")
print()

# Linear gain
linear_ndcg = generalized_ndcg(relevance, k=6, gain_fn=lambda r: r)
print(f"Linear gain (g(r) = r): NDCG = {linear_ndcg:.4f}")

# Exponential gain (standard)
exp_ndcg = generalized_ndcg(relevance, k=6, gain_fn=lambda r: 2**r - 1)
print(f"Exponential (g(r) = 2^r-1): NDCG = {exp_ndcg:.4f}")

# Custom: Square gain (emphasizes high relevance even more)
square_ndcg = generalized_ndcg(relevance, k=6, gain_fn=lambda r: r**2)
print(f"Square gain (g(r) = r^2): NDCG = {square_ndcg:.4f}")

# Compare different discount functions
print("\nImpact of Discount Function on NDCG")
print("=" * 50)

# Standard log discount
log_ndcg = generalized_ndcg(relevance, k=6, discount_fn=lambda i: 1/np.log2(i+1))
print(f"Log discount (d(i) = 1/log₂(i+1)): NDCG = {log_ndcg:.4f}")

# Linear discount (steeper)
linear_disc_ndcg = generalized_ndcg(relevance, k=6, discount_fn=lambda i: 1/i)
print(f"Linear discount (d(i) = 1/i): NDCG = {linear_disc_ndcg:.4f}")

# Gentle discount
gentle_ndcg = generalized_ndcg(relevance, k=6, discount_fn=lambda i: 1/np.log(i+np.e))
print(f"Gentle discount (d(i) = 1/ln(i+e)): NDCG = {gentle_ndcg:.4f}")
```

Implementing NDCG in production requires attention to several practical issues: the choice of gain function and logarithm base, the IDCG = 0 edge case, whether IDCG comes from the full corpus or the returned list, and aggregating scores across many queries. The evaluator below packages these concerns into a reusable class:
```python
import numpy as np
from typing import List, Optional
from dataclasses import dataclass

@dataclass
class NDCGResult:
    """Container for NDCG computation results."""
    ndcg: float
    dcg: float
    idcg: float
    k_effective: int

class NDCGEvaluator:
    """Production-grade NDCG evaluator."""

    def __init__(self, use_exponential_gain: bool = True, log_base: float = 2.0):
        """
        Initialize NDCG evaluator.

        Args:
            use_exponential_gain: Use 2^rel - 1 if True, else rel
            log_base: Base of logarithm for position discount
        """
        self.use_exp_gain = use_exponential_gain
        self.log_base = log_base

    def _compute_gains(self, rel: np.ndarray) -> np.ndarray:
        """Compute gains from relevance scores."""
        if self.use_exp_gain:
            return (2 ** rel) - 1
        return rel.copy()

    def _compute_discounts(self, n: int) -> np.ndarray:
        """Compute position discounts for n positions."""
        positions = np.arange(1, n + 1)
        return 1.0 / (np.log(positions + 1) / np.log(self.log_base))

    def ndcg_at_k(self,
                  relevance: List[float],
                  k: Optional[int] = None,
                  ideal_relevance: Optional[List[float]] = None) -> NDCGResult:
        """
        Compute NDCG@k with full diagnostics.

        Args:
            relevance: Relevance scores in ranked order
            k: Cutoff position (None = use all)
            ideal_relevance: Full corpus relevances for IDCG
                             (None = use items in relevance list)

        Returns:
            NDCGResult with NDCG, DCG, IDCG, and effective k
        """
        rel = np.array(relevance, dtype=float)
        if k is not None:
            rel = rel[:k]
        n = len(rel)

        if n == 0:
            return NDCGResult(ndcg=0.0, dcg=0.0, idcg=0.0, k_effective=0)

        # Compute DCG
        gains = self._compute_gains(rel)
        discounts = self._compute_discounts(n)
        dcg = np.sum(gains * discounts)

        # Compute IDCG
        if ideal_relevance is not None:
            ideal_rel = np.array(ideal_relevance, dtype=float)
            ideal_gains = np.sort(self._compute_gains(ideal_rel))[::-1][:n]
        else:
            ideal_gains = np.sort(gains)[::-1]

        ideal_discounts = self._compute_discounts(len(ideal_gains))
        idcg = np.sum(ideal_gains * ideal_discounts)

        # Compute NDCG
        ndcg = dcg / idcg if idcg > 0 else 0.0

        return NDCGResult(
            ndcg=float(ndcg),
            dcg=float(dcg),
            idcg=float(idcg),
            k_effective=n
        )

    def mean_ndcg(self,
                  query_relevances: List[List[float]],
                  k: Optional[int] = None) -> dict:
        """
        Compute mean NDCG across multiple queries.

        Args:
            query_relevances: List of relevance lists (one per query)
            k: Cutoff position

        Returns:
            Dict with mean NDCG and per-query values
        """
        results = [self.ndcg_at_k(rel, k) for rel in query_relevances]
        ndcgs = [r.ndcg for r in results]

        return {
            'mean_ndcg': np.mean(ndcgs),
            'std_ndcg': np.std(ndcgs),
            'per_query_ndcg': ndcgs,
            'num_queries': len(ndcgs)
        }

# Demo
evaluator = NDCGEvaluator(use_exponential_gain=True, log_base=2.0)

# Single query evaluation
rel = [3, 2, 3, 0, 1, 2, 0, 0, 1, 0]
result = evaluator.ndcg_at_k(rel, k=10)

print("Single Query NDCG Evaluation")
print("=" * 50)
print(f"Relevances: {rel}")
print(f"DCG@10: {result.dcg:.4f}")
print(f"IDCG@10: {result.idcg:.4f}")
print(f"NDCG@10: {result.ndcg:.4f}")

# Multiple queries
queries = [
    [3, 2, 1, 0, 0],
    [0, 0, 3, 2, 1],
    [5, 4, 3, 2, 1],
]

mean_result = evaluator.mean_ndcg(queries, k=5)
print(f"\nMean NDCG@5: {mean_result['mean_ndcg']:.4f} ± {mean_result['std_ndcg']:.4f}")
```

We have established NDCG as the standard metric for graded ranking evaluation. Let's consolidate our understanding:
Mathematical Summary:
$$\text{DCG@}k = \sum_{i=1}^{k} \frac{\text{gain}(\text{rel}_i)}{\log_2(i + 1)}$$
$$\text{NDCG@}k = \frac{\text{DCG@}k}{\text{IDCG@}k}$$
Standard choices: the exponential gain $\text{gain}(\text{rel}) = 2^{\text{rel}} - 1$, the logarithmic discount $1/\log_2(i + 1)$, and IDCG computed from the items sorted in decreasing order of relevance (drawn from the full corpus when you also want to penalize poor recall).
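For example, with these choices a returned list with relevances $[3, 0, 2]$ gives:

$$\text{DCG@}3 = \frac{2^3 - 1}{\log_2 2} + \frac{2^0 - 1}{\log_2 3} + \frac{2^2 - 1}{\log_2 4} = 7 + 0 + 1.5 = 8.5$$

$$\text{IDCG@}3 = 7 + \frac{3}{\log_2 3} + 0 \approx 8.89, \qquad \text{NDCG@}3 \approx \frac{8.5}{8.89} \approx 0.956$$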
What's Next:
With NDCG established for graded relevance, we turn to Mean Reciprocal Rank (MRR)—a simpler metric optimized for the common case where only the rank of the first relevant item matters, such as question answering and navigational search.
You now understand NDCG as the standard metric for graded ranking evaluation. This metric extends beyond binary relevance to capture degrees of relevance while respecting the crucial importance of position in item rankings.