The metrics we've studied so far—Precision@k, Recall@k, and MAP—treat relevance as a binary property: an item is either relevant or not. But in practice, relevance is rarely black and white.
Consider rating a movie recommendation. Binary metrics cannot distinguish between placing a 5-star movie versus a 3-star movie at position 1, yet user satisfaction differs dramatically. Similarly, in web search, a perfect answer (one that directly contains what the user sought) differs qualitatively from a partially relevant page (one that contains only some useful information).
Normalized Discounted Cumulative Gain (NDCG) addresses this limitation with two key innovations: graded gains that reward highly relevant items more than marginally relevant ones, and a position discount that weights items near the top of the list more heavily.
NDCG has become the standard metric for recommendation systems, graded search evaluation, and any ranking task where relevance admits degrees.
By the end of this page, you will understand NDCG from first principles—from Cumulative Gain through Discounted Cumulative Gain to the normalized form. You will master the mathematical foundations, implementation details, variants, and practical considerations that make NDCG the industry standard for graded ranking evaluation.
We begin with the simplest aggregation of graded relevance: Cumulative Gain (CG).
Definition:
Given a ranked list of items with relevance scores $\text{rel}_1, \text{rel}_2, \ldots, \text{rel}_n$, the Cumulative Gain at position $k$ is simply the sum of relevance scores up to that position:
$$\text{CG@}k = \sum_{i=1}^{k} \text{rel}_i$$
Properties: CG@k is non-negative and non-decreasing in $k$; it uses the graded relevance values directly; and it is completely order-independent within the top $k$, since any permutation of those items yields the same sum.
The Critical Limitation:
The order-independence is precisely CG's weakness. Consider two rankings:
| Rank | Ranking A (relevance) | Ranking B (relevance) |
|---|---|---|
| 1 | 3 | 0 |
| 2 | 2 | 0 |
| 3 | 0 | 2 |
| 4 | 0 | 3 |
Both have CG@4 = 5, but Ranking A is clearly superior for users who see fewer results!
```python
import numpy as np

def cumulative_gain(relevance_scores: list, k: int = None) -> np.ndarray:
    """
    Compute Cumulative Gain at each position.

    Args:
        relevance_scores: Graded relevance scores in ranked order
        k: Maximum position to consider (None = all)

    Returns:
        Array of CG values at each position
    """
    rel = np.array(relevance_scores)
    if k is not None:
        rel = rel[:k]
    return np.cumsum(rel)

# Example: CG fails to distinguish ranking quality
ranking_a = [3, 2, 0, 0]  # Best items first
ranking_b = [0, 0, 2, 3]  # Best items last

cg_a = cumulative_gain(ranking_a)
cg_b = cumulative_gain(ranking_b)

print("Cumulative Gain Comparison")
print("=" * 45)
print(f"{'Position':<10} {'CG (Ranking A)':<18} {'CG (Ranking B)':<18}")
print("-" * 45)
for i in range(4):
    print(f"{i+1:<10} {cg_a[i]:<18} {cg_b[i]:<18}")

print(f"\nFinal CG@4: A = {cg_a[-1]}, B = {cg_b[-1]}")
print("Despite identical final CG, Ranking A is clearly superior!")
print("CG cannot capture the value of placing good items early.")
```

CG is rarely used as a standalone metric. Its purpose is pedagogical—it establishes the baseline of summing relevance, upon which we build the position-discounting mechanism of DCG.
Discounted Cumulative Gain (DCG) introduces a position-based discount that reduces the contribution of items at lower ranks. The intuition: users are less likely to see, examine, or benefit from items further down the list.
Two Common Formulations:
Formulation 1 (Original, Järvelin & Kekäläinen 2002):
$$\text{DCG@}k = \text{rel}_1 + \sum_{i=2}^{k} \frac{\text{rel}_i}{\log_2(i)}$$
The first position is not discounted; subsequent positions are discounted by $\log_2(i)$.
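A minimal sketch of how Formulation 1 can be computed (note that because $\log_2 2 = 1$, position 2 is also effectively undiscounted):

```python
import numpy as np

def dcg_original(relevance_scores: list, k: int = None) -> float:
    """DCG in the original Järvelin & Kekäläinen form:
    rel_1 + sum_{i>=2} rel_i / log2(i), with raw (linear) gains."""
    rel = np.array(relevance_scores, dtype=float)
    if k is not None:
        rel = rel[:k]
    if len(rel) == 0:
        return 0.0
    positions = np.arange(1, len(rel) + 1)
    discounts = np.ones(len(rel))
    discounts[1:] = 1.0 / np.log2(positions[1:])  # positions 2, 3, ...; log2(2) = 1
    return float(np.sum(rel * discounts))

print(f"Original-form DCG@4 for [3, 2, 0, 0]: {dcg_original([3, 2, 0, 0]):.4f}")  # 3 + 2 = 5.0
```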
Formulation 2 (Industry Standard, stronger emphasis on top positions):
$$\text{DCG@}k = \sum_{i=1}^{k} \frac{2^{\text{rel}_i} - 1}{\log_2(i + 1)}$$
This formulation applies the discount $1/\log_2(i+1)$ from position 1 onward (where the factor is exactly 1) and replaces the raw relevance with the exponential gain $2^{\text{rel}_i} - 1$, which amplifies the contribution of highly relevant items.
The Exponential Gain (2^rel - 1):
With the exponential numerator, a relevance of 0 yields a gain of 0, relevance 1 yields 1, relevance 2 yields 3, relevance 3 yields 7, and relevance 4 yields 15.
This creates a non-linear relationship where highly relevant items contribute disproportionately more than marginally relevant ones.
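To make "disproportionately more" concrete, compare how much a higher-grade item outweighs a grade-1 item under each gain:

```python
# Ratio of a grade-g item's gain to a grade-1 item's gain, linear vs. exponential
for g in [1, 2, 3, 4]:
    linear_ratio = g / 1
    exp_ratio = (2**g - 1) / (2**1 - 1)
    print(f"grade {g} vs grade 1:  linear {linear_ratio:.0f}x,  exponential {exp_ratio:.0f}x")
```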
| Position i | log₂(i+1) | Discount Factor 1/log₂(i+1) |
|---|---|---|
| 1 | 1.00 | 1.000 |
| 2 | 1.58 | 0.631 |
| 3 | 2.00 | 0.500 |
| 4 | 2.32 | 0.431 |
| 5 | 2.58 | 0.387 |
| 10 | 3.46 | 0.289 |
| 20 | 4.39 | 0.228 |
| 50 | 5.67 | 0.176 |
| 100 | 6.66 | 0.150 |
Interpreting the Discount:
The logarithmic discount means that position 2 contributes about 63% as much as position 1, position 5 about 39%, position 10 about 29%, and even position 100 still about 15%.
This models the empirical observation that user attention drops rapidly with position, but not instantly. The logarithm provides a "soft" decay—gentler than exponential but still significant.
```python
import numpy as np

def dcg_at_k(relevance_scores: list, k: int = None,
             use_exponential_gain: bool = True) -> float:
    """
    Compute Discounted Cumulative Gain at position k.

    Args:
        relevance_scores: Graded relevance scores in ranked order
        k: Cutoff position (None = use all)
        use_exponential_gain: If True, use (2^rel - 1); if False, use rel

    Returns:
        DCG@k value
    """
    rel = np.array(relevance_scores, dtype=float)
    if k is not None:
        rel = rel[:k]
    n = len(rel)
    if n == 0:
        return 0.0

    # Position discount: 1 / log2(i + 1) for i = 1, 2, ..., n
    positions = np.arange(1, n + 1)
    discounts = 1.0 / np.log2(positions + 1)

    # Gain: either exponential or linear
    if use_exponential_gain:
        gains = (2 ** rel) - 1
    else:
        gains = rel

    return np.sum(gains * discounts)

def dcg_at_all_k(relevance_scores: list, use_exponential_gain: bool = True) -> np.ndarray:
    """Compute DCG at all cutoffs from 1 to n."""
    rel = np.array(relevance_scores, dtype=float)
    n = len(rel)
    positions = np.arange(1, n + 1)
    discounts = 1.0 / np.log2(positions + 1)
    if use_exponential_gain:
        gains = (2 ** rel) - 1
    else:
        gains = rel
    return np.cumsum(gains * discounts)

# Compare the two rankings from before
ranking_a = [3, 2, 0, 0]  # Best items first
ranking_b = [0, 0, 2, 3]  # Best items last

dcg_a = dcg_at_all_k(ranking_a)
dcg_b = dcg_at_all_k(ranking_b)

print("DCG Comparison (with exponential gain)")
print("=" * 55)
print(f"{'Position':<10} {'DCG (Ranking A)':<20} {'DCG (Ranking B)':<20}")
print("-" * 55)
for i in range(4):
    print(f"{i+1:<10} {dcg_a[i]:<20.4f} {dcg_b[i]:<20.4f}")

print(f"\nDCG@4: Ranking A = {dcg_a[-1]:.4f}, Ranking B = {dcg_b[-1]:.4f}")
print(f"Improvement of A over B: {100*(dcg_a[-1] - dcg_b[-1])/dcg_b[-1]:.1f}%")
print("\nNow DCG successfully captures that Ranking A is superior!")

# Show the contribution breakdown
print("\n" + "=" * 60)
print("Contribution Breakdown for Ranking A (rel = [3, 2, 0, 0]):")
print("-" * 60)
for i, rel in enumerate(ranking_a):
    pos = i + 1
    discount = 1 / np.log2(pos + 1)
    gain = (2 ** rel) - 1
    contribution = gain * discount
    print(f"Position {pos}: gain = 2^{rel}-1 = {gain}, "
          f"discount = 1/log₂({pos}+1) = {discount:.4f}, "
          f"contribution = {contribution:.4f}")
```

DCG's absolute values depend on the number and levels of relevant items, making it hard to compare across queries. A query with five highly relevant items will have higher DCG than a query with two marginally relevant items, regardless of ranking quality.
Normalization via the Ideal Ranking:
NDCG normalizes DCG by the Ideal DCG (IDCG)—the DCG achieved by the perfect ranking that places all items in decreasing order of relevance:
$$\text{NDCG@}k = \frac{\text{DCG@}k}{\text{IDCG@}k}$$
where IDCG@k is the DCG@k of the ideal ranking (items sorted by relevance, descending).
Properties of NDCG: it is bounded in $[0, 1]$; it equals 1 exactly when items appear in non-increasing order of relevance (relative to the chosen ideal ranking); and, thanks to the normalization, it can be averaged across queries with different numbers and grades of relevant items.
Edge Case: IDCG = 0
If no items are relevant (all relevance = 0), IDCG@k = 0, making NDCG undefined. Common conventions: define NDCG as 0 (treat the query as a failure), define it as 1 (the system could not have done better), or exclude such queries from the average entirely.
Choose based on your application semantics and document the choice.
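A minimal sketch of making the convention explicit; the `zero_idcg_value` parameter is illustrative, not a standard API:

```python
import numpy as np

def ndcg_with_zero_idcg_convention(relevance: list, k: int = None,
                                   zero_idcg_value: float = 0.0) -> float:
    """NDCG@k that returns a configurable value when IDCG is zero.

    zero_idcg_value: 0.0 treats the query as a failure, 1.0 treats it as
    trivially perfect, float('nan') lets np.nanmean drop it from averages.
    """
    rel = np.array(relevance if k is None else relevance[:k], dtype=float)
    if len(rel) == 0:
        return zero_idcg_value
    discounts = 1.0 / np.log2(np.arange(2, len(rel) + 2))
    gains = 2 ** rel - 1
    idcg = np.sum(np.sort(gains)[::-1] * discounts)
    if idcg == 0:
        return zero_idcg_value
    return float(np.sum(gains * discounts) / idcg)

no_relevant = [0, 0, 0, 0]
print(ndcg_with_zero_idcg_convention(no_relevant, zero_idcg_value=0.0))           # 0.0
print(ndcg_with_zero_idcg_convention(no_relevant, zero_idcg_value=float('nan')))  # nan
```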
```python
import numpy as np
from typing import Optional

def ndcg_at_k(relevance_scores: list, k: int = None,
              use_exponential_gain: bool = True) -> float:
    """
    Compute NDCG@k (Normalized Discounted Cumulative Gain).

    Args:
        relevance_scores: Graded relevance in ranked order
        k: Cutoff position (None = use all)
        use_exponential_gain: Use (2^rel - 1) if True, else rel

    Returns:
        NDCG@k value in [0, 1]
    """
    rel = np.array(relevance_scores, dtype=float)
    if k is not None:
        rel = rel[:k]
    n = len(rel)
    if n == 0:
        return 0.0

    # Compute DCG of the given ranking
    positions = np.arange(1, n + 1)
    discounts = 1.0 / np.log2(positions + 1)
    if use_exponential_gain:
        gains = (2 ** rel) - 1
    else:
        gains = rel
    dcg = np.sum(gains * discounts)

    # Compute IDCG (ideal ranking: sort gains descending)
    ideal_gains = np.sort(gains)[::-1]
    idcg = np.sum(ideal_gains * discounts)

    # Handle IDCG = 0
    if idcg == 0:
        return 0.0  # No relevant items

    return dcg / idcg

def ndcg_at_all_k(relevance_scores: list, use_exponential_gain: bool = True) -> np.ndarray:
    """Compute NDCG at all cutoffs from 1 to n."""
    n = len(relevance_scores)
    return np.array([ndcg_at_k(relevance_scores, k=i+1,
                               use_exponential_gain=use_exponential_gain)
                     for i in range(n)])

# Example comparisons
print("NDCG Demonstration")
print("=" * 60)

# Perfect ranking
perfect = [3, 3, 2, 1, 0, 0]
ndcg_perfect = ndcg_at_k(perfect)
print(f"Perfect ranking [3,3,2,1,0,0]: NDCG = {ndcg_perfect:.4f}")

# Good but imperfect
good = [3, 2, 3, 1, 0, 0]  # Swapped positions 2 and 3
ndcg_good = ndcg_at_k(good)
print(f"Slightly off [3,2,3,1,0,0]: NDCG = {ndcg_good:.4f}")

# Poor ranking
poor = [0, 0, 1, 2, 3, 3]  # Reversed
ndcg_poor = ndcg_at_k(poor)
print(f"Reversed [0,0,1,2,3,3]: NDCG = {ndcg_poor:.4f}")

# NDCG at different cutoffs
print("\nNDCG at Different Cutoffs:")
print("-" * 40)
relevance = [3, 0, 2, 0, 1, 3, 0, 0, 2, 1]
ndcgs = ndcg_at_all_k(relevance)
for k in [1, 3, 5, 10]:
    print(f"NDCG@{k:2d} = {ndcgs[k-1]:.4f}")

# Comparing two systems
print("\nComparing Two Recommendation Systems:")
print("-" * 50)
# User's true ratings for 8 items
true_ratings = [5, 4, 3, 2, 1, 3, 0, 4]  # Indices 0-7

# System A's ranking: item indices in order shown
system_a_order = [0, 7, 1, 3, 5, 2, 4, 6]  # Shows item 0 first, then 7, etc.
system_a_rels = [true_ratings[i] for i in system_a_order]

# System B's ranking
system_b_order = [2, 5, 4, 6, 0, 1, 7, 3]
system_b_rels = [true_ratings[i] for i in system_b_order]

print(f"System A relevances: {system_a_rels}")
print(f"System B relevances: {system_b_rels}")
print(f"NDCG@5 - System A: {ndcg_at_k(system_a_rels, k=5):.4f}, "
      f"System B: {ndcg_at_k(system_b_rels, k=5):.4f}")
print(f"NDCG@8 - System A: {ndcg_at_k(system_a_rels, k=8):.4f}, "
      f"System B: {ndcg_at_k(system_b_rels, k=8):.4f}")
```

Normalization serves two purposes: (1) It bounds NDCG to [0,1], making values interpretable regardless of the relevance scale or number of relevant items. (2) It enables fair comparison across queries—a query with 10 highly relevant items and a query with 2 marginally relevant items can be aggregated into a single mean NDCG.
Computing IDCG@k correctly requires care. There are two common approaches, yielding slightly different semantics:
Approach 1: IDCG from the full corpus (Common in academia)
Sort all relevant items in the corpus by relevance, take the top-k, compute DCG. This is the maximum possible DCG@k if we had perfect knowledge and could select and rank any items.
Approach 2: IDCG from the returned list (Common in industry)
Sort only the items in the returned ranking by relevance, compute DCG@k. This measures how well you ranked within your own result set.
Key Difference:
Suppose you return 10 items but there are 100 relevant items in the corpus. Approach 2 can give NDCG = 1 even if you missed highly relevant items, as long as you correctly ordered what you returned; Approach 1 compares against the 10 best items in the entire corpus, so it penalizes missing relevant items.
Recommendation:
For evaluating retrieval/recommendation systems, Approach 1 is more rigorous as it penalizes poor recall. For evaluating pure ranking (where the item set is fixed), Approach 2 is appropriate.
```python
import numpy as np

def ndcg_approach_1(relevance_returned: list, all_relevances: list, k: int) -> float:
    """
    NDCG with IDCG computed from the full corpus.
    Penalizes missing highly relevant items.
    """
    rel = np.array(relevance_returned[:k], dtype=float)
    all_rel = np.array(all_relevances, dtype=float)
    n = len(rel)
    positions = np.arange(1, n + 1)
    discounts = 1.0 / np.log2(positions + 1)
    gains = (2 ** rel) - 1
    dcg = np.sum(gains * discounts)

    # IDCG from best k items in corpus
    ideal_gains = np.sort((2 ** all_rel) - 1)[::-1][:k]
    ideal_positions = np.arange(1, len(ideal_gains) + 1)
    ideal_discounts = 1.0 / np.log2(ideal_positions + 1)
    idcg = np.sum(ideal_gains * ideal_discounts)

    return dcg / idcg if idcg > 0 else 0.0

def ndcg_approach_2(relevance_returned: list, k: int) -> float:
    """
    NDCG with IDCG computed from the returned list only.
    Measures ranking quality of returned items.
    """
    rel = np.array(relevance_returned[:k], dtype=float)
    n = len(rel)
    positions = np.arange(1, n + 1)
    discounts = 1.0 / np.log2(positions + 1)
    gains = (2 ** rel) - 1
    dcg = np.sum(gains * discounts)

    # IDCG from returned list only
    ideal_gains = np.sort(gains)[::-1]
    idcg = np.sum(ideal_gains * discounts)

    return dcg / idcg if idcg > 0 else 0.0

# Scenario: System returns 5 items, but corpus has more relevant items
# True corpus relevances (indices 0-9)
corpus_relevances = [5, 5, 4, 3, 3, 2, 2, 1, 1, 0]  # Some highly relevant

# System returns items at indices [7, 2, 9, 5, 4] (missed items 0, 1)
returned_indices = [7, 2, 9, 5, 4]
returned_relevances = [corpus_relevances[i] for i in returned_indices]
# returned_relevances = [1, 4, 0, 2, 3]

k = 5
print("IDCG Approach Comparison")
print("=" * 60)
print(f"Corpus has items with relevances: {corpus_relevances}")
print(f"System returned items with relevances: {returned_relevances}")
print(f"(Missed the two highest-relevance items with rel=5)")
print()

ndcg_1 = ndcg_approach_1(returned_relevances, corpus_relevances, k)
ndcg_2 = ndcg_approach_2(returned_relevances, k)

print(f"Approach 1 (IDCG from corpus):   NDCG@{k} = {ndcg_1:.4f}")
print(f"Approach 2 (IDCG from returned): NDCG@{k} = {ndcg_2:.4f}")
print()
print("Approach 1 penalizes missing the highly relevant items.")
print("Approach 2 only measures how well we ranked what we returned.")
```

NDCG values are not comparable if computed with different IDCG approaches. Always specify whether IDCG is computed from the full corpus or only the returned items. When using libraries like scikit-learn, check the documentation to understand which approach is implemented.
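For reference, a minimal example with scikit-learn's `ndcg_score` (assuming scikit-learn is installed). It takes true relevances and predicted scores per query rather than a pre-sorted relevance list; check the documentation of your version for its exact gain and normalization conventions before comparing its numbers with the implementations above:

```python
import numpy as np
from sklearn.metrics import ndcg_score  # requires scikit-learn

# One query: true graded relevance of four candidate items, and the model's scores.
y_true = np.asarray([[3, 2, 0, 1]])           # relevance judgments
y_score = np.asarray([[0.9, 0.7, 0.8, 0.1]])  # model would show items in order 0, 2, 1, 3

print(f"scikit-learn ndcg_score@3 = {ndcg_score(y_true, y_score, k=3):.4f}")
```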
NDCG relates to other ranking metrics in important ways:
NDCG vs. MAP:
| Aspect | NDCG | MAP |
|---|---|---|
| Relevance | Graded (0, 1, 2, ...) | Binary (relevant or not) |
| Position weighting | Explicit log discount | Implicit via precision at relevant positions |
| Normalization | By ideal ranking | By total relevant items |
| When to use | Graded relevance available | Only binary judgments |
NDCG with Binary Relevance:
When relevance is binary (0 or 1), the exponential and linear gains coincide ($2^1 - 1 = 1$ and $2^0 - 1 = 0$), so NDCG reduces to a log-discounted, normalized count of relevant items. It then behaves similarly to, but not identically to, MAP: both reward placing relevant items early, but they weight positions differently and need not agree on which of two rankings is better, as the sketch below illustrates.
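A small sketch, using helper functions that mirror the definitions above, showing that identical binary judgments still yield different values under the two metrics:

```python
import numpy as np

def ndcg_binary(relevance):
    """NDCG over the full list; with rel in {0, 1}, 2^rel - 1 equals rel."""
    rel = np.array(relevance, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, len(rel) + 2))
    dcg = np.sum(rel * discounts)
    idcg = np.sum(np.sort(rel)[::-1] * discounts)
    return dcg / idcg if idcg > 0 else 0.0

def average_precision(relevance):
    """AP over the full list, assuming all relevant items appear in it."""
    rel = np.array(relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    precisions = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    return np.sum(precisions * rel) / rel.sum()

ranking = [1, 0, 0, 1, 0, 1]
print(f"Binary ranking {ranking}: NDCG = {ndcg_binary(ranking):.4f}, "
      f"AP = {average_precision(ranking):.4f}")
# The two metrics weight positions differently, so their values need not agree.
```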
NDCG vs. Precision@k:
| Aspect | NDCG@k | Precision@k |
|---|---|---|
| Position within top-k | Matters (log discount) | Doesn't matter (all equal) |
| Relevance grades | Utilized | Collapsed to binary |
| Use case | Fine-grained ranking evaluation | Simple binary evaluation |
Connection to Expected Utility:
NDCG can be interpreted as a normalized expected utility under a user model where the probability of examining position $i$ is proportional to $1/\log_2(i+1)$ and the utility of an examined item is its gain: DCG is then proportional to the expected utility of the ranking, and dividing by IDCG expresses it as a fraction of the best achievable utility.
```python
import numpy as np

def all_metrics_comparison(relevance: list, k: int):
    """Compare NDCG, MAP, and P@k for a ranking."""
    rel = np.array(relevance[:k], dtype=float)
    n = len(rel)

    # NDCG (graded)
    positions = np.arange(1, n + 1)
    discounts = 1.0 / np.log2(positions + 1)
    gains = (2 ** rel) - 1
    dcg = np.sum(gains * discounts)
    ideal_gains = np.sort(gains)[::-1]
    idcg = np.sum(ideal_gains * discounts)
    ndcg = dcg / idcg if idcg > 0 else 0.0

    # Binary conversion for P@k and MAP
    binary_rel = (rel > 0).astype(float)
    total_relevant = np.sum(binary_rel)

    # P@k
    precision_at_k = np.mean(binary_rel)

    # AP (assuming all relevant items are in the list)
    if total_relevant > 0:
        cumsum = np.cumsum(binary_rel)
        precisions = cumsum / np.arange(1, n + 1)
        ap = np.sum(precisions * binary_rel) / total_relevant
    else:
        ap = 0.0

    return {
        'NDCG': ndcg,
        'P@k': precision_at_k,
        'AP': ap
    }

# Compare rankings with same binary relevance but different grades
print("Metric Comparison: Same Binary Relevance, Different Grades")
print("=" * 70)

# Ranking A: Highly relevant items in relevant positions
ranking_a = [5, 4, 0, 0, 3]  # Binary: [1, 1, 0, 0, 1]

# Ranking B: Marginally relevant items in same positions
ranking_b = [1, 1, 0, 0, 1]  # Binary: [1, 1, 0, 0, 1]

metrics_a = all_metrics_comparison(ranking_a, k=5)
metrics_b = all_metrics_comparison(ranking_b, k=5)

print(f"Ranking A (high grades): {ranking_a}")
print(f"Ranking B (low grades): {ranking_b}")
print(f"Both have same binary pattern: [1, 1, 0, 0, 1]")
print()

print(f"{'Metric':<12} {'Ranking A':<15} {'Ranking B':<15} {'Difference'}")
print("-" * 55)
for metric in ['NDCG', 'P@k', 'AP']:
    a_val = metrics_a[metric]
    b_val = metrics_b[metric]
    print(f"{metric:<12} {a_val:<15.4f} {b_val:<15.4f} {a_val - b_val:.4f}")

print()
print("NDCG distinguishes between high and low grades.")
print("P@k and AP (binary metrics) cannot see the difference!")
```

Use NDCG when you have graded relevance judgments (star ratings, annotation levels). Use MAP when you only have binary relevance. If you have graded relevance but want to compare with systems evaluated on binary, report both—you can always binarize grades for MAP computation.
Several NDCG variants address specific needs:
1. Alternative Gain Functions:
Beyond linear and exponential, custom gain functions can model domain-specific utility: for instance, a squared gain $g(r) = r^2$ that emphasizes high grades more than linear gain but less than exponential gain, or a gain proportional to an item's estimated revenue or expected watch time.
2. Alternative Discount Functions:
The logarithmic discount can be replaced: a steeper $1/i$ discount models users who rarely look past the first few results, while a gentler $1/\ln(i + e)$ discount suits interfaces where users routinely scan deep into the list.
3. ERR (Expected Reciprocal Rank):
A cascade-model variant where users stop after finding a satisfying result. Lower-ranked items only contribute if higher-ranked items didn't satisfy the user:
$$\text{ERR} = \sum_{r=1}^{n} \frac{1}{r} \prod_{i=1}^{r-1}(1 - R_i) \cdot R_r$$
where $R_i$ is the probability of satisfaction at position $i$.
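A minimal sketch of ERR. The mapping from grade to satisfaction probability, $R_i = (2^{\text{rel}_i} - 1)/2^{\text{rel}_{\max}}$, is one common choice and is an assumption here rather than part of the formula above:

```python
import numpy as np

def expected_reciprocal_rank(relevance: list, max_grade: float = None) -> float:
    """ERR under a cascade model: the user scans down and stops once satisfied."""
    rel = np.array(relevance, dtype=float)
    if max_grade is None:
        max_grade = rel.max() if len(rel) else 0.0
    if max_grade <= 0:
        return 0.0
    # Assumed grade-to-probability mapping: R_i = (2^rel_i - 1) / 2^max_grade
    probs = (2 ** rel - 1) / (2 ** max_grade)
    err = 0.0
    prob_reach = 1.0  # probability the user is still unsatisfied at this position
    for r, p in enumerate(probs, start=1):
        err += prob_reach * p / r
        prob_reach *= (1 - p)
    return err

print(f"ERR for [3, 2, 0, 1]: {expected_reciprocal_rank([3, 2, 0, 1], max_grade=3):.4f}")
print(f"ERR for [0, 1, 2, 3]: {expected_reciprocal_rank([0, 1, 2, 3], max_grade=3):.4f}")
```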
4. Intent-Aware NDCG:
For queries with multiple intents, compute NDCG per intent and aggregate:
$$\text{IA-NDCG} = \sum_{\text{intent } j} P(j) \cdot \text{NDCG}_j$$
where $P(j)$ is the probability of intent $j$.
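A short sketch of IA-NDCG, reusing the per-intent NDCG defined earlier; the intent probabilities and per-intent judgments below are hypothetical:

```python
import numpy as np

def ndcg_at_k(relevance, k=None):
    """NDCG@k with exponential gain and log2 discount, as defined earlier."""
    rel = np.array(relevance if k is None else relevance[:k], dtype=float)
    if len(rel) == 0:
        return 0.0
    discounts = 1.0 / np.log2(np.arange(2, len(rel) + 2))
    gains = 2 ** rel - 1
    idcg = np.sum(np.sort(gains)[::-1] * discounts)
    return float(np.sum(gains * discounts) / idcg) if idcg > 0 else 0.0

def intent_aware_ndcg(intent_probs, per_intent_relevance, k=None):
    """IA-NDCG: probability-weighted sum of per-intent NDCG values."""
    return sum(p * ndcg_at_k(rel, k) for p, rel in zip(intent_probs, per_intent_relevance))

# Hypothetical query with two intents: 70% of users mean intent 1, 30% intent 2.
# Each row gives the relevance of the same ranked items under one intent.
intent_probs = [0.7, 0.3]
per_intent_relevance = [
    [3, 2, 0, 1, 0],  # judgments under intent 1
    [0, 1, 3, 0, 2],  # judgments under intent 2
]
print(f"IA-NDCG@5 = {intent_aware_ndcg(intent_probs, per_intent_relevance, k=5):.4f}")
```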
```python
import numpy as np
from typing import Callable

def generalized_ndcg(
        relevance: list,
        k: int,
        gain_fn: Callable[[float], float] = lambda r: 2**r - 1,
        discount_fn: Callable[[int], float] = lambda i: 1/np.log2(i+1)) -> float:
    """
    Generalized NDCG with customizable gain and discount functions.

    Args:
        relevance: Relevance scores in ranked order
        k: Cutoff position
        gain_fn: Function mapping relevance to gain
        discount_fn: Function mapping position to discount

    Returns:
        Generalized NDCG value
    """
    rel = np.array(relevance[:k], dtype=float)
    n = len(rel)
    if n == 0:
        return 0.0

    # Compute gains and discounts
    gains = np.array([gain_fn(r) for r in rel])
    discounts = np.array([discount_fn(i+1) for i in range(n)])
    dcg = np.sum(gains * discounts)

    # Ideal ranking
    ideal_gains = np.sort(gains)[::-1]
    idcg = np.sum(ideal_gains * discounts)

    return dcg / idcg if idcg > 0 else 0.0

# Compare different gain functions
relevance = [3, 1, 2, 0, 2, 1]

print("Impact of Gain Function on NDCG")
print("=" * 50)
print(f"Relevance scores: {relevance}")
print()

# Linear gain
linear_ndcg = generalized_ndcg(relevance, k=6, gain_fn=lambda r: r)
print(f"Linear gain (g(r) = r): NDCG = {linear_ndcg:.4f}")

# Exponential gain (standard)
exp_ndcg = generalized_ndcg(relevance, k=6, gain_fn=lambda r: 2**r - 1)
print(f"Exponential (g(r) = 2^r-1): NDCG = {exp_ndcg:.4f}")

# Custom: Square gain (emphasizes high relevance even more)
square_ndcg = generalized_ndcg(relevance, k=6, gain_fn=lambda r: r**2)
print(f"Square gain (g(r) = r^2): NDCG = {square_ndcg:.4f}")

# Compare different discount functions
print("\nImpact of Discount Function on NDCG")
print("=" * 50)

# Standard log discount
log_ndcg = generalized_ndcg(relevance, k=6, discount_fn=lambda i: 1/np.log2(i+1))
print(f"Log discount (d(i) = 1/log₂(i+1)): NDCG = {log_ndcg:.4f}")

# Linear discount (steeper)
linear_disc_ndcg = generalized_ndcg(relevance, k=6, discount_fn=lambda i: 1/i)
print(f"Linear discount (d(i) = 1/i): NDCG = {linear_disc_ndcg:.4f}")

# Gentle discount
gentle_ndcg = generalized_ndcg(relevance, k=6, discount_fn=lambda i: 1/np.log(i+np.e))
print(f"Gentle discount (d(i) = 1/ln(i+e)): NDCG = {gentle_ndcg:.4f}")
```

Implementing NDCG in production requires attention to several practical issues: the choice of gain function and logarithm base, the IDCG = 0 edge case, whether IDCG comes from the full corpus or the returned list, and aggregating scores across many queries. The evaluator below packages these concerns into a reusable class:
```python
import numpy as np
from typing import List, Optional
from dataclasses import dataclass

@dataclass
class NDCGResult:
    """Container for NDCG computation results."""
    ndcg: float
    dcg: float
    idcg: float
    k_effective: int

class NDCGEvaluator:
    """Production-grade NDCG evaluator."""

    def __init__(self, use_exponential_gain: bool = True, log_base: float = 2.0):
        """
        Initialize NDCG evaluator.

        Args:
            use_exponential_gain: Use 2^rel - 1 if True, else rel
            log_base: Base of logarithm for position discount
        """
        self.use_exp_gain = use_exponential_gain
        self.log_base = log_base

    def _compute_gains(self, rel: np.ndarray) -> np.ndarray:
        """Compute gains from relevance scores."""
        if self.use_exp_gain:
            return (2 ** rel) - 1
        return rel.copy()

    def _compute_discounts(self, n: int) -> np.ndarray:
        """Compute position discounts for n positions."""
        positions = np.arange(1, n + 1)
        return 1.0 / (np.log(positions + 1) / np.log(self.log_base))

    def ndcg_at_k(self,
                  relevance: List[float],
                  k: Optional[int] = None,
                  ideal_relevance: Optional[List[float]] = None) -> NDCGResult:
        """
        Compute NDCG@k with full diagnostics.

        Args:
            relevance: Relevance scores in ranked order
            k: Cutoff position (None = use all)
            ideal_relevance: Full corpus relevances for IDCG
                             (None = use items in relevance list)

        Returns:
            NDCGResult with NDCG, DCG, IDCG, and effective k
        """
        rel = np.array(relevance, dtype=float)
        if k is not None:
            rel = rel[:k]
        n = len(rel)

        if n == 0:
            return NDCGResult(ndcg=0.0, dcg=0.0, idcg=0.0, k_effective=0)

        # Compute DCG
        gains = self._compute_gains(rel)
        discounts = self._compute_discounts(n)
        dcg = np.sum(gains * discounts)

        # Compute IDCG
        if ideal_relevance is not None:
            ideal_rel = np.array(ideal_relevance, dtype=float)
            ideal_gains = np.sort(self._compute_gains(ideal_rel))[::-1][:n]
        else:
            ideal_gains = np.sort(gains)[::-1]

        ideal_discounts = self._compute_discounts(len(ideal_gains))
        idcg = np.sum(ideal_gains * ideal_discounts)

        # Compute NDCG
        ndcg = dcg / idcg if idcg > 0 else 0.0

        return NDCGResult(
            ndcg=float(ndcg),
            dcg=float(dcg),
            idcg=float(idcg),
            k_effective=n
        )

    def mean_ndcg(self,
                  query_relevances: List[List[float]],
                  k: Optional[int] = None) -> dict:
        """
        Compute mean NDCG across multiple queries.

        Args:
            query_relevances: List of relevance lists (one per query)
            k: Cutoff position

        Returns:
            Dict with mean NDCG and per-query values
        """
        results = [self.ndcg_at_k(rel, k) for rel in query_relevances]
        ndcgs = [r.ndcg for r in results]

        return {
            'mean_ndcg': np.mean(ndcgs),
            'std_ndcg': np.std(ndcgs),
            'per_query_ndcg': ndcgs,
            'num_queries': len(ndcgs)
        }

# Demo
evaluator = NDCGEvaluator(use_exponential_gain=True, log_base=2.0)

# Single query evaluation
rel = [3, 2, 3, 0, 1, 2, 0, 0, 1, 0]
result = evaluator.ndcg_at_k(rel, k=10)

print("Single Query NDCG Evaluation")
print("=" * 50)
print(f"Relevances: {rel}")
print(f"DCG@10: {result.dcg:.4f}")
print(f"IDCG@10: {result.idcg:.4f}")
print(f"NDCG@10: {result.ndcg:.4f}")

# Multiple queries
queries = [
    [3, 2, 1, 0, 0],
    [0, 0, 3, 2, 1],
    [5, 4, 3, 2, 1],
]

mean_result = evaluator.mean_ndcg(queries, k=5)
print(f"\nMean NDCG@5: {mean_result['mean_ndcg']:.4f} ± {mean_result['std_ndcg']:.4f}")
```

We have established NDCG as the standard metric for graded ranking evaluation. Let's consolidate our understanding:
Mathematical Summary:
$$\text{DCG@}k = \sum_{i=1}^{k} \frac{\text{gain}(\text{rel}_i)}{\log_2(i + 1)}$$
$$\text{NDCG@}k = \frac{\text{DCG@}k}{\text{IDCG@}k}$$
Standard choices: the exponential gain $\text{gain}(\text{rel}) = 2^{\text{rel}} - 1$, the logarithmic discount $1/\log_2(i + 1)$, and IDCG computed from the items sorted in decreasing order of relevance (drawn from the full corpus when you also want to penalize poor recall).
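For example, with these choices a returned list with relevances $[3, 0, 2]$ gives:

$$\text{DCG@}3 = \frac{2^3 - 1}{\log_2 2} + \frac{2^0 - 1}{\log_2 3} + \frac{2^2 - 1}{\log_2 4} = 7 + 0 + 1.5 = 8.5$$

$$\text{IDCG@}3 = 7 + \frac{3}{\log_2 3} + 0 \approx 8.89, \qquad \text{NDCG@}3 \approx \frac{8.5}{8.89} \approx 0.956$$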
What's Next:
With NDCG established for graded relevance, we turn to Mean Reciprocal Rank (MRR)—a simpler metric optimized for the common case where only the rank of the first relevant item matters, such as question answering and navigational search.
You now understand NDCG as the standard metric for graded ranking evaluation. This metric extends beyond binary relevance to capture degrees of relevance while respecting the crucial importance of position in item rankings.