You've built a recommendation system. It trains without errors, makes predictions, and generates ranked lists of items. But is it actually good? How do you know if version 2.0 is better than version 1.0? How do you convince stakeholders that your model improvements translate to business value?
The answer lies in evaluation metrics—quantitative measures that capture different aspects of recommendation quality. But here's the challenge: there is no single metric that captures everything. Optimizing for one metric often comes at the cost of another. Precision and recall trade off. Accuracy and diversity trade off. Short-term engagement and long-term satisfaction trade off.
Mastering evaluation metrics means understanding not just how to compute them, but when each is appropriate and what each actually measures about user experience.
By the end of this page, you will understand the full taxonomy of recommendation metrics—from rating prediction accuracy through ranking quality, coverage, diversity, novelty, and serendipity. You'll learn to choose appropriate metrics for your use case and avoid common evaluation pitfalls that mislead engineering decisions.
Recommendation metrics can be organized into several distinct categories, each capturing different aspects of quality:
1. Prediction Accuracy Metrics
Measure how well predicted ratings match actual ratings.
2. Ranking Metrics
Measure the quality of the ranked list shown to users.
3. Coverage Metrics
Measure how much of the catalog is being recommended.
4. Diversity and Novelty Metrics
Measure variety and discovery in recommendations.
5. Business Metrics
Measure actual business impact.
| Category | Question Answered | Optimization Trade-off | When to Use |
|---|---|---|---|
| Prediction Accuracy | How close are predicted ratings? | May miss ranking quality | Explicit rating prediction tasks |
| Ranking Quality | Are the best items at the top? | May ignore long-tail items | Top-N recommendation, retrieval |
| Coverage | What fraction of items get recommended? | May hurt relevance | Ensuring catalog diversity |
| Diversity/Novelty | Is the list varied and surprising? | May reduce immediate relevance | Discovery, user engagement |
| Business Metrics | What's the bottom-line impact? | May have long feedback loops | Final deployment decisions |
Offline metrics (computed on historical data) often correlate weakly with online business metrics (measured in production). A model with 2% better NDCG might show no improvement in CTR. Always validate significant changes with online A/B tests.
Prediction accuracy metrics quantify how close predicted ratings are to actual ratings. They were foundational in the Netflix Prize era but have become less central as the field has shifted toward ranking.
Root Mean Squared Error (RMSE)
The most common rating prediction metric:
$$\text{RMSE} = \sqrt{\frac{1}{|T|} \sum_{(u,i) \in T} (\hat{r}_{ui} - r_{ui})^2}$$
Where $T$ is the test set of (user, item, rating) tuples.
Properties: penalizes large errors quadratically, is expressed in the same units as the ratings, and lower is better.
Mean Absolute Error (MAE)
$$\text{MAE} = \frac{1}{|T|} \sum_{(u,i) \in T} |\hat{r}_{ui} - r_{ui}|$$
Properties: more robust to outliers than RMSE, and directly interpretable as "predictions are off by X stars on average."
```python
import numpy as np
from typing import List, Tuple
from sklearn.metrics import mean_squared_error, mean_absolute_error


def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """
    Root Mean Squared Error.

    Standard metric for rating prediction accuracy.
    Penalizes large errors more than small ones.

    Args:
        y_true: Ground truth ratings
        y_pred: Predicted ratings

    Returns:
        RMSE value
    """
    return np.sqrt(mean_squared_error(y_true, y_pred))


def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """
    Mean Absolute Error.

    More robust to outliers than RMSE.
    Easier to interpret: "predictions off by X on average".
    """
    return mean_absolute_error(y_true, y_pred)


def rating_prediction_evaluation(
    test_ratings: List[Tuple[int, int, float]],  # (user_id, item_id, rating)
    predictions: dict  # {(user_id, item_id): predicted_rating}
) -> dict:
    """
    Comprehensive rating prediction evaluation.

    Returns multiple metrics for thorough analysis.
    """
    y_true = []
    y_pred = []
    for user_id, item_id, actual_rating in test_ratings:
        if (user_id, item_id) in predictions:
            y_true.append(actual_rating)
            y_pred.append(predictions[(user_id, item_id)])

    y_true = np.array(y_true)
    y_pred = np.array(y_pred)

    # Compute various metrics
    results = {
        'rmse': rmse(y_true, y_pred),
        'mae': mae(y_true, y_pred),
        'n_predictions': len(y_true),
        'coverage': len(y_true) / len(test_ratings),  # Fraction we could predict
    }

    # Stratified analysis: error by rating level
    for rating_value in sorted(set(y_true)):
        mask = y_true == rating_value
        if mask.sum() > 0:
            results[f'rmse_rating_{rating_value}'] = rmse(y_true[mask], y_pred[mask])

    return results


class RatingPredictionEvaluator:
    """
    Comprehensive evaluator for rating prediction models.

    Includes stratified analysis by user activity, item popularity, etc.
    """

    def __init__(
        self,
        user_activity: dict = None,    # user_id -> n_ratings
        item_popularity: dict = None   # item_id -> n_ratings
    ):
        self.user_activity = user_activity or {}
        self.item_popularity = item_popularity or {}

    def evaluate(
        self,
        test_ratings: List[Tuple[int, int, float]],
        predictions: dict
    ) -> dict:
        """Full evaluation with stratification."""
        results = rating_prediction_evaluation(test_ratings, predictions)

        # Stratify by user activity
        active_users = {u for u, count in self.user_activity.items() if count > 50}
        cold_users = {u for u, count in self.user_activity.items() if count < 10}

        # Active user performance
        active_test = [(u, i, r) for u, i, r in test_ratings if u in active_users]
        if active_test:
            y_true = np.array([r for u, i, r in active_test])
            y_pred = np.array([predictions.get((u, i), np.nan) for u, i, r in active_test])
            valid = ~np.isnan(y_pred)
            if valid.sum() > 0:
                results['rmse_active_users'] = rmse(y_true[valid], y_pred[valid])

        # Cold user performance
        cold_test = [(u, i, r) for u, i, r in test_ratings if u in cold_users]
        if cold_test:
            y_true = np.array([r for u, i, r in cold_test])
            y_pred = np.array([predictions.get((u, i), np.nan) for u, i, r in cold_test])
            valid = ~np.isnan(y_pred)
            if valid.sum() > 0:
                results['rmse_cold_users'] = rmse(y_true[valid], y_pred[valid])

        return results
```

Low RMSE doesn't guarantee good rankings. A model could achieve low RMSE by predicting everything around the mean (e.g., always predicting 3.5 stars) while completely failing to distinguish good from bad items. For top-N recommendations, ranking metrics are more appropriate.
For top-N recommendations, we care about whether the recommended items are relevant, not what exact rating they'd receive. This leads to classification-style metrics.
Precision@K
Fraction of recommended items that are relevant:
$$\text{Precision@K} = \frac{|\text{Recommended}_K \cap \text{Relevant}|}{K}$$
Example: If we recommend 10 items and 3 are relevant → Precision@10 = 0.3
Interpretation: "Of the items we showed, what fraction were good?"
Recall@K
Fraction of relevant items that are recommended:
$$\text{Recall@K} = \frac{|\text{Recommended}_K \cap \text{Relevant}|}{|\text{Relevant}|}$$
Example: If user has 15 relevant items and we recommend 10, with 3 being relevant → Recall@10 = 3/15 = 0.2
Interpretation: "Of all good items, what fraction did we find?"
The Precision-Recall Trade-off:
| K | Typical Precision@K | Typical Recall@K | Comment |
|---|---|---|---|
| 1 | Higher | Lower | Very selective, may miss good items |
| 5 | Moderate | Moderate | Common for mobile screens |
| 10 | Moderate | Higher | Common for desktop |
| 50 | Lower | High | Casting wider net |
| 100 | Lowest | Highest | Retrieval stage metric |
Increasing K typically increases recall but decreases precision.
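The trade-off is easy to see on a toy example (hypothetical item IDs): one user with 5 relevant items, scored against a ranked list of 10 recommendations.

```python
# Hypothetical data: 5 relevant items, relevant hits at ranks 1, 3, 6
relevant = {"a", "b", "c", "d", "e"}
recommended = ["a", "x", "b", "y", "z", "c", "u", "v", "w", "q"]

def precision_at_k(recs, rel, k):
    """Fraction of the top-K that is relevant."""
    return len(set(recs[:k]) & rel) / k

def recall_at_k(recs, rel, k):
    """Fraction of all relevant items found in the top-K."""
    return len(set(recs[:k]) & rel) / len(rel)

for k in (1, 3, 5, 10):
    print(f"K={k:2d}  precision={precision_at_k(recommended, relevant, k):.2f}"
          f"  recall={recall_at_k(recommended, relevant, k):.2f}")
```

As K grows from 1 to 10, precision falls from 1.0 to 0.3 while recall rises from 0.2 to 0.6, exactly the pattern in the table above.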
F1@K Score:
Harmonic mean of precision and recall:
$$\text{F1@K} = 2 \cdot \frac{\text{Precision@K} \cdot \text{Recall@K}}{\text{Precision@K} + \text{Recall@K}}$$
Useful when you want a single metric balancing both.
Hit Rate@K:
Fraction of users for whom at least one recommended item is relevant:
$$\text{HitRate@K} = \frac{|\{u \in U : |\text{Recommended}_K(u) \cap \text{Relevant}(u)| \geq 1\}|}{|U|}$$
Often more interpretable than Precision/Recall for stakeholders.
```python
import numpy as np
from typing import List, Dict, Set


def precision_at_k(
    recommended: List[str],
    relevant: Set[str],
    k: int
) -> float:
    """
    Precision@K: Fraction of top-K recommendations that are relevant.

    Args:
        recommended: Ordered list of recommended item IDs (best first)
        relevant: Set of relevant item IDs for this user
        k: Number of recommendations to consider

    Returns:
        Precision@K value in [0, 1]
    """
    if k == 0:
        return 0.0
    recommended_at_k = set(recommended[:k])
    n_relevant_in_rec = len(recommended_at_k & relevant)
    return n_relevant_in_rec / k


def recall_at_k(
    recommended: List[str],
    relevant: Set[str],
    k: int
) -> float:
    """
    Recall@K: Fraction of relevant items that appear in top-K.

    Args:
        recommended: Ordered list of recommended item IDs
        relevant: Set of relevant item IDs for this user
        k: Number of recommendations to consider

    Returns:
        Recall@K value in [0, 1]
    """
    if len(relevant) == 0:
        return 0.0
    recommended_at_k = set(recommended[:k])
    n_relevant_in_rec = len(recommended_at_k & relevant)
    return n_relevant_in_rec / len(relevant)


def hit_rate_at_k(
    recommendations: Dict[str, List[str]],  # user_id -> recommended items
    relevance: Dict[str, Set[str]],         # user_id -> relevant items
    k: int
) -> float:
    """
    Hit Rate@K: Fraction of users with at least one hit in top-K.

    Often more intuitive for business discussions than precision/recall.
    """
    n_hits = 0
    n_users = 0
    for user_id, recommended in recommendations.items():
        if user_id not in relevance:
            continue
        relevant = relevance[user_id]
        recommended_at_k = set(recommended[:k])
        if len(recommended_at_k & relevant) > 0:
            n_hits += 1
        n_users += 1
    return n_hits / n_users if n_users > 0 else 0.0


def average_precision(
    recommended: List[str],
    relevant: Set[str]
) -> float:
    """
    Average Precision: mean of the Precision@k values taken at each
    rank k where a relevant item appears.

    Rewards placing relevant items higher in the list.
    """
    if len(relevant) == 0:
        return 0.0

    precisions = []
    n_relevant_seen = 0
    for k, item in enumerate(recommended, start=1):
        if item in relevant:
            n_relevant_seen += 1
            precisions.append(n_relevant_seen / k)

    if len(precisions) == 0:
        return 0.0
    return np.mean(precisions)


def mean_average_precision(
    recommendations: Dict[str, List[str]],
    relevance: Dict[str, Set[str]]
) -> float:
    """
    Mean Average Precision: MAP = mean of AP across all users.

    One of the most common ranking metrics in information retrieval.
    """
    aps = []
    for user_id, recommended in recommendations.items():
        if user_id not in relevance:
            continue
        relevant = relevance[user_id]
        aps.append(average_precision(recommended, relevant))
    return np.mean(aps) if aps else 0.0


class RankingMetricsCalculator:
    """
    Comprehensive ranking metrics computation.

    Handles the common pattern of computing multiple metrics
    at multiple K values for a batch of users.
    """

    def __init__(self, k_values: List[int] = [1, 5, 10, 20, 50]):
        self.k_values = k_values

    def compute_all(
        self,
        recommendations: Dict[str, List[str]],
        relevance: Dict[str, Set[str]]
    ) -> Dict[str, float]:
        """
        Compute all metrics at all K values.

        Returns dict with keys like 'precision@5', 'recall@10', etc.
        """
        results = {}

        # Per-user metrics, then average
        user_precisions = {k: [] for k in self.k_values}
        user_recalls = {k: [] for k in self.k_values}
        user_aps = []

        for user_id, recommended in recommendations.items():
            if user_id not in relevance:
                continue
            relevant = relevance[user_id]
            for k in self.k_values:
                user_precisions[k].append(precision_at_k(recommended, relevant, k))
                user_recalls[k].append(recall_at_k(recommended, relevant, k))
            user_aps.append(average_precision(recommended, relevant))

        # Aggregate
        for k in self.k_values:
            results[f'precision@{k}'] = np.mean(user_precisions[k])
            results[f'recall@{k}'] = np.mean(user_recalls[k])
            results[f'hit_rate@{k}'] = hit_rate_at_k(recommendations, relevance, k)
            # F1
            p, r = results[f'precision@{k}'], results[f'recall@{k}']
            results[f'f1@{k}'] = 2 * p * r / (p + r) if (p + r) > 0 else 0.0

        results['map'] = np.mean(user_aps)
        return results
```

Classification metrics treat all positions in the top-K equally. But users pay more attention to the first few items—a relevant item at position 1 is more valuable than one at position 10. Rank-aware metrics account for this.
Mean Reciprocal Rank (MRR)
For queries with a single correct answer, MRR measures how high the first relevant result ranks:
$$\text{MRR} = \frac{1}{|U|} \sum_{u \in U} \frac{1}{\text{rank}_u}$$
Where $\text{rank}_u$ is the position of the first relevant item for user $u$.
Example: First relevant at position 3 → Reciprocal Rank = 1/3 ≈ 0.33
Discounted Cumulative Gain (DCG)
Accounts for graded relevance (not just binary) with position-based discounting:
$$\text{DCG@K} = \sum_{i=1}^{K} \frac{2^{\text{rel}_i} - 1}{\log_2(i + 1)}$$
Key insight: The denominator $\log_2(i+1)$ grows with rank, so the contribution of items further down the list shrinks logarithmically with position.
Normalized Discounted Cumulative Gain (NDCG)
Normalizes DCG by the maximum possible (ideal) DCG:
$$\text{NDCG@K} = \frac{\text{DCG@K}}{\text{IDCG@K}}$$
Where IDCG is the DCG of the ideal ranking (all relevant items at top, sorted by relevance).
Properties: bounded in $[0, 1]$, with 1 meaning the list matches the ideal ranking; the normalization makes scores comparable across users and queries with different numbers of relevant items.
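A quick hand calculation (with hypothetical relevance grades) shows how the normalization works. For a 3-item list with grades [3, 2, 3] in model order, the ideal order is [3, 3, 2]:

```python
import math

# Hypothetical graded relevances of a 3-item list, in model order
rels = [3.0, 2.0, 3.0]

def dcg(rels):
    # DCG = sum of (2^rel - 1) / log2(rank + 1), rank starting at 1
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels))

dcg_val = dcg(rels)                         # 7/1 + 3/log2(3) + 7/2
idcg_val = dcg(sorted(rels, reverse=True))  # ideal order: [3, 3, 2]
ndcg = dcg_val / idcg_val
print(f"DCG={dcg_val:.3f}  IDCG={idcg_val:.3f}  NDCG={ndcg:.3f}")
```

Swapping the grade-2 item into position 2 costs about 4% of NDCG (0.959 instead of 1.0), which is how the metric quantifies "the best items should come first."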
Comparing Rank-Aware Metrics:
| Metric | Handles Graded Relevance | Position Weighting | Normalized | Typical Use Case |
|---|---|---|---|---|
| MRR | No | Only first item | No | Single answer IR |
| DCG | Yes | Log discounting | No | Absolute ranking quality |
| NDCG | Yes | Log discounting | Yes | Comparing across users/queries |
| MAP | Binary only | 1/position | Partially | Standard IR metric |
```python
import numpy as np
from typing import Any, List, Dict


def dcg_at_k(
    relevances: List[float],
    k: int
) -> float:
    """
    Discounted Cumulative Gain at K.

    DCG = Σ (2^rel_i - 1) / log2(i + 1) for i in 1..K

    Args:
        relevances: Relevance scores in rank order (position 0 = rank 1)
        k: Number of items to consider

    Returns:
        DCG value
    """
    relevances = np.array(relevances[:k])
    positions = np.arange(1, len(relevances) + 1)
    # Gain formula: (2^rel - 1), though some use just rel
    gains = np.power(2, relevances) - 1
    # Log2 discount
    discounts = np.log2(positions + 1)
    return np.sum(gains / discounts)


def ndcg_at_k(
    relevances: List[float],
    k: int
) -> float:
    """
    Normalized Discounted Cumulative Gain at K.

    NDCG = DCG / IDCG where IDCG is DCG of ideal ranking.

    Args:
        relevances: Relevance scores in rank order (as produced by model)
        k: Cutoff

    Returns:
        NDCG value in [0, 1]
    """
    dcg = dcg_at_k(relevances, k)
    # IDCG: DCG if we had sorted by relevance
    ideal_relevances = sorted(relevances, reverse=True)
    idcg = dcg_at_k(ideal_relevances, k)
    if idcg == 0:
        return 0.0
    return dcg / idcg


def mrr(
    recommendations: Dict[str, List[str]],  # user -> ordered recommendations
    relevance: Dict[str, set]               # user -> relevant items
) -> float:
    """
    Mean Reciprocal Rank.

    For each user, find position of first relevant item.
    MRR = mean of 1/rank across users.
    """
    reciprocal_ranks = []
    for user_id, recs in recommendations.items():
        if user_id not in relevance:
            continue
        relevant = relevance[user_id]
        # Find first relevant item's position
        reciprocal_rank = 0.0
        for rank, item in enumerate(recs, start=1):
            if item in relevant:
                reciprocal_rank = 1.0 / rank
                break
        reciprocal_ranks.append(reciprocal_rank)
    return np.mean(reciprocal_ranks) if reciprocal_ranks else 0.0


def mean_ndcg_at_k(
    recommendations: Dict[str, List[str]],
    relevance_scores: Dict[str, Dict[str, float]],  # user -> item -> score
    k: int
) -> float:
    """
    Mean NDCG@K across all users.

    Handles graded relevance (not just binary).

    Args:
        recommendations: user_id -> ordered list of recommended items
        relevance_scores: user_id -> {item_id -> relevance_score}
        k: Cutoff

    Returns:
        Mean NDCG@K
    """
    ndcgs = []
    for user_id, recs in recommendations.items():
        if user_id not in relevance_scores:
            continue
        user_relevance = relevance_scores[user_id]
        # Build relevance vector in recommendation order
        relevances = [user_relevance.get(item, 0.0) for item in recs]
        ndcgs.append(ndcg_at_k(relevances, k))
    return np.mean(ndcgs) if ndcgs else 0.0


class NDCGEvaluator:
    """
    Production-grade NDCG evaluation with stratification
    and confidence intervals.
    """

    def __init__(self, k_values: List[int] = [5, 10, 20]):
        self.k_values = k_values

    def evaluate(
        self,
        recommendations: Dict[str, List[str]],
        relevance_scores: Dict[str, Dict[str, float]],
        user_segments: Dict[str, str] = None  # user -> segment for stratification
    ) -> Dict[str, Any]:
        """
        Comprehensive NDCG evaluation with optional stratification.
        """
        results = {}

        # Compute per-user NDCG for each K
        user_ndcgs = {k: {} for k in self.k_values}  # k -> user_id -> ndcg
        for user_id, recs in recommendations.items():
            if user_id not in relevance_scores:
                continue
            user_rel = relevance_scores[user_id]
            relevances = [user_rel.get(item, 0.0) for item in recs]
            for k in self.k_values:
                user_ndcgs[k][user_id] = ndcg_at_k(relevances, k)

        # Aggregate statistics
        for k in self.k_values:
            values = list(user_ndcgs[k].values())
            results[f'ndcg@{k}'] = np.mean(values)
            results[f'ndcg@{k}_std'] = np.std(values)
            results[f'ndcg@{k}_median'] = np.median(values)

            # 95% confidence interval using bootstrap
            if len(values) >= 30:
                bootstrap_means = []
                for _ in range(1000):
                    sample = np.random.choice(values, size=len(values), replace=True)
                    bootstrap_means.append(np.mean(sample))
                results[f'ndcg@{k}_ci95_low'] = np.percentile(bootstrap_means, 2.5)
                results[f'ndcg@{k}_ci95_high'] = np.percentile(bootstrap_means, 97.5)

        # Stratified analysis if segments provided
        if user_segments:
            segment_results = {}
            for segment in set(user_segments.values()):
                segment_users = {u for u, s in user_segments.items() if s == segment}
                for k in self.k_values:
                    segment_values = [
                        user_ndcgs[k][u] for u in segment_users
                        if u in user_ndcgs[k]
                    ]
                    if segment_values:
                        segment_results[f'{segment}_ndcg@{k}'] = np.mean(segment_values)
            results['stratified'] = segment_results

        return results
```

Most production recommendation systems use NDCG as their primary ranking metric. It handles graded relevance, rewards putting the best items first, and normalizes across users. When in doubt, start with NDCG@10.
Accuracy metrics tell you if recommendations are relevant. But relevance isn't everything. A recommender that only shows extremely popular items users would have found anyway isn't adding much value. Beyond-accuracy metrics capture these other dimensions of quality.
Catalog Coverage
What fraction of items ever gets recommended?
$$\text{Coverage} = \frac{|\bigcup_{u} \text{Recommended}(u)|}{|I|}$$
Problem: Systems often show only 1-5% of catalog (popularity bias).
Intra-List Diversity
How different are the items within a single recommendation list?
$$\text{ILD}(L) = \frac{1}{K(K-1)} \sum_{i \in L} \sum_{j \in L, j \neq i} (1 - \text{sim}(i, j))$$
Measured as average pairwise dissimilarity.
Novelty
Does the system recommend less popular items users haven't seen?
$$\text{Novelty} = \frac{1}{|L|} \sum_{i \in L} \log_2 \frac{|U|}{|U_i|}$$
Where $|U_i|$ is the number of users who have interacted with item $i$.
```python
import numpy as np
from typing import List, Dict, Set
from collections import Counter


def catalog_coverage(
    recommendations: Dict[str, List[str]],
    total_items: int,
    k: int = 10
) -> float:
    """
    Catalog Coverage: Fraction of items that appear in any top-K list.

    Low coverage indicates popularity bias—most items never get shown.
    """
    recommended_items = set()
    for user_id, rec_list in recommendations.items():
        recommended_items.update(rec_list[:k])
    return len(recommended_items) / total_items


def gini_coefficient(recommendations: Dict[str, List[str]], k: int = 10) -> float:
    """
    Gini Coefficient of recommendation distribution.

    0 = perfect equality (all items recommended equally)
    1 = perfect inequality (one item gets all recommendations)

    High Gini indicates winner-take-all popularity bias.
    """
    # Count how often each item is recommended
    item_counts = Counter()
    for user_id, rec_list in recommendations.items():
        for item in rec_list[:k]:
            item_counts[item] += 1

    counts = np.array(sorted(item_counts.values()))
    n = len(counts)
    if n == 0:
        return 0.0

    # Gini formula
    cumsum = np.cumsum(counts)
    return (n + 1 - 2 * cumsum.sum() / cumsum[-1]) / n


def intra_list_diversity(
    rec_list: List[str],
    item_embeddings: Dict[str, np.ndarray]
) -> float:
    """
    Intra-List Diversity (ILD): Average pairwise dissimilarity within list.

    Higher = more diverse recommendations.
    """
    if len(rec_list) < 2:
        return 0.0

    # Get embeddings for items in list
    embeddings = []
    for item in rec_list:
        if item in item_embeddings:
            embeddings.append(item_embeddings[item])
    if len(embeddings) < 2:
        return 0.0

    embeddings = np.array(embeddings)

    # Compute pairwise cosine similarities
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / (norms + 1e-10)
    similarity_matrix = normalized @ normalized.T

    # Average dissimilarity (excluding diagonal)
    n = len(embeddings)
    mask = ~np.eye(n, dtype=bool)
    avg_dissimilarity = 1 - similarity_matrix[mask].mean()
    return avg_dissimilarity


def novelty(
    rec_list: List[str],
    item_popularity: Dict[str, int],
    total_users: int
) -> float:
    """
    Novelty: Tendency to recommend less popular items.

    Novelty(L) = mean over items of log2(|U| / popularity(item))

    Higher = recommending more obscure items.
    """
    novelties = []
    for item in rec_list:
        pop = item_popularity.get(item, 1)  # Default 1 to avoid log(0)
        novelties.append(np.log2(total_users / pop))
    return np.mean(novelties) if novelties else 0.0


def serendipity(
    rec_list: List[str],
    user_history: Set[str],
    relevant: Set[str],
    popularity_baseline: Set[str]
) -> float:
    """
    Serendipity: Relevant items that user wouldn't have expected.

    Serendipitous = relevant AND unexpected AND novel

    Args:
        rec_list: Recommended items
        user_history: Items user has already interacted with
        relevant: Items that are relevant to user
        popularity_baseline: Items a popularity baseline would recommend

    Returns:
        Serendipity score
    """
    serendipitous_items = []
    for item in rec_list:
        is_relevant = item in relevant
        is_unexpected = item not in popularity_baseline
        is_novel = item not in user_history
        if is_relevant and is_unexpected and is_novel:
            serendipitous_items.append(item)
    return len(serendipitous_items) / len(rec_list) if rec_list else 0.0


class BeyondAccuracyEvaluator:
    """
    Comprehensive beyond-accuracy evaluation.

    Computes coverage, diversity, novelty across all recommendations.
    """

    def __init__(
        self,
        item_embeddings: Dict[str, np.ndarray],
        item_popularity: Dict[str, int],
        total_items: int,
        total_users: int
    ):
        self.item_embeddings = item_embeddings
        self.item_popularity = item_popularity
        self.total_items = total_items
        self.total_users = total_users

    def evaluate(
        self,
        recommendations: Dict[str, List[str]],
        k: int = 10
    ) -> Dict[str, float]:
        """Compute all beyond-accuracy metrics."""
        results = {}

        # Coverage metrics
        results['catalog_coverage'] = catalog_coverage(
            recommendations, self.total_items, k
        )
        results['gini_coefficient'] = gini_coefficient(recommendations, k)

        # Per-user metrics, then average
        diversities = []
        novelties = []
        for user_id, rec_list in recommendations.items():
            rec_list = rec_list[:k]
            diversities.append(intra_list_diversity(rec_list, self.item_embeddings))
            novelties.append(novelty(rec_list, self.item_popularity, self.total_users))

        results['avg_intra_list_diversity'] = np.mean(diversities)
        results['avg_novelty'] = np.mean(novelties)

        # Long-tail analysis
        popularity_values = np.array(list(self.item_popularity.values()))
        median_popularity = np.median(popularity_values)

        long_tail_count = 0
        total_count = 0
        for user_id, rec_list in recommendations.items():
            for item in rec_list[:k]:
                total_count += 1
                if self.item_popularity.get(item, 0) < median_popularity:
                    long_tail_count += 1
        results['long_tail_fraction'] = (
            long_tail_count / total_count if total_count > 0 else 0.0
        )

        return results
```

Aggressively optimizing for diversity can hurt accuracy. Users may click less on diverse lists even if they provide more value long-term. This is why A/B tests on retention (not just clicks) are essential when tuning diversity parameters.
There's a fundamental gap between offline metrics (computed on historical data) and online performance (measured in production). Understanding this gap is critical for avoiding misleading conclusions.
Offline Evaluation: metrics computed on logged historical interactions—fast, cheap, and reproducible, but limited to what the old system happened to show.
Online Evaluation (A/B Testing): metrics measured on live traffic, with users randomly split between variants—slow and costly, but it captures real user behavior.
| Aspect | Offline | Online (A/B) |
|---|---|---|
| Speed | Fast (minutes to hours) | Slow (days to weeks) |
| Cost | Low (compute only) | High (opportunity cost, eng time) |
| Variants testable | Many (100s) | Few (<10 simultaneously) |
| Selection bias | Problem | Not a problem |
| Position bias | Problem | Handled by randomization |
| Measures actual behavior | No | Yes |
| Statistical power | High (full dataset) | Depends on traffic split |
| Reproducibility | Exact | Approximate |
The Offline-Online Correlation Problem:
Research consistently shows that offline metrics correlate weakly (~0.3-0.5) with online metrics. A model with 2% better NDCG might show no improvement—or even degradation—online.
Why the gap exists:
Selection Bias: Offline test sets contain items users chose to interact with. You only observe outcomes for what was shown.
Positional Bias: Items at position 1 get clicked more often regardless of quality. Offline metrics don't account for this.
Confounders: Time of day, device, user mood—many factors affect clicks that aren't captured in offline data.
Feedback Loops: Online, your recommendations affect future user behavior. Offline, this dynamic is missing.
Best Practice: Funnel Approach—use cheap offline metrics to screen many candidate models, then promote only the most promising few to online A/B tests for the final decision.
Advanced techniques like Inverse Propensity Scoring (IPS) can partially debias offline metrics by re-weighting based on the probability that an item was shown. This narrows (but doesn't close) the offline-online gap.
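As a rough illustration of the idea (the click/propensity log below is entirely made up, and real IPS estimators require careful propensity modeling and weight clipping), a self-normalized IPS estimate re-weights each logged impression by the inverse of the probability that the logging policy showed it:

```python
import numpy as np

# Hypothetical logged interactions: (clicked, propensity), where
# propensity = probability the logging policy displayed that item
logs = [(1, 0.9), (0, 0.5), (1, 0.1), (0, 0.8), (1, 0.3)]

# Naive CTR treats every impression equally
naive_ctr = float(np.mean([c for c, _ in logs]))

# IPS re-weights each event by 1/propensity, so items the old policy
# rarely showed are not underrepresented; dividing by the weight sum
# (self-normalized IPS) reduces variance.
weights = np.array([1.0 / p for _, p in logs])
clicks = np.array([c for c, _ in logs], dtype=float)
snips_ctr = float((weights * clicks).sum() / weights.sum())

print(f"naive CTR: {naive_ctr:.3f}")
print(f"SNIPS CTR: {snips_ctr:.3f}")
```

Here the rarely shown items happened to be clicked, so the debiased estimate (≈0.82) is higher than the naive one (0.60)—the naive average was dragged down by the logging policy's exposure choices.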
Metric selection should align with your actual business goals and user experience. Different applications require different primary metrics.
Decision Framework:
Q1: What kind of task is it? Rating prediction → RMSE/MAE; top-N recommendation or retrieval → ranking metrics.
Q2: Is there graded relevance? Yes → NDCG; binary relevance only → Precision/Recall@K or MAP.
Q3: Does position matter? Yes → rank-aware metrics (NDCG, MRR, MAP); no → Precision/Recall@K suffice.
Q4: Beyond-accuracy concerns? Add coverage, diversity, novelty, and serendipity as guardrail metrics alongside your primary one.
| Use Case | Primary Metrics | Secondary Metrics | Business KPI |
|---|---|---|---|
| E-commerce product recs | NDCG@10, Recall@50 | Coverage, GMV lift | Revenue per session |
| Video streaming | Watch time, completion rate | NDCG@10, diversity | Subscriber retention |
| Music discovery | Listen counts, saves | Novelty, intra-list diversity | MAU, session length |
| News feed | CTR, dwell time | Category diversity, recency | Daily active users |
| Search ranking | NDCG@10, MRR | Abandonment rate | Queries per session |
| Ad ranking | AUC-ROC, LogLoss | Calibration, diversity | Revenue, user satisfaction |
We've explored the metrics landscape for recommendation systems: prediction accuracy (RMSE, MAE), ranking quality (Precision/Recall@K, MAP, MRR, NDCG), beyond-accuracy measures (coverage, diversity, novelty, serendipity), and the gap between offline metrics and online business impact.
What's Next:
With evaluation metrics mastered, we'll next explore A/B testing—the gold standard for measuring recommendation system impact in production. You'll learn how to design experiments, calculate sample sizes, avoid common pitfalls, and interpret results with proper statistical rigor.
You now have a comprehensive understanding of recommendation system evaluation metrics. You can select appropriate metrics for your use case, compute them correctly, and understand their limitations. Next, we'll dive into A/B testing methodologies for production validation.