You've built a recommendation system. It trains without errors, makes predictions, and generates ranked lists of items. But is it actually good? How do you know if version 2.0 is better than version 1.0? How do you convince stakeholders that your model improvements translate to business value?
The answer lies in evaluation metrics—quantitative measures that capture different aspects of recommendation quality. But here's the challenge: there is no single metric that captures everything. Optimizing for one metric often comes at the cost of another. Precision and recall trade off. Accuracy and diversity trade off. Short-term engagement and long-term satisfaction trade off.
Mastering evaluation metrics means understanding not just how to compute them, but when each is appropriate and what each actually measures about user experience.
By the end of this page, you will understand the full taxonomy of recommendation metrics—from rating prediction accuracy through ranking quality, coverage, diversity, novelty, and serendipity. You'll learn to choose appropriate metrics for your use case and avoid common evaluation pitfalls that mislead engineering decisions.
Recommendation metrics can be organized into several distinct categories, each capturing different aspects of quality:
1. Prediction Accuracy Metrics
Measure how well predicted ratings match actual ratings.
2. Ranking Metrics
Measure the quality of the ranked list shown to users.
3. Coverage Metrics
Measure how much of the catalog is being recommended.
4. Diversity and Novelty Metrics
Measure variety and discovery in recommendations.
5. Business Metrics
Measure actual business impact.
| Category | Question Answered | Optimization Trade-off | When to Use |
|---|---|---|---|
| Prediction Accuracy | How close are predicted ratings? | May miss ranking quality | Explicit rating prediction tasks |
| Ranking Quality | Are the best items at the top? | May ignore long-tail items | Top-N recommendation, retrieval |
| Coverage | What fraction of items get recommended? | May hurt relevance | Ensuring catalog diversity |
| Diversity/Novelty | Is the list varied and surprising? | May reduce immediate relevance | Discovery, user engagement |
| Business Metrics | What's the bottom-line impact? | May have long feedback loops | Final deployment decisions |
Offline metrics (computed on historical data) often correlate weakly with online business metrics (measured in production). A model with 2% better NDCG might show no improvement in CTR. Always validate significant changes with online A/B tests.
Prediction accuracy metrics quantify how close predicted ratings are to actual ratings. They were foundational in the Netflix Prize era but have become less central as the field has shifted toward ranking.
Root Mean Squared Error (RMSE)
The most common rating prediction metric:
$$\text{RMSE} = \sqrt{\frac{1}{|T|} \sum_{(u,i) \in T} (\hat{r}_{ui} - r_{ui})^2}$$
Where $T$ is the test set of (user, item, rating) tuples.
Properties: penalizes large errors quadratically, is expressed in the same units as the ratings, and lower is better.
Mean Absolute Error (MAE)
$$\text{MAE} = \frac{1}{|T|} \sum_{(u,i) \in T} |\hat{r}_{ui} - r_{ui}|$$
Properties: more robust to outliers than RMSE, and directly interpretable as "predictions are off by X stars on average."
```python
import numpy as np
from typing import List, Tuple
from sklearn.metrics import mean_squared_error, mean_absolute_error


def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """
    Root Mean Squared Error.

    Standard metric for rating prediction accuracy.
    Penalizes large errors more than small ones.

    Args:
        y_true: Ground truth ratings
        y_pred: Predicted ratings

    Returns:
        RMSE value
    """
    return np.sqrt(mean_squared_error(y_true, y_pred))


def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """
    Mean Absolute Error.

    More robust to outliers than RMSE.
    Easier to interpret: "predictions off by X on average".
    """
    return mean_absolute_error(y_true, y_pred)


def rating_prediction_evaluation(
    test_ratings: List[Tuple[int, int, float]],  # (user_id, item_id, rating)
    predictions: dict  # {(user_id, item_id): predicted_rating}
) -> dict:
    """
    Comprehensive rating prediction evaluation.

    Returns multiple metrics for thorough analysis.
    """
    y_true = []
    y_pred = []
    for user_id, item_id, actual_rating in test_ratings:
        if (user_id, item_id) in predictions:
            y_true.append(actual_rating)
            y_pred.append(predictions[(user_id, item_id)])

    y_true = np.array(y_true)
    y_pred = np.array(y_pred)

    # Compute various metrics
    results = {
        'rmse': rmse(y_true, y_pred),
        'mae': mae(y_true, y_pred),
        'n_predictions': len(y_true),
        'coverage': len(y_true) / len(test_ratings),  # Fraction we could predict
    }

    # Stratified analysis: error by rating level
    for rating_value in sorted(set(y_true)):
        mask = y_true == rating_value
        if mask.sum() > 0:
            results[f'rmse_rating_{rating_value}'] = rmse(y_true[mask], y_pred[mask])

    return results


class RatingPredictionEvaluator:
    """
    Comprehensive evaluator for rating prediction models.

    Includes stratified analysis by user activity, item popularity, etc.
    """

    def __init__(
        self,
        user_activity: dict = None,    # user_id -> n_ratings
        item_popularity: dict = None   # item_id -> n_ratings
    ):
        self.user_activity = user_activity or {}
        self.item_popularity = item_popularity or {}

    def evaluate(
        self,
        test_ratings: List[Tuple[int, int, float]],
        predictions: dict
    ) -> dict:
        """Full evaluation with stratification."""
        results = rating_prediction_evaluation(test_ratings, predictions)

        # Stratify by user activity
        active_users = {u for u, count in self.user_activity.items() if count > 50}
        cold_users = {u for u, count in self.user_activity.items() if count < 10}

        # Active user performance
        active_test = [(u, i, r) for u, i, r in test_ratings if u in active_users]
        if active_test:
            y_true = np.array([r for u, i, r in active_test])
            y_pred = np.array([predictions.get((u, i), np.nan) for u, i, r in active_test])
            valid = ~np.isnan(y_pred)
            if valid.sum() > 0:
                results['rmse_active_users'] = rmse(y_true[valid], y_pred[valid])

        # Cold user performance
        cold_test = [(u, i, r) for u, i, r in test_ratings if u in cold_users]
        if cold_test:
            y_true = np.array([r for u, i, r in cold_test])
            y_pred = np.array([predictions.get((u, i), np.nan) for u, i, r in cold_test])
            valid = ~np.isnan(y_pred)
            if valid.sum() > 0:
                results['rmse_cold_users'] = rmse(y_true[valid], y_pred[valid])

        return results
```

Low RMSE doesn't guarantee good rankings. A model could achieve low RMSE by predicting everything around the mean (e.g., always predicting 3.5 stars) while completely failing to distinguish good from bad items. For top-N recommendations, ranking metrics are more appropriate.
For top-N recommendations, we care about whether the recommended items are relevant, not what exact rating they'd receive. This leads to classification-style metrics.
Precision@K
Fraction of recommended items that are relevant:
$$\text{Precision@K} = \frac{|\text{Recommended}_K \cap \text{Relevant}|}{K}$$
Example: If we recommend 10 items and 3 are relevant → Precision@10 = 0.3
Interpretation: "Of the items we showed, what fraction were good?"
Recall@K
Fraction of relevant items that are recommended:
$$\text{Recall@K} = \frac{|\text{Recommended}_K \cap \text{Relevant}|}{|\text{Relevant}|}$$
Example: If user has 15 relevant items and we recommend 10, with 3 being relevant → Recall@10 = 3/15 = 0.2
Interpretation: "Of all good items, what fraction did we find?"
The Precision-Recall Trade-off:
| K | Typical Precision@K | Typical Recall@K | Comment |
|---|---|---|---|
| 1 | Higher | Lower | Very selective, may miss good items |
| 5 | Moderate | Moderate | Common for mobile screens |
| 10 | Moderate | Higher | Common for desktop |
| 50 | Lower | High | Casting wider net |
| 100 | Lowest | Highest | Retrieval stage metric |
Increasing K typically increases recall but decreases precision.
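The trade-off is easy to see on a toy example (hypothetical item IDs): one user with 5 relevant items, scored against a ranked list of 10 recommendations.

```python
# Hypothetical data: 5 relevant items, relevant hits at ranks 1, 3, 6
relevant = {"a", "b", "c", "d", "e"}
recommended = ["a", "x", "b", "y", "z", "c", "u", "v", "w", "q"]

def precision_at_k(recs, rel, k):
    """Fraction of the top-K that is relevant."""
    return len(set(recs[:k]) & rel) / k

def recall_at_k(recs, rel, k):
    """Fraction of all relevant items found in the top-K."""
    return len(set(recs[:k]) & rel) / len(rel)

for k in (1, 3, 5, 10):
    print(f"K={k:2d}  precision={precision_at_k(recommended, relevant, k):.2f}"
          f"  recall={recall_at_k(recommended, relevant, k):.2f}")
```

As K grows from 1 to 10, precision falls from 1.0 to 0.3 while recall rises from 0.2 to 0.6, exactly the pattern in the table above.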
F1@K Score:
Harmonic mean of precision and recall:
$$\text{F1@K} = 2 \cdot \frac{\text{Precision@K} \cdot \text{Recall@K}}{\text{Precision@K} + \text{Recall@K}}$$
Useful when you want a single metric balancing both.
Hit Rate@K:
Fraction of users for whom at least one recommended item is relevant:
$$\text{HitRate@K} = \frac{|\{u \in U : |\text{Recommended}_K(u) \cap \text{Relevant}(u)| \geq 1\}|}{|U|}$$
Often more interpretable than Precision/Recall for stakeholders.
```python
import numpy as np
from typing import List, Dict, Set


def precision_at_k(
    recommended: List[str],
    relevant: Set[str],
    k: int
) -> float:
    """
    Precision@K: Fraction of top-K recommendations that are relevant.

    Args:
        recommended: Ordered list of recommended item IDs (best first)
        relevant: Set of relevant item IDs for this user
        k: Number of recommendations to consider

    Returns:
        Precision@K value in [0, 1]
    """
    if k == 0:
        return 0.0
    recommended_at_k = set(recommended[:k])
    n_relevant_in_rec = len(recommended_at_k & relevant)
    return n_relevant_in_rec / k


def recall_at_k(
    recommended: List[str],
    relevant: Set[str],
    k: int
) -> float:
    """
    Recall@K: Fraction of relevant items that appear in top-K.

    Args:
        recommended: Ordered list of recommended item IDs
        relevant: Set of relevant item IDs for this user
        k: Number of recommendations to consider

    Returns:
        Recall@K value in [0, 1]
    """
    if len(relevant) == 0:
        return 0.0
    recommended_at_k = set(recommended[:k])
    n_relevant_in_rec = len(recommended_at_k & relevant)
    return n_relevant_in_rec / len(relevant)


def hit_rate_at_k(
    recommendations: Dict[str, List[str]],  # user_id -> recommended items
    relevance: Dict[str, Set[str]],         # user_id -> relevant items
    k: int
) -> float:
    """
    Hit Rate@K: Fraction of users with at least one hit in top-K.

    Often more intuitive for business discussions than precision/recall.
    """
    n_hits = 0
    n_users = 0
    for user_id, recommended in recommendations.items():
        if user_id not in relevance:
            continue
        relevant = relevance[user_id]
        recommended_at_k = set(recommended[:k])
        if len(recommended_at_k & relevant) > 0:
            n_hits += 1
        n_users += 1
    return n_hits / n_users if n_users > 0 else 0.0


def average_precision(
    recommended: List[str],
    relevant: Set[str]
) -> float:
    """
    Average Precision: mean of the Precision@k values taken at each
    rank k where a relevant item appears.

    Rewards placing relevant items higher in the list.
    """
    if len(relevant) == 0:
        return 0.0

    precisions = []
    n_relevant_seen = 0
    for k, item in enumerate(recommended, start=1):
        if item in relevant:
            n_relevant_seen += 1
            precisions.append(n_relevant_seen / k)

    if len(precisions) == 0:
        return 0.0
    return np.mean(precisions)


def mean_average_precision(
    recommendations: Dict[str, List[str]],
    relevance: Dict[str, Set[str]]
) -> float:
    """
    Mean Average Precision: MAP = mean of AP across all users.

    One of the most common ranking metrics in information retrieval.
    """
    aps = []
    for user_id, recommended in recommendations.items():
        if user_id not in relevance:
            continue
        relevant = relevance[user_id]
        aps.append(average_precision(recommended, relevant))
    return np.mean(aps) if aps else 0.0


class RankingMetricsCalculator:
    """
    Comprehensive ranking metrics computation.

    Handles the common pattern of computing multiple metrics
    at multiple K values for a batch of users.
    """

    def __init__(self, k_values: List[int] = [1, 5, 10, 20, 50]):
        self.k_values = k_values

    def compute_all(
        self,
        recommendations: Dict[str, List[str]],
        relevance: Dict[str, Set[str]]
    ) -> Dict[str, float]:
        """
        Compute all metrics at all K values.

        Returns dict with keys like 'precision@5', 'recall@10', etc.
        """
        results = {}

        # Per-user metrics, then average
        user_precisions = {k: [] for k in self.k_values}
        user_recalls = {k: [] for k in self.k_values}
        user_aps = []

        for user_id, recommended in recommendations.items():
            if user_id not in relevance:
                continue
            relevant = relevance[user_id]
            for k in self.k_values:
                user_precisions[k].append(precision_at_k(recommended, relevant, k))
                user_recalls[k].append(recall_at_k(recommended, relevant, k))
            user_aps.append(average_precision(recommended, relevant))

        # Aggregate
        for k in self.k_values:
            results[f'precision@{k}'] = np.mean(user_precisions[k])
            results[f'recall@{k}'] = np.mean(user_recalls[k])
            results[f'hit_rate@{k}'] = hit_rate_at_k(recommendations, relevance, k)
            # F1
            p, r = results[f'precision@{k}'], results[f'recall@{k}']
            results[f'f1@{k}'] = 2 * p * r / (p + r) if (p + r) > 0 else 0.0

        results['map'] = np.mean(user_aps)
        return results
```

Classification metrics treat all positions in the top-K equally. But users pay more attention to the first few items—a relevant item at position 1 is more valuable than one at position 10. Rank-aware metrics account for this.
Mean Reciprocal Rank (MRR)
For queries with a single correct answer, MRR measures how high the first relevant result ranks:
$$\text{MRR} = \frac{1}{|U|} \sum_{u \in U} \frac{1}{\text{rank}_u}$$
Where $\text{rank}_u$ is the position of the first relevant item for user $u$.
Example: First relevant at position 3 → Reciprocal Rank = 1/3 ≈ 0.33
Discounted Cumulative Gain (DCG)
Accounts for graded relevance (not just binary) with position-based discounting:
$$\text{DCG@K} = \sum_{i=1}^{K} \frac{2^{\text{rel}_i} - 1}{\log_2(i + 1)}$$
Key insight: The denominator $\log_2(i+1)$ grows with rank, so the contribution of items further down the list shrinks logarithmically with position.
Normalized Discounted Cumulative Gain (NDCG)
Normalizes DCG by the maximum possible (ideal) DCG:
$$\text{NDCG@K} = \frac{\text{DCG@K}}{\text{IDCG@K}}$$
Where IDCG is the DCG of the ideal ranking (all relevant items at top, sorted by relevance).
Properties: bounded in $[0, 1]$, with 1 meaning the list matches the ideal ranking; the normalization makes scores comparable across users and queries with different numbers of relevant items.
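A quick hand calculation (with hypothetical relevance grades) shows how the normalization works. For a 3-item list with grades [3, 2, 3] in model order, the ideal order is [3, 3, 2]:

```python
import math

# Hypothetical graded relevances of a 3-item list, in model order
rels = [3.0, 2.0, 3.0]

def dcg(rels):
    # DCG = sum of (2^rel - 1) / log2(rank + 1), rank starting at 1
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels))

dcg_val = dcg(rels)                         # 7/1 + 3/log2(3) + 7/2
idcg_val = dcg(sorted(rels, reverse=True))  # ideal order: [3, 3, 2]
ndcg = dcg_val / idcg_val
print(f"DCG={dcg_val:.3f}  IDCG={idcg_val:.3f}  NDCG={ndcg:.3f}")
```

Swapping the grade-2 item into position 2 costs about 4% of NDCG (0.959 instead of 1.0), which is how the metric quantifies "the best items should come first."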
Comparing Rank-Aware Metrics:
| Metric | Handles Graded Relevance | Position Weighting | Normalized | Typical Use Case |
|---|---|---|---|---|
| MRR | No | Only first item | No | Single answer IR |
| DCG | Yes | Log discounting | No | Absolute ranking quality |
| NDCG | Yes | Log discounting | Yes | Comparing across users/queries |
| MAP | Binary only | 1/position | Partially | Standard IR metric |
```python
import numpy as np
from typing import Any, List, Dict


def dcg_at_k(
    relevances: List[float],
    k: int
) -> float:
    """
    Discounted Cumulative Gain at K.

    DCG = Σ (2^rel_i - 1) / log2(i + 1) for i in 1..K

    Args:
        relevances: Relevance scores in rank order (position 0 = rank 1)
        k: Number of items to consider

    Returns:
        DCG value
    """
    relevances = np.array(relevances[:k])
    positions = np.arange(1, len(relevances) + 1)
    # Gain formula: (2^rel - 1), though some use just rel
    gains = np.power(2, relevances) - 1
    # Log2 discount
    discounts = np.log2(positions + 1)
    return np.sum(gains / discounts)


def ndcg_at_k(
    relevances: List[float],
    k: int
) -> float:
    """
    Normalized Discounted Cumulative Gain at K.

    NDCG = DCG / IDCG where IDCG is DCG of ideal ranking.

    Args:
        relevances: Relevance scores in rank order (as produced by model)
        k: Cutoff

    Returns:
        NDCG value in [0, 1]
    """
    dcg = dcg_at_k(relevances, k)
    # IDCG: DCG if we had sorted by relevance
    ideal_relevances = sorted(relevances, reverse=True)
    idcg = dcg_at_k(ideal_relevances, k)
    if idcg == 0:
        return 0.0
    return dcg / idcg


def mrr(
    recommendations: Dict[str, List[str]],  # user -> ordered recommendations
    relevance: Dict[str, set]               # user -> relevant items
) -> float:
    """
    Mean Reciprocal Rank.

    For each user, find position of first relevant item.
    MRR = mean of 1/rank across users.
    """
    reciprocal_ranks = []
    for user_id, recs in recommendations.items():
        if user_id not in relevance:
            continue
        relevant = relevance[user_id]
        # Find first relevant item's position
        reciprocal_rank = 0.0
        for rank, item in enumerate(recs, start=1):
            if item in relevant:
                reciprocal_rank = 1.0 / rank
                break
        reciprocal_ranks.append(reciprocal_rank)
    return np.mean(reciprocal_ranks) if reciprocal_ranks else 0.0


def mean_ndcg_at_k(
    recommendations: Dict[str, List[str]],
    relevance_scores: Dict[str, Dict[str, float]],  # user -> item -> score
    k: int
) -> float:
    """
    Mean NDCG@K across all users.

    Handles graded relevance (not just binary).

    Args:
        recommendations: user_id -> ordered list of recommended items
        relevance_scores: user_id -> {item_id -> relevance_score}
        k: Cutoff

    Returns:
        Mean NDCG@K
    """
    ndcgs = []
    for user_id, recs in recommendations.items():
        if user_id not in relevance_scores:
            continue
        user_relevance = relevance_scores[user_id]
        # Build relevance vector in recommendation order
        relevances = [user_relevance.get(item, 0.0) for item in recs]
        ndcgs.append(ndcg_at_k(relevances, k))
    return np.mean(ndcgs) if ndcgs else 0.0


class NDCGEvaluator:
    """
    Production-grade NDCG evaluation with stratification
    and confidence intervals.
    """

    def __init__(self, k_values: List[int] = [5, 10, 20]):
        self.k_values = k_values

    def evaluate(
        self,
        recommendations: Dict[str, List[str]],
        relevance_scores: Dict[str, Dict[str, float]],
        user_segments: Dict[str, str] = None  # user -> segment for stratification
    ) -> Dict[str, Any]:
        """
        Comprehensive NDCG evaluation with optional stratification.
        """
        results = {}

        # Compute per-user NDCG for each K
        user_ndcgs = {k: {} for k in self.k_values}  # k -> user_id -> ndcg
        for user_id, recs in recommendations.items():
            if user_id not in relevance_scores:
                continue
            user_rel = relevance_scores[user_id]
            relevances = [user_rel.get(item, 0.0) for item in recs]
            for k in self.k_values:
                user_ndcgs[k][user_id] = ndcg_at_k(relevances, k)

        # Aggregate statistics
        for k in self.k_values:
            values = list(user_ndcgs[k].values())
            results[f'ndcg@{k}'] = np.mean(values)
            results[f'ndcg@{k}_std'] = np.std(values)
            results[f'ndcg@{k}_median'] = np.median(values)

            # 95% confidence interval using bootstrap
            if len(values) >= 30:
                bootstrap_means = []
                for _ in range(1000):
                    sample = np.random.choice(values, size=len(values), replace=True)
                    bootstrap_means.append(np.mean(sample))
                results[f'ndcg@{k}_ci95_low'] = np.percentile(bootstrap_means, 2.5)
                results[f'ndcg@{k}_ci95_high'] = np.percentile(bootstrap_means, 97.5)

        # Stratified analysis if segments provided
        if user_segments:
            segment_results = {}
            for segment in set(user_segments.values()):
                segment_users = {u for u, s in user_segments.items() if s == segment}
                for k in self.k_values:
                    segment_values = [
                        user_ndcgs[k][u] for u in segment_users
                        if u in user_ndcgs[k]
                    ]
                    if segment_values:
                        segment_results[f'{segment}_ndcg@{k}'] = np.mean(segment_values)
            results['stratified'] = segment_results

        return results
```

Most production recommendation systems use NDCG as their primary ranking metric. It handles graded relevance, rewards putting the best items first, and normalizes across users. When in doubt, start with NDCG@10.
Accuracy metrics tell you if recommendations are relevant. But relevance isn't everything. A recommender that only shows extremely popular items users would have found anyway isn't adding much value. Beyond-accuracy metrics capture these other dimensions of quality.
Catalog Coverage
What fraction of items ever gets recommended?
$$\text{Coverage} = \frac{|\bigcup_{u} \text{Recommended}(u)|}{|I|}$$
Problem: Systems often show only 1-5% of catalog (popularity bias).
Intra-List Diversity
How different are the items within a single recommendation list?
$$\text{ILD}(L) = \frac{1}{K(K-1)} \sum_{i \in L} \sum_{j \in L, j \neq i} (1 - \text{sim}(i, j))$$
Measured as average pairwise dissimilarity.
Novelty
Does the system recommend less popular items users haven't seen?
$$\text{Novelty} = \frac{1}{|L|} \sum_{i \in L} \log_2 \frac{|U|}{|U_i|}$$
Where $|U_i|$ is the number of users who have interacted with item $i$.
```python
import numpy as np
from typing import List, Dict, Set
from collections import Counter


def catalog_coverage(
    recommendations: Dict[str, List[str]],
    total_items: int,
    k: int = 10
) -> float:
    """
    Catalog Coverage: Fraction of items that appear in any top-K list.

    Low coverage indicates popularity bias—most items never get shown.
    """
    recommended_items = set()
    for user_id, rec_list in recommendations.items():
        recommended_items.update(rec_list[:k])
    return len(recommended_items) / total_items


def gini_coefficient(recommendations: Dict[str, List[str]], k: int = 10) -> float:
    """
    Gini Coefficient of recommendation distribution.

    0 = perfect equality (all items recommended equally)
    1 = perfect inequality (one item gets all recommendations)

    High Gini indicates winner-take-all popularity bias.
    """
    # Count how often each item is recommended
    item_counts = Counter()
    for user_id, rec_list in recommendations.items():
        for item in rec_list[:k]:
            item_counts[item] += 1

    counts = np.array(sorted(item_counts.values()))
    n = len(counts)
    if n == 0:
        return 0.0

    # Gini formula
    cumsum = np.cumsum(counts)
    return (n + 1 - 2 * cumsum.sum() / cumsum[-1]) / n


def intra_list_diversity(
    rec_list: List[str],
    item_embeddings: Dict[str, np.ndarray]
) -> float:
    """
    Intra-List Diversity (ILD): Average pairwise dissimilarity within list.

    Higher = more diverse recommendations.
    """
    if len(rec_list) < 2:
        return 0.0

    # Get embeddings for items in list
    embeddings = []
    for item in rec_list:
        if item in item_embeddings:
            embeddings.append(item_embeddings[item])
    if len(embeddings) < 2:
        return 0.0

    embeddings = np.array(embeddings)

    # Compute pairwise cosine similarities
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / (norms + 1e-10)
    similarity_matrix = normalized @ normalized.T

    # Average dissimilarity (excluding diagonal)
    n = len(embeddings)
    mask = ~np.eye(n, dtype=bool)
    avg_dissimilarity = 1 - similarity_matrix[mask].mean()
    return avg_dissimilarity


def novelty(
    rec_list: List[str],
    item_popularity: Dict[str, int],
    total_users: int
) -> float:
    """
    Novelty: Tendency to recommend less popular items.

    Novelty(L) = mean over items of log2(|U| / popularity(item))

    Higher = recommending more obscure items.
    """
    novelties = []
    for item in rec_list:
        pop = item_popularity.get(item, 1)  # Default 1 to avoid log(0)
        novelties.append(np.log2(total_users / pop))
    return np.mean(novelties) if novelties else 0.0


def serendipity(
    rec_list: List[str],
    user_history: Set[str],
    relevant: Set[str],
    popularity_baseline: Set[str]
) -> float:
    """
    Serendipity: Relevant items that user wouldn't have expected.

    Serendipitous = relevant AND unexpected AND novel

    Args:
        rec_list: Recommended items
        user_history: Items user has already interacted with
        relevant: Items that are relevant to user
        popularity_baseline: Items a popularity baseline would recommend

    Returns:
        Serendipity score
    """
    serendipitous_items = []
    for item in rec_list:
        is_relevant = item in relevant
        is_unexpected = item not in popularity_baseline
        is_novel = item not in user_history
        if is_relevant and is_unexpected and is_novel:
            serendipitous_items.append(item)
    return len(serendipitous_items) / len(rec_list) if rec_list else 0.0


class BeyondAccuracyEvaluator:
    """
    Comprehensive beyond-accuracy evaluation.

    Computes coverage, diversity, novelty across all recommendations.
    """

    def __init__(
        self,
        item_embeddings: Dict[str, np.ndarray],
        item_popularity: Dict[str, int],
        total_items: int,
        total_users: int
    ):
        self.item_embeddings = item_embeddings
        self.item_popularity = item_popularity
        self.total_items = total_items
        self.total_users = total_users

    def evaluate(
        self,
        recommendations: Dict[str, List[str]],
        k: int = 10
    ) -> Dict[str, float]:
        """Compute all beyond-accuracy metrics."""
        results = {}

        # Coverage metrics
        results['catalog_coverage'] = catalog_coverage(
            recommendations, self.total_items, k
        )
        results['gini_coefficient'] = gini_coefficient(recommendations, k)

        # Per-user metrics, then average
        diversities = []
        novelties = []
        for user_id, rec_list in recommendations.items():
            rec_list = rec_list[:k]
            diversities.append(intra_list_diversity(rec_list, self.item_embeddings))
            novelties.append(novelty(rec_list, self.item_popularity, self.total_users))

        results['avg_intra_list_diversity'] = np.mean(diversities)
        results['avg_novelty'] = np.mean(novelties)

        # Long-tail analysis
        popularity_values = np.array(list(self.item_popularity.values()))
        median_popularity = np.median(popularity_values)

        long_tail_count = 0
        total_count = 0
        for user_id, rec_list in recommendations.items():
            for item in rec_list[:k]:
                total_count += 1
                if self.item_popularity.get(item, 0) < median_popularity:
                    long_tail_count += 1
        results['long_tail_fraction'] = (
            long_tail_count / total_count if total_count > 0 else 0.0
        )

        return results
```

Aggressively optimizing for diversity can hurt accuracy. Users may click less on diverse lists even if they provide more value long-term. This is why A/B tests on retention (not just clicks) are essential when tuning diversity parameters.
There's a fundamental gap between offline metrics (computed on historical data) and online performance (measured in production). Understanding this gap is critical for avoiding misleading conclusions.
Offline Evaluation: metrics computed on logged historical interactions—fast, cheap, and reproducible, but limited to what the old system happened to show.
Online Evaluation (A/B Testing): metrics measured on live traffic, with users randomly split between variants—slow and costly, but it captures real user behavior.
| Aspect | Offline | Online (A/B) |
|---|---|---|
| Speed | Fast (minutes to hours) | Slow (days to weeks) |
| Cost | Low (compute only) | High (opportunity cost, eng time) |
| Variants testable | Many (100s) | Few (<10 simultaneously) |
| Selection bias | Problem | Not a problem |
| Position bias | Problem | Handled by randomization |
| Measures actual behavior | No | Yes |
| Statistical power | High (full dataset) | Depends on traffic split |
| Reproducibility | Exact | Approximate |
The Offline-Online Correlation Problem:
Research consistently shows that offline metrics correlate weakly (~0.3-0.5) with online metrics. A model with 2% better NDCG might show no improvement—or even degradation—online.
Why the gap exists:
Selection Bias: Offline test sets contain items users chose to interact with. You only observe outcomes for what was shown.
Positional Bias: Items at position 1 get clicked more often regardless of quality. Offline metrics don't account for this.
Confounders: Time of day, device, user mood—many factors affect clicks that aren't captured in offline data.
Feedback Loops: Online, your recommendations affect future user behavior. Offline, this dynamic is missing.
Best Practice: Funnel Approach—use cheap offline metrics to screen many candidate models, then promote only the most promising few to online A/B tests for the final decision.
Advanced techniques like Inverse Propensity Scoring (IPS) can partially debias offline metrics by re-weighting based on the probability that an item was shown. This narrows (but doesn't close) the offline-online gap.
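As a rough illustration of the idea (the click/propensity log below is entirely made up, and real IPS estimators require careful propensity modeling and weight clipping), a self-normalized IPS estimate re-weights each logged impression by the inverse of the probability that the logging policy showed it:

```python
import numpy as np

# Hypothetical logged interactions: (clicked, propensity), where
# propensity = probability the logging policy displayed that item
logs = [(1, 0.9), (0, 0.5), (1, 0.1), (0, 0.8), (1, 0.3)]

# Naive CTR treats every impression equally
naive_ctr = float(np.mean([c for c, _ in logs]))

# IPS re-weights each event by 1/propensity, so items the old policy
# rarely showed are not underrepresented; dividing by the weight sum
# (self-normalized IPS) reduces variance.
weights = np.array([1.0 / p for _, p in logs])
clicks = np.array([c for c, _ in logs], dtype=float)
snips_ctr = float((weights * clicks).sum() / weights.sum())

print(f"naive CTR: {naive_ctr:.3f}")
print(f"SNIPS CTR: {snips_ctr:.3f}")
```

Here the rarely shown items happened to be clicked, so the debiased estimate (≈0.82) is higher than the naive one (0.60)—the naive average was dragged down by the logging policy's exposure choices.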
Metric selection should align with your actual business goals and user experience. Different applications require different primary metrics.
Decision Framework:
Q1: What kind of task is it? Rating prediction → RMSE/MAE; top-N recommendation or retrieval → ranking metrics.
Q2: Is there graded relevance? Yes → NDCG; binary relevance only → Precision/Recall@K or MAP.
Q3: Does position matter? Yes → rank-aware metrics (NDCG, MRR, MAP); no → Precision/Recall@K suffice.
Q4: Beyond-accuracy concerns? Add coverage, diversity, novelty, and serendipity as guardrail metrics alongside your primary one.
| Use Case | Primary Metrics | Secondary Metrics | Business KPI |
|---|---|---|---|
| E-commerce product recs | NDCG@10, Recall@50 | Coverage, GMV lift | Revenue per session |
| Video streaming | Watch time, completion rate | NDCG@10, diversity | Subscriber retention |
| Music discovery | Listen counts, saves | Novelty, intra-list diversity | MAU, session length |
| News feed | CTR, dwell time | Category diversity, recency | Daily active users |
| Search ranking | NDCG@10, MRR | Abandonment rate | Queries per session |
| Ad ranking | AUC-ROC, LogLoss | Calibration, diversity | Revenue, user satisfaction |
We've explored the metrics landscape for recommendation systems: prediction accuracy (RMSE, MAE), ranking quality (Precision/Recall@K, MAP, MRR, NDCG), beyond-accuracy measures (coverage, diversity, novelty, serendipity), and the gap between offline metrics and online business impact.
What's Next:
With evaluation metrics mastered, we'll next explore A/B testing—the gold standard for measuring recommendation system impact in production. You'll learn how to design experiments, calculate sample sizes, avoid common pitfalls, and interpret results with proper statistical rigor.
You now have a comprehensive understanding of recommendation system evaluation metrics. You can select appropriate metrics for your use case, compute them correctly, and understand their limitations. Next, we'll dive into A/B testing methodologies for production validation.