Netflix has 250 million subscribers. Amazon catalogs 353 million products. Spotify hosts 100 million tracks. At this scale, the elegant algorithms we've studied face a brutal reality: they don't work.
Computing pairwise similarities for 250 million users means roughly 3×10^16 unique user pairs; with each similarity touching on the order of 100 co-rated items, that is about 10^18 operations, or decades of single-core compute even at a nanosecond per operation. Storing the full similarity matrix as float64 would require roughly 500 petabytes. These aren't optimization problems; they're fundamental architectural challenges.
This page confronts the scalability wall head-on. We'll analyze where computational costs explode, explore distributed algorithms that tame them, examine approximate methods that trade exactness for tractability, and study the production architectures that make CF work at internet scale.
By the end of this page, you will understand the computational complexity barriers in CF, master distributed similarity computation using MapReduce paradigms, know when and how to apply approximate nearest neighbor methods, and understand the production architecture patterns that enable recommendations at scale.
Let's precisely analyze where collaborative filtering breaks down at scale.
Notation: m = number of users, n = number of items, r = average number of ratings per user, k = number of neighbors kept per item, d = number of latent dimensions.
User-based CF Complexity:
| Operation | Complexity | At m=100M, n=10M, r=100 |
|---|---|---|
| Similarity matrix computation | O(m² · r) | 10^18 operations |
| Similarity matrix storage | O(m²) | 80 petabytes (float64) |
| Single prediction | O(m) | 100M comparisons |
| Top-N recommendations | O(m · n) | 10^15 comparisons |
Item-based CF Complexity:
| Operation | Complexity | At m=100M, n=10M, r=100 |
|---|---|---|
| Similarity matrix computation | O(n² · c), c = avg co-raters per pair | 10^15 operations (c ≈ 10) |
| Similarity matrix storage (top-k pruned) | O(n · k) | 4 GB (k=50, float64) |
| Single prediction | O(r · k) | 5,000 operations |
| Top-N recommendations | O(n · r · k) | 5×10^10 operations |
Key Insight:
Item-based CF has dramatically better online complexity because a prediction touches only the r ≈ 100 items the user has rated and each item's precomputed top-k neighbor list, never the full population of m = 100M users.
This is why Amazon chose item-based CF—it's the only memory-based approach that scales to their catalog size.
| Approach | Offline Cost | Online Cost | Storage | Scalable? |
|---|---|---|---|---|
| User-based CF | O(m² · r) | O(m) | O(m²) | ❌ No |
| Item-based CF | O(n² · c), c = avg co-raters | O(r · k) | O(n · k) | ⚠️ Moderate |
| Matrix Factorization | O(iterations · nnz · d) | O(d) | O((m+n) · d) | ✅ Yes |
| Neural CF | O(epochs · nnz · model_size) | O(d) | O(model_size) | ✅ Yes |
Any algorithm with O(m²) or O(n²) complexity is fundamentally unscalable for large catalogs. Moving from 1M to 10M users increases cost by 100x, not 10x. This is why production systems either use approximations, dimensionality reduction, or avoid pairwise computation entirely.
When single-machine computation is infeasible, we distribute work across clusters. The MapReduce paradigm is particularly well-suited to similarity computation.
MapReduce for Item-based CF:
The key insight: to compute similarity between items i and j, we need users who rated both. We can invert the computation:
Phase 1: Map by User
Input: (user_id, {item_1: r_1, item_2: r_2, ...})
Output: For each pair (i, j) in user's items:
emit((i, j), (r_i, r_j, user_mean))
Phase 2: Reduce by Item Pair
Input: ((i, j), [(r_i1, r_j1, m_1), (r_i2, r_j2, m_2), ...])
Output: Compute adjusted cosine from all user observations
emit((i, j), similarity)
"""Distributed Item-Item Similarity using PySpark This implementation scales to billions of ratings acrosshundreds of machines.""" from pyspark.sql import SparkSessionfrom pyspark.sql import functions as Ffrom pyspark.sql.window import Windowfrom pyspark.sql.types import StructType, StructField, IntegerType, FloatTypeimport numpy as np def compute_item_similarities_distributed( ratings_df, # DataFrame with columns: user_id, item_id, rating min_co_raters: int = 5, top_k: int = 50,): """ Compute top-k similar items for each item using distributed computation. Algorithm: 1. Compute user means 2. Create item pairs from co-ratings 3. Compute adjusted cosine for each pair 4. Keep top-k for each item """ # Step 1: Compute user means user_stats = ( ratings_df .groupBy("user_id") .agg( F.mean("rating").alias("user_mean"), F.stddev("rating").alias("user_std") ) ) # Join back to get deviations ratings_with_dev = ( ratings_df .join(user_stats, "user_id") .withColumn("deviation", F.col("rating") - F.col("user_mean")) .select("user_id", "item_id", "rating", "deviation") ) # Step 2: Self-join to create item pairs # For each user, join their ratings to themselves to get pairs ratings_left = ratings_with_dev.alias("left") ratings_right = ratings_with_dev.alias("right") item_pairs = ( ratings_left .join( ratings_right, (F.col("left.user_id") == F.col("right.user_id")) & (F.col("left.item_id") < F.col("right.item_id")) # Avoid duplicates ) .select( F.col("left.item_id").alias("item_i"), F.col("right.item_id").alias("item_j"), F.col("left.deviation").alias("dev_i"), F.col("right.deviation").alias("dev_j"), ) ) # Step 3: Aggregate by item pair pair_stats = ( item_pairs .groupBy("item_i", "item_j") .agg( F.sum(F.col("dev_i") * F.col("dev_j")).alias("dot_product"), F.sum(F.col("dev_i") ** 2).alias("norm_i_sq"), F.sum(F.col("dev_j") ** 2).alias("norm_j_sq"), F.count("*").alias("n_co_raters"), ) .filter(F.col("n_co_raters") >= min_co_raters) ) # Step 4: Compute adjusted cosine similarity similarities = ( pair_stats .withColumn( "similarity", F.col("dot_product") / ( F.sqrt(F.col("norm_i_sq")) * F.sqrt(F.col("norm_j_sq")) ) ) .filter(F.col("similarity") > 0) # Keep positive similarities only .select("item_i", "item_j", "similarity", "n_co_raters") ) # Step 5: Get top-k for each item # Need to consider both directions (i->j and j->i) sim_both_dirs = ( similarities .select( F.col("item_i").alias("item"), F.col("item_j").alias("neighbor"), "similarity" ) .union( similarities.select( F.col("item_j").alias("item"), F.col("item_i").alias("neighbor"), "similarity" ) ) ) # Rank by similarity within each item window = Window.partitionBy("item").orderBy(F.desc("similarity")) top_k_neighbors = ( sim_both_dirs .withColumn("rank", F.row_number().over(window)) .filter(F.col("rank") <= top_k) .select("item", "neighbor", "similarity", "rank") ) return top_k_neighbors # Memory-efficient variant using blockingdef compute_similarities_blocked( ratings_df, n_blocks: int = 100, min_co_raters: int = 5,): """ Block-based similarity computation for very large item catalogs. Splits items into blocks and processes block pairs sequentially. Reduces memory from O(n²) to O(n²/n_blocks²) per worker. 
""" # Assign items to blocks based on hash blocked_ratings = ( ratings_df .withColumn( "item_block", F.abs(F.hash("item_id")) % n_blocks ) ) all_similarities = None # Process each block pair for block_i in range(n_blocks): for block_j in range(block_i, n_blocks): # Upper triangle left = blocked_ratings.filter(F.col("item_block") == block_i) right = blocked_ratings.filter(F.col("item_block") == block_j) # Same logic as before but only for this block pair block_pairs = ( left.alias("l") .join( right.alias("r"), (F.col("l.user_id") == F.col("r.user_id")) & ( (block_i < block_j) | # Different blocks: all pairs (F.col("l.item_id") < F.col("r.item_id")) # Same block: upper triangle ) ) # ... compute similarities for this block pair ) # Accumulate results if all_similarities is None: all_similarities = block_pairs else: all_similarities = all_similarities.union(block_pairs) return all_similarities # Example cluster configurationprint("Example Spark Configuration for Item Similarity:")print("=" * 50)print("""spark-submit \\ --master yarn \\ --deploy-mode cluster \\ --num-executors 100 \\ --executor-cores 4 \\ --executor-memory 16g \\ --driver-memory 8g \\ --conf spark.sql.shuffle.partitions=1000 \\ --conf spark.dynamicAllocation.enabled=true \\ item_similarity.py Expected performance (100M users, 10M items, 10B ratings):- Time: 2-4 hours- Cluster: 400 cores, 1.6TB RAM- Output: ~500M item pairs with similarity > 0.1""")The self-join (ratings × ratings) is the expensive operation. Key optimizations: (1) Filter to users with >1 rating before join, (2) Use broadcast join if one side fits in memory, (3) Partition by user_id to keep joins local, (4) Use blocking to limit pair explosion.
Exact nearest neighbor search is O(n) per query—too slow for real-time recommendations. Approximate Nearest Neighbor (ANN) methods trade exactness for speed, typically achieving O(log n) or even O(1) query time.
Key Approaches:
1. Locality-Sensitive Hashing (LSH)
Hash similar items to the same bucket with high probability:
For cosine similarity (random hyperplanes):
- Generate r random hyperplanes
- Each item gets hash = (sign(h₁·x), sign(h₂·x), ..., sign(hᵣ·x))
- Similar items collide with probability proportional to similarity
- Use b bands of r rows for tunable precision/recall
Collision Probability: for a single hyperplane, P(hash bit agrees) = 1 - θ/π, where θ = arccos(similarity).
With bands of r hyperplanes each, P(same bucket in one band) = (1 - θ/π)^r, and with b bands, P(becomes a candidate pair) = 1 - (1 - (1 - θ/π)^r)^b.
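To make the banding scheme concrete, here is a minimal NumPy sketch of random-hyperplane LSH; the band/row settings and the synthetic `item_vectors` are illustrative assumptions, not tuned values.

```python
import numpy as np
from collections import defaultdict

def build_lsh_buckets(item_vectors, n_bands=20, rows_per_band=8, seed=0):
    """Hash items into LSH buckets using random hyperplanes (cosine LSH).

    item_vectors: (n_items, dim) dense array (e.g., item embeddings).
    Returns one dict of bucket -> item indices per band, plus the hyperplanes.
    """
    rng = np.random.default_rng(seed)
    n_items, dim = item_vectors.shape
    # One signature bit per hyperplane: the sign of the projection
    hyperplanes = rng.standard_normal((n_bands * rows_per_band, dim))
    signatures = (item_vectors @ hyperplanes.T) > 0        # (n_items, bands * rows)

    band_tables = []
    for b in range(n_bands):
        table = defaultdict(list)
        band_bits = signatures[:, b * rows_per_band:(b + 1) * rows_per_band]
        for item_idx, bits in enumerate(band_bits):
            table[bits.tobytes()].append(item_idx)         # bucket key = band signature
        band_tables.append(table)
    return band_tables, hyperplanes

def lsh_candidates(query_vec, band_tables, hyperplanes, rows_per_band=8):
    """Return candidates: items sharing at least one band bucket with the query."""
    bits = (hyperplanes @ query_vec) > 0
    candidates = set()
    for b, table in enumerate(band_tables):
        key = bits[b * rows_per_band:(b + 1) * rows_per_band].tobytes()
        candidates.update(table.get(key, []))
    return candidates

# Example: 100K random 64-dim item vectors
items = np.random.randn(100_000, 64).astype(np.float32)
tables, planes = build_lsh_buckets(items)
print(len(lsh_candidates(items[0], tables, planes)), "candidates for item 0")
```

Only the candidates returned by the buckets need exact similarity computation, which is what turns an O(n) scan into a near-constant-time lookup.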
2. Hierarchical Navigable Small Worlds (HNSW)
Build a multi-layer graph in which sparse upper layers provide long-range shortcuts and the dense bottom layer contains every item; a query greedily descends from the top layer, refining its nearest-neighbor candidates at each level.
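A minimal sketch using the open-source `hnswlib` library (assuming it is installed); the `M`, `ef_construction`, and `ef` values are illustrative defaults rather than tuned settings.

```python
import numpy as np
import hnswlib

dim, n_items = 128, 100_000
item_vectors = np.random.randn(n_items, dim).astype(np.float32)

# Build the HNSW graph; M controls graph degree, ef_construction build-time quality
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n_items, M=16, ef_construction=200)
index.add_items(item_vectors, np.arange(n_items))

# ef trades query speed for recall at search time
index.set_ef(50)

# Approximate top-10 neighbors for item 0 (distance = 1 - cosine similarity)
labels, distances = index.knn_query(item_vectors[0:1], k=10)
print(labels[0], 1.0 - distances[0])
```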
3. Product Quantization (PQ)
Compress vectors and compute approximate distances on compressed representations:
Compression: 128-dim float32 (512 bytes) → 16 bytes (m=16 subvectors)
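A small NumPy/scikit-learn sketch of the encode-and-compare idea, assuming 16 subvectors of 8 dimensions with 256 centroids each (so each 128-dim float32 vector compresses to 16 bytes); the training data here is synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans

dim, n_sub, n_centroids = 128, 16, 256          # 16 subvectors x 8 dims, 1 byte each
sub_dim = dim // n_sub
train = np.random.randn(20_000, dim).astype(np.float32)

# Train one codebook (256 centroids) per subvector
codebooks = []
for s in range(n_sub):
    km = KMeans(n_clusters=n_centroids, n_init=1, max_iter=25, random_state=0)
    km.fit(train[:, s * sub_dim:(s + 1) * sub_dim])
    codebooks.append(km.cluster_centers_.astype(np.float32))

def pq_encode(x):
    """Compress (n, 128) float32 vectors to (n, 16) uint8 codes (512 bytes -> 16 bytes)."""
    codes = np.empty((len(x), n_sub), dtype=np.uint8)
    for s in range(n_sub):
        sub = x[:, s * sub_dim:(s + 1) * sub_dim]
        dists = ((sub[:, None, :] - codebooks[s][None, :, :]) ** 2).sum(-1)
        codes[:, s] = dists.argmin(1)
    return codes

def pq_distances(query, codes):
    """Approximate squared L2 distance from one query to all encoded vectors (ADC)."""
    # Lookup table: distance from each query subvector to every centroid
    table = np.stack([
        ((query[s * sub_dim:(s + 1) * sub_dim] - codebooks[s]) ** 2).sum(-1)
        for s in range(n_sub)
    ])                                           # (n_sub, n_centroids)
    return table[np.arange(n_sub), codes].sum(1) # sum per-subvector distances per item

codes = pq_encode(train[:1000])
print(pq_distances(train[0], codes)[:5])
```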
4. Tree-based Methods (Annoy, Ball Trees)
Recursively partition the space with random hyperplanes (Annoy) or nested balls (ball trees); a query descends each tree and merges the leaves it reaches into a candidate set.
| Method | Index Build | Query Time | Memory | Recall@k | Best For |
|---|---|---|---|---|---|
| LSH | O(n · r · b) | O(1) avg | O(n · r · b) | 75-90% | Very high-dimensional, streaming |
| HNSW | O(n · log n) | O(log n) | O(n · M) | 95-99% | Best quality/speed tradeoff |
| PQ/IVF-PQ | O(n · d) (k-means + encoding) | O(d + n_probe · n / n_lists) | O(n · bytes_per_code) | 85-95% | Memory-constrained, billions of vectors |
| Annoy | O(n · log n · trees) | O(log n · trees) | O(n · trees) | 90-95% | Static data, simple deployment |
"""Approximate Nearest Neighbors for Fast Similarity Lookup This example shows how to use FAISS for item-based CF at scale.""" import numpy as npimport faissfrom typing import List, Tupleimport time class FastItemNeighbors: """ Fast approximate nearest neighbor lookup using FAISS. Suitable for: - 1M+ items - Real-time queries (<10ms) - 95%+ recall on true nearest neighbors """ def __init__( self, embedding_dim: int = 128, n_lists: int = 1000, # IVF clusters n_probe: int = 50, # Clusters to search at query time use_gpu: bool = False, ): self.dim = embedding_dim self.n_lists = n_lists self.n_probe = n_probe self.use_gpu = use_gpu self.index = None self.item_ids = None def build_index( self, item_embeddings: np.ndarray, item_ids: np.ndarray, ): """ Build FAISS index from item embeddings. Args: item_embeddings: (n_items, embedding_dim) float32 array item_ids: (n_items,) array of item IDs """ n_items = len(item_embeddings) self.item_ids = item_ids print(f"Building index for {n_items:,} items...") start = time.time() # Normalize for cosine similarity embeddings = item_embeddings.copy() faiss.normalize_L2(embeddings) # Create IVF index with flat (exact) quantizer quantizer = faiss.IndexFlatIP(self.dim) # Inner product = cosine on normalized self.index = faiss.IndexIVFFlat( quantizer, self.dim, self.n_lists, faiss.METRIC_INNER_PRODUCT ) # Train on subset if large if n_items > 100000: train_sample = embeddings[np.random.choice(n_items, 100000, replace=False)] else: train_sample = embeddings print("Training index...") self.index.train(train_sample) print("Adding vectors...") self.index.add(embeddings) # Set search parameters self.index.nprobe = self.n_probe # Move to GPU if available if self.use_gpu: res = faiss.StandardGpuResources() self.index = faiss.index_cpu_to_gpu(res, 0, self.index) print(f"Index built in {time.time() - start:.1f}s") def find_neighbors( self, item_ids: List[int], k: int = 50, ) -> List[List[Tuple[int, float]]]: """ Find k nearest neighbors for given items. Returns: List of [(neighbor_id, similarity), ...] 
for each input item """ # Get indices for requested items id_to_idx = {iid: idx for idx, iid in enumerate(self.item_ids)} indices = [id_to_idx[iid] for iid in item_ids if iid in id_to_idx] if not indices: return [[] for _ in item_ids] # Get embeddings and search # Note: In production, you'd store embeddings separately # Here we assume they're retrievable from the index # For IVF indices, we can reconstruct embeddings = np.zeros((len(indices), self.dim), dtype=np.float32) for i, idx in enumerate(indices): self.index.reconstruct(int(idx), embeddings[i]) # Search faiss.normalize_L2(embeddings) similarities, neighbor_indices = self.index.search(embeddings, k + 1) # Convert to item IDs and exclude self results = [] for i, (sims, nbrs) in enumerate(zip(similarities, neighbor_indices)): self_id = item_ids[i] neighbors = [ (int(self.item_ids[nbr_idx]), float(sim)) for nbr_idx, sim in zip(nbrs, sims) if nbr_idx >= 0 and self.item_ids[nbr_idx] != self_id ][:k] results.append(neighbors) return results def benchmark(self, n_queries: int = 1000): """Benchmark query performance.""" if self.item_ids is None: print("Build index first!") return # Random query items query_ids = np.random.choice(self.item_ids, n_queries, replace=False) start = time.time() results = self.find_neighbors(list(query_ids), k=50) elapsed = time.time() - start print(f"\nBenchmark Results:") print(f" Queries: {n_queries}") print(f" Total time: {elapsed:.2f}s") print(f" Per query: {1000 * elapsed / n_queries:.2f}ms") print(f" QPS: {n_queries / elapsed:.0f}") # Demo with synthetic dataif __name__ == "__main__": np.random.seed(42) # Simulate 1M items with 128-dim embeddings n_items = 100000 # Use 100K for demo dim = 128 print(f"Generating {n_items:,} item embeddings...") embeddings = np.random.randn(n_items, dim).astype(np.float32) item_ids = np.arange(n_items) # Build index ann = FastItemNeighbors( embedding_dim=dim, n_lists=1000, n_probe=50, ) ann.build_index(embeddings, item_ids) # Benchmark ann.benchmark(n_queries=1000) # Example query print("\nExample: Neighbors for item 0:") neighbors = ann.find_neighbors([0], k=5)[0] for item_id, sim in neighbors: print(f" Item {item_id}: similarity = {sim:.3f}")FAISS powers similarity search at Facebook (2B+ users), Spotify (100M+ tracks), and many others. With GPU support, FAISS can search 1 billion 128-dim vectors in <10ms. For CF, this means real-time neighbor lookup is no longer the bottleneck.
Rather than computing similarities in the original high-dimensional (and sparse) rating space, we can first reduce dimensionality. This transforms O(n²·m) similarity computation to O(n²·d) where d << m.
Truncated SVD:
The classic approach. Decompose the rating matrix R ≈ U·Σ·V^T, where U (m × d) holds user factors, Σ (d × d) holds the top d singular values, and V (n × d) holds item factors.
Item similarities in reduced space:
sim(i, j) = cosine(V[i], V[j])
With d=100 and n=10M items, we compute similarities from 100-dimensional vectors instead of 100M-dimensional sparse vectors.
Advantages: item vectors become short and dense, discarding small singular values removes much of the noise, and each pairwise similarity costs O(d) instead of an intersection over millions of sparse user ratings.
"""Scalable Item Similarity via Dimensionality Reduction This approach can handle 10M+ items by:1. Reducing to d=100-500 dimensions2. Computing similarities in reduced space3. Using batch operations for efficiency""" import numpy as npfrom scipy.sparse import csr_matrixfrom scipy.sparse.linalg import svdsfrom sklearn.metrics.pairwise import cosine_similarityimport time class ScalableItemSimilarity: """ Two-phase similarity computation: 1. Offline: SVD reduction 2. Similarity: Compute in reduced space """ def __init__( self, n_factors: int = 100, chunk_size: int = 10000, ): self.d = n_factors self.chunk_size = chunk_size self.item_factors = None self.item_ids = None def fit_transform( self, ratings_matrix: csr_matrix, item_ids: np.ndarray, ): """ Learn item factors via SVD. Args: ratings_matrix: n_items × n_users sparse matrix item_ids: Array of item IDs """ self.item_ids = item_ids n_items = ratings_matrix.shape[0] print(f"Computing SVD with d={self.d} on {n_items:,} items...") start = time.time() # Truncated SVD U, sigma, Vt = svds(ratings_matrix.astype(float), k=self.d) # Item factors are U * sqrt(Sigma) for balanced representation self.item_factors = U * np.sqrt(sigma) # Normalize for cosine similarity norms = np.linalg.norm(self.item_factors, axis=1, keepdims=True) norms = np.where(norms > 0, norms, 1) # Avoid division by zero self.item_factors /= norms print(f"SVD completed in {time.time() - start:.1f}s") print(f"Item factors shape: {self.item_factors.shape}") return self def compute_all_neighbors( self, k: int = 50, ) -> dict: """ Compute top-k neighbors for all items using blocked computation. Memory-efficient: processes items in chunks rather than computing full n×n similarity matrix. """ n_items = len(self.item_ids) neighbors = {} print(f"Computing {k}-nearest neighbors for {n_items:,} items...") start = time.time() # Process in chunks for chunk_start in range(0, n_items, self.chunk_size): chunk_end = min(chunk_start + self.chunk_size, n_items) # Compute similarities of this chunk against all items chunk_factors = self.item_factors[chunk_start:chunk_end] # Chunk × All similarity matrix sims = cosine_similarity(chunk_factors, self.item_factors) # Extract top-k for each item in chunk for i, row_sims in enumerate(sims): item_idx = chunk_start + i item_id = self.item_ids[item_idx] # Get top-k+1 (excluding self) top_indices = np.argpartition(row_sims, -k-1)[-k-1:] top_sims = row_sims[top_indices] # Sort by similarity sorted_order = np.argsort(-top_sims) top_indices = top_indices[sorted_order] top_sims = top_sims[sorted_order] # Exclude self and store neighbors[item_id] = [ (int(self.item_ids[idx]), float(sim)) for idx, sim in zip(top_indices, top_sims) if idx != item_idx ][:k] if (chunk_start // self.chunk_size) % 10 == 0: print(f" Processed {chunk_end:,} / {n_items:,} items") print(f"Neighbors computed in {time.time() - start:.1f}s") return neighbors def compute_similarity_matrix_gpu( item_factors: np.ndarray, top_k: int = 50,): """ GPU-accelerated similarity computation using CuPy. Can process 1M+ items in seconds on modern GPUs. 
""" try: import cupy as cp # Move to GPU factors_gpu = cp.asarray(item_factors) # Compute full similarity matrix in batches n_items = len(item_factors) batch_size = 10000 all_neighbors = [] for i in range(0, n_items, batch_size): batch = factors_gpu[i:i+batch_size] # Similarity: batch × all_items sims = cp.dot(batch, factors_gpu.T) # Already normalized # Top-k per row top_k_indices = cp.argpartition(sims, -top_k, axis=1)[:, -top_k:] all_neighbors.append(cp.asnumpy(top_k_indices)) return np.vstack(all_neighbors) except ImportError: print("CuPy not installed. Using CPU fallback.") return None # Complexity comparisonprint("Complexity Comparison:")print("=" * 60)print()print("Direct similarity computation (n=10M items, m=100M users):")print(" Time: O(n² · m) = 10^22 operations (infeasible)")print()print("With SVD reduction (d=100):")print(" SVD: O(n · m · d) = 10^14 (hours)")print(" Similarity: O(n² · d) = 10^16 (hours)")print(" Total: Still challenging at this scale")print()print("With SVD + chunked computation + GPU:")print(" SVD: Distributed (minutes)")print(" Similarity: GPU-accelerated chunks (minutes)")print(" Total: Practical for 10M+ items")Scaling CF requires more than algorithmic optimizations—it requires thoughtful system architecture. Here are the patterns used by major recommendation systems.
Pattern 1: Offline/Online Separation
┌─────────────────────────────────────────────────────────────┐
│ OFFLINE PIPELINE │
│ (runs hourly/daily on Spark/Flink) │
├─────────────────────────────────────────────────────────────┤
│ Raw Ratings → ETL → Model Training → Similarity Compute │
│ ↓ │
│ Precomputed Tables │
│ (user_factors, item_neighbors, etc.) │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ ONLINE SERVING │
│ (real-time, <50ms latency) │
├─────────────────────────────────────────────────────────────┤
│ Request → Lookup User → Lookup Neighbors → Score → Rank │
│ ↓ │
│ Recommendation Response │
└─────────────────────────────────────────────────────────────┘
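A minimal sketch of the online serving path under this pattern, assuming the offline pipeline has already written precomputed tables into Redis; the key names (`user_recent:{id}`, `item_neighbors:{id}`) and JSON value formats are illustrative, not a standard schema.

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def recommend(user_id: str, n: int = 20) -> list:
    """Serve recommendations from precomputed tables only; no model runs online."""
    # 1. Look up the user's recent items, written by the offline/streaming pipeline
    recent_items = json.loads(r.get(f"user_recent:{user_id}") or "[]")

    # 2. Look up precomputed top-k neighbors for each recent item
    scores = {}
    for item_id, rating in recent_items:
        neighbors = json.loads(r.get(f"item_neighbors:{item_id}") or "[]")
        for neighbor_id, similarity in neighbors:
            # 3. Score candidates: similarity-weighted sum over the user's items
            scores[neighbor_id] = scores.get(neighbor_id, 0.0) + similarity * rating

    # 4. Drop already-seen items, rank, and return the top n
    seen = {item_id for item_id, _ in recent_items}
    ranked = sorted(
        (item for item in scores.items() if item[0] not in seen),
        key=lambda x: -x[1],
    )
    return ranked[:n]
```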
Pattern 2: Two-Stage Retrieval + Ranking
Stage 1: Candidate Generation (broad, fast)
- Use ANN to find ~1000 candidates
- Low precision OK, high recall critical
- 5-10ms latency budget
Stage 2: Ranking (narrow, precise)
- Score 1000 candidates with full model
- Consider user context, business rules
- 20-30ms latency budget
Stage 3: Post-processing
- Diversity filters
- Business rule enforcement
- Freshness boosts
- 5-10ms
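A compact sketch of the three stages above, with a brute-force dot product standing in for the ANN retrieval step and a callable placeholder for the full ranking model; the function names and latency comments are illustrative assumptions.

```python
import numpy as np

def recommend_two_stage(user_embedding, item_embeddings, full_scorer,
                        k_candidates=1000, n_final=20):
    """Two-stage retrieval + ranking sketch.

    user_embedding: (d,) array; item_embeddings: (n_items, d) normalized array;
    full_scorer: callable(candidate_ids) -> precise scores. All are stand-ins.
    """
    # Stage 1: candidate generation (broad, fast). A real system uses an ANN index;
    # here a dot product against all items plays that role.
    coarse_scores = item_embeddings @ user_embedding
    candidates = np.argpartition(coarse_scores, -k_candidates)[-k_candidates:]

    # Stage 2: ranking (narrow, precise). Score only the ~1000 candidates.
    precise_scores = full_scorer(candidates)
    ranked = candidates[np.argsort(-precise_scores)]

    # Stage 3: post-processing. A real system applies diversity filters,
    # business rules, and freshness boosts here; we just truncate.
    return ranked[:n_final]

# Usage with synthetic stand-ins
items = np.random.randn(100_000, 64).astype(np.float32)
items /= np.linalg.norm(items, axis=1, keepdims=True)
user = np.random.randn(64).astype(np.float32)
scorer = lambda ids: (items[ids] @ user) + 0.01 * np.random.randn(len(ids))
print(recommend_two_stage(user, items, scorer)[:10])
```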
Pattern 3: Tiered Storage
| Data | Storage | Latency | Capacity |
|---|---|---|---|
| User's recent items | In-memory (Redis) | <1ms | 10M users |
| Item neighbors (top-k) | In-memory (Redis) | <1ms | 10M items |
| Full item embeddings | SSD cache | 1-5ms | 100M items |
| Historical ratings | Distributed DB | 10-100ms | Billions |
| Raw logs | Object storage (S3) | 100ms+ | Petabytes |
Netflix generates and caches personalized recommendation lists for every user nightly. When you open the app, you see precomputed results; real-time updates happen in the background. This 'precompute everything' approach enables sub-second page loads even at 250M-subscriber scale.
Recomputing all similarities from scratch is wasteful when most interactions are unchanged. Incremental updates enable fresh recommendations without full recomputation.
What Triggers Updates: new ratings and interactions, new items entering the catalog, new users, and scheduled batch refreshes.
Incremental Similarity Update:
When user u rates item i with rating r:
For each item j that user u has also rated:
# Update similarity(i, j) incrementally
old_dot = stored_dot_product[i][j]
old_norm_i = stored_norm[i]
old_norm_j = stored_norm[j]
# Add contribution from this user
dev_i = r - user_mean[u]
dev_j = ratings[u][j] - user_mean[u]
new_dot = old_dot + dev_i * dev_j
new_norm_i = sqrt(old_norm_i^2 + dev_i^2)
# new_norm_j unchanged (j's ratings didn't change)
new_similarity = new_dot / (new_norm_i * old_norm_j)
# Update neighbor lists if rank changed significantly
update_neighbor_list(i, j, new_similarity)
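A small Python sketch of this bookkeeping, maintaining per-pair dot products and per-item norms so a new rating touches only the item pairs it participates in; like the pseudocode above, it treats the user mean as fixed between full rebuilds, which is an approximation.

```python
import math
from collections import defaultdict

class IncrementalItemSimilarity:
    """Maintain adjusted-cosine sufficient statistics for incremental updates."""

    def __init__(self):
        self.dot = defaultdict(float)           # (i, j) -> sum of dev_i * dev_j
        self.norm_sq = defaultdict(float)       # i -> sum of dev_i^2
        self.user_ratings = defaultdict(dict)   # u -> {item: rating}
        self.user_mean = {}                     # u -> mean rating (refreshed in batch)

    def add_rating(self, user, item_i, rating):
        mean = self.user_mean.get(user, rating)  # fall back to the rating itself
        dev_i = rating - mean

        # Update pairwise statistics against every item this user already rated
        for item_j, r_j in self.user_ratings[user].items():
            dev_j = r_j - mean
            key = (min(item_i, item_j), max(item_i, item_j))
            self.dot[key] += dev_i * dev_j

        self.norm_sq[item_i] += dev_i ** 2
        self.user_ratings[user][item_i] = rating

    def similarity(self, i, j):
        key = (min(i, j), max(i, j))
        denom = math.sqrt(self.norm_sq[i]) * math.sqrt(self.norm_sq[j])
        return self.dot[key] / denom if denom > 0 else 0.0

# Usage
sim = IncrementalItemSimilarity()
sim.user_mean["u1"] = 3.0
sim.add_rating("u1", "A", 5.0)
sim.add_rating("u1", "B", 4.0)
print(sim.similarity("A", "B"))
```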
Stream Processing Architecture:
User Actions (Kafka) → Stream Processor (Flink) → Updates (Redis)
↓
Aggregate to batch updates
↓
Batch processor (hourly, Spark)
↓
Full model refresh
Real-time Update Strategies:
Additive updates: For embeddings, add weighted recent activity:
user_embedding = α · old_embedding + (1-α) · recent_activity_embedding
Session-based boosting: Boost items related to current session without retraining
Bandit overlays: Use multi-armed bandits to explore new items, update exploitation scores with each interaction
More frequent updates mean fresher recommendations but also more noise (a single accidental click shouldn't reshape recommendations). Most systems update similarities daily but adjust user preference signals more frequently. The right balance depends on how fast preferences change in your domain.
Scalable systems require scalable monitoring. Here's what to track for production CF systems:
Latency Metrics:
| Metric | Target | Alert Threshold |
|---|---|---|
| p50 prediction latency | <5ms | >20ms |
| p99 prediction latency | <50ms | >200ms |
| p99 recommendation list | <100ms | >500ms |
| Cache hit rate | >95% | <80% |
| ANN recall | >95% | <90% |
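A tiny sketch of how these latency targets might be checked from a window of logged prediction latencies; the alert thresholds mirror the table above, and the print-based alerting is a placeholder for a real pager or dashboard.

```python
import numpy as np

def check_latency_slo(latencies_ms, p50_alert=20.0, p99_alert=200.0):
    """Compute p50/p99 over a window of latencies (ms) and flag SLO violations."""
    p50 = np.percentile(latencies_ms, 50)
    p99 = np.percentile(latencies_ms, 99)
    if p50 > p50_alert:
        print(f"ALERT: p50 latency {p50:.1f}ms exceeds {p50_alert}ms threshold")
    if p99 > p99_alert:
        print(f"ALERT: p99 latency {p99:.1f}ms exceeds {p99_alert}ms threshold")
    return p50, p99

# Example: simulated latency samples with a heavy tail
samples = np.random.lognormal(mean=1.0, sigma=0.8, size=10_000)
print(check_latency_slo(samples))
```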
Quality Metrics:
| Metric | What It Measures | Computation |
|---|---|---|
| RMSE/MAE | Rating prediction accuracy | Offline on test set |
| Precision@k | Relevant items in top-k | Requires ground truth |
| Coverage | Fraction of items recommended | Track over time window |
| Diversity | Intra-list similarity | 1 - avg similarity in lists |
| Novelty | Avg popularity of recommendations | Lower = more novel |
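A short sketch computing coverage, intra-list diversity, and the popularity-based novelty proxy from the table above; `rec_lists`, `item_embeddings`, and `item_popularity` are assumed inputs with illustrative formats.

```python
import numpy as np

def recommendation_quality(rec_lists, item_embeddings, item_popularity, catalog_size):
    """rec_lists: list of item-id lists; item_embeddings: id -> normalized vector;
    item_popularity: id -> fraction of users who interacted with the item."""
    # Coverage: fraction of the catalog appearing in at least one list
    recommended = {i for recs in rec_lists for i in recs}
    coverage = len(recommended) / catalog_size

    # Diversity: 1 - average pairwise cosine similarity within each list
    diversities = []
    for recs in rec_lists:
        vecs = np.stack([item_embeddings[i] for i in recs])
        sims = vecs @ vecs.T
        n = len(recs)
        off_diag = (sims.sum() - np.trace(sims)) / (n * (n - 1))
        diversities.append(1.0 - off_diag)

    # Novelty proxy from the table: average popularity of recommended items
    # (lower = more novel)
    avg_popularity = float(np.mean(
        [item_popularity[i] for recs in rec_lists for i in recs]
    ))

    return coverage, float(np.mean(diversities)), avg_popularity

# Tiny synthetic usage
emb = {i: v / np.linalg.norm(v) for i, v in enumerate(np.random.randn(100, 16))}
pop = {i: (i + 1) / 200 for i in range(100)}
print(recommendation_quality([[1, 5, 9], [2, 5, 40]], emb, pop, catalog_size=100))
```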
Always A/B test algorithm changes. Offline metrics (RMSE) often don't correlate with online metrics (click-through rate, conversions). Run tests for at least 1-2 weeks to account for day-of-week effects. Use proper statistical significance testing—at scale, even tiny differences appear 'significant' with enough data.
Module Complete:
You have now completed the Collaborative Filtering module, from foundational concepts and neighborhood methods through the scaling and production techniques covered on this page.
Collaborative filtering remains a cornerstone of modern recommendation systems. While deep learning approaches are increasingly popular, CF methods provide interpretable baselines, handle cold-start through hybrid approaches, and—when properly optimized—scale to hundreds of millions of users.
You have mastered collaborative filtering—from foundational concepts through production-scale deployment. This knowledge forms the basis for understanding matrix factorization, neural collaborative filtering, and hybrid recommendation systems covered in subsequent modules.