Netflix has 250 million subscribers. Amazon catalogs 353 million products. Spotify hosts 100 million tracks. At this scale, the elegant algorithms we've studied face a brutal reality: they don't work.
Computing pairwise similarities for 250 million users means roughly 3×10^16 unique user pairs; with each similarity touching on the order of 100 co-rated items, that is about 10^18 operations, or decades of single-core compute even at a nanosecond per operation. Storing the full similarity matrix as float64 would require roughly 500 petabytes. These aren't optimization problems; they're fundamental architectural challenges.
This page confronts the scalability wall head-on. We'll analyze where computational costs explode, explore distributed algorithms that tame them, examine approximate methods that trade exactness for tractability, and study the production architectures that make CF work at internet scale.
By the end of this page, you will understand the computational complexity barriers in CF, master distributed similarity computation using MapReduce paradigms, know when and how to apply approximate nearest neighbor methods, and understand the production architecture patterns that enable recommendations at scale.
Let's precisely analyze where collaborative filtering breaks down at scale.
Notation: m = number of users, n = number of items, r = average number of ratings per user, k = number of neighbors kept per item, d = number of latent dimensions.
User-based CF Complexity:
| Operation | Complexity | At m=100M, n=10M, r=100 |
|---|---|---|
| Similarity matrix computation | O(m² · r) | 10^18 operations |
| Similarity matrix storage | O(m²) | 80 petabytes (float64) |
| Single prediction | O(m) | 100M comparisons |
| Top-N recommendations | O(m · n) | 10^15 comparisons |
Item-based CF Complexity:
| Operation | Complexity | At m=100M, n=10M, r=100 |
|---|---|---|
| Similarity matrix computation | O(n² · c), c = avg co-raters per pair | 10^15 operations (c ≈ 10) |
| Similarity matrix storage (top-k pruned) | O(n · k) | 4 GB (k=50, float64) |
| Single prediction | O(r · k) | 5,000 operations |
| Top-N recommendations | O(n · r · k) | 5×10^10 operations |
Key Insight:
Item-based CF has dramatically better online complexity because a prediction touches only the r ≈ 100 items the user has rated and each item's precomputed top-k neighbor list, never the full population of m = 100M users.
This is why Amazon chose item-based CF—it's the only memory-based approach that scales to their catalog size.
| Approach | Offline Cost | Online Cost | Storage | Scalable? |
|---|---|---|---|---|
| User-based CF | O(m² · r) | O(m) | O(m²) | ❌ No |
| Item-based CF | O(n² · c), c = avg co-raters | O(r · k) | O(n · k) | ⚠️ Moderate |
| Matrix Factorization | O(iterations · nnz · d) | O(d) | O((m+n) · d) | ✅ Yes |
| Neural CF | O(epochs · nnz · model_size) | O(d) | O(model_size) | ✅ Yes |
Any algorithm with O(m²) or O(n²) complexity is fundamentally unscalable for large catalogs. Moving from 1M to 10M users increases cost by 100x, not 10x. This is why production systems either use approximations, dimensionality reduction, or avoid pairwise computation entirely.
When single-machine computation is infeasible, we distribute work across clusters. The MapReduce paradigm is particularly well-suited to similarity computation.
MapReduce for Item-based CF:
The key insight: to compute similarity between items i and j, we need users who rated both. We can invert the computation:
Phase 1: Map by User
Input: (user_id, {item_1: r_1, item_2: r_2, ...})
Output: For each pair (i, j) in user's items:
emit((i, j), (r_i, r_j, user_mean))
Phase 2: Reduce by Item Pair
Input: ((i, j), [(r_i1, r_j1, m_1), (r_i2, r_j2, m_2), ...])
Output: Compute adjusted cosine from all user observations
emit((i, j), similarity)
"""Distributed Item-Item Similarity using PySpark This implementation scales to billions of ratings acrosshundreds of machines.""" from pyspark.sql import SparkSessionfrom pyspark.sql import functions as Ffrom pyspark.sql.window import Windowfrom pyspark.sql.types import StructType, StructField, IntegerType, FloatTypeimport numpy as np def compute_item_similarities_distributed( ratings_df, # DataFrame with columns: user_id, item_id, rating min_co_raters: int = 5, top_k: int = 50,): """ Compute top-k similar items for each item using distributed computation. Algorithm: 1. Compute user means 2. Create item pairs from co-ratings 3. Compute adjusted cosine for each pair 4. Keep top-k for each item """ # Step 1: Compute user means user_stats = ( ratings_df .groupBy("user_id") .agg( F.mean("rating").alias("user_mean"), F.stddev("rating").alias("user_std") ) ) # Join back to get deviations ratings_with_dev = ( ratings_df .join(user_stats, "user_id") .withColumn("deviation", F.col("rating") - F.col("user_mean")) .select("user_id", "item_id", "rating", "deviation") ) # Step 2: Self-join to create item pairs # For each user, join their ratings to themselves to get pairs ratings_left = ratings_with_dev.alias("left") ratings_right = ratings_with_dev.alias("right") item_pairs = ( ratings_left .join( ratings_right, (F.col("left.user_id") == F.col("right.user_id")) & (F.col("left.item_id") < F.col("right.item_id")) # Avoid duplicates ) .select( F.col("left.item_id").alias("item_i"), F.col("right.item_id").alias("item_j"), F.col("left.deviation").alias("dev_i"), F.col("right.deviation").alias("dev_j"), ) ) # Step 3: Aggregate by item pair pair_stats = ( item_pairs .groupBy("item_i", "item_j") .agg( F.sum(F.col("dev_i") * F.col("dev_j")).alias("dot_product"), F.sum(F.col("dev_i") ** 2).alias("norm_i_sq"), F.sum(F.col("dev_j") ** 2).alias("norm_j_sq"), F.count("*").alias("n_co_raters"), ) .filter(F.col("n_co_raters") >= min_co_raters) ) # Step 4: Compute adjusted cosine similarity similarities = ( pair_stats .withColumn( "similarity", F.col("dot_product") / ( F.sqrt(F.col("norm_i_sq")) * F.sqrt(F.col("norm_j_sq")) ) ) .filter(F.col("similarity") > 0) # Keep positive similarities only .select("item_i", "item_j", "similarity", "n_co_raters") ) # Step 5: Get top-k for each item # Need to consider both directions (i->j and j->i) sim_both_dirs = ( similarities .select( F.col("item_i").alias("item"), F.col("item_j").alias("neighbor"), "similarity" ) .union( similarities.select( F.col("item_j").alias("item"), F.col("item_i").alias("neighbor"), "similarity" ) ) ) # Rank by similarity within each item window = Window.partitionBy("item").orderBy(F.desc("similarity")) top_k_neighbors = ( sim_both_dirs .withColumn("rank", F.row_number().over(window)) .filter(F.col("rank") <= top_k) .select("item", "neighbor", "similarity", "rank") ) return top_k_neighbors # Memory-efficient variant using blockingdef compute_similarities_blocked( ratings_df, n_blocks: int = 100, min_co_raters: int = 5,): """ Block-based similarity computation for very large item catalogs. Splits items into blocks and processes block pairs sequentially. Reduces memory from O(n²) to O(n²/n_blocks²) per worker. 
""" # Assign items to blocks based on hash blocked_ratings = ( ratings_df .withColumn( "item_block", F.abs(F.hash("item_id")) % n_blocks ) ) all_similarities = None # Process each block pair for block_i in range(n_blocks): for block_j in range(block_i, n_blocks): # Upper triangle left = blocked_ratings.filter(F.col("item_block") == block_i) right = blocked_ratings.filter(F.col("item_block") == block_j) # Same logic as before but only for this block pair block_pairs = ( left.alias("l") .join( right.alias("r"), (F.col("l.user_id") == F.col("r.user_id")) & ( (block_i < block_j) | # Different blocks: all pairs (F.col("l.item_id") < F.col("r.item_id")) # Same block: upper triangle ) ) # ... compute similarities for this block pair ) # Accumulate results if all_similarities is None: all_similarities = block_pairs else: all_similarities = all_similarities.union(block_pairs) return all_similarities # Example cluster configurationprint("Example Spark Configuration for Item Similarity:")print("=" * 50)print("""spark-submit \\ --master yarn \\ --deploy-mode cluster \\ --num-executors 100 \\ --executor-cores 4 \\ --executor-memory 16g \\ --driver-memory 8g \\ --conf spark.sql.shuffle.partitions=1000 \\ --conf spark.dynamicAllocation.enabled=true \\ item_similarity.py Expected performance (100M users, 10M items, 10B ratings):- Time: 2-4 hours- Cluster: 400 cores, 1.6TB RAM- Output: ~500M item pairs with similarity > 0.1""")The self-join (ratings × ratings) is the expensive operation. Key optimizations: (1) Filter to users with >1 rating before join, (2) Use broadcast join if one side fits in memory, (3) Partition by user_id to keep joins local, (4) Use blocking to limit pair explosion.
Exact nearest neighbor search is O(n) per query—too slow for real-time recommendations. Approximate Nearest Neighbor (ANN) methods trade exactness for speed, typically achieving O(log n) or even O(1) query time.
Key Approaches:
1. Locality-Sensitive Hashing (LSH)
Hash similar items to the same bucket with high probability:
For cosine similarity (random hyperplanes):
- Generate r random hyperplanes
- Each item gets hash = (sign(h₁·x), sign(h₂·x), ..., sign(hᵣ·x))
- Similar items collide with probability proportional to similarity
- Use b bands of r rows for tunable precision/recall
Collision Probability: for a single hyperplane, P(hash bit agrees) = 1 - θ/π, where θ = arccos(similarity).
With bands of r hyperplanes each, P(same bucket in one band) = (1 - θ/π)^r, and with b bands, P(becomes a candidate pair) = 1 - (1 - (1 - θ/π)^r)^b.
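To make the banding scheme concrete, here is a minimal NumPy sketch of random-hyperplane LSH; the band/row settings and the synthetic `item_vectors` are illustrative assumptions, not tuned values.

```python
import numpy as np
from collections import defaultdict

def build_lsh_buckets(item_vectors, n_bands=20, rows_per_band=8, seed=0):
    """Hash items into LSH buckets using random hyperplanes (cosine LSH).

    item_vectors: (n_items, dim) dense array (e.g., item embeddings).
    Returns one dict of bucket -> item indices per band, plus the hyperplanes.
    """
    rng = np.random.default_rng(seed)
    n_items, dim = item_vectors.shape
    # One signature bit per hyperplane: the sign of the projection
    hyperplanes = rng.standard_normal((n_bands * rows_per_band, dim))
    signatures = (item_vectors @ hyperplanes.T) > 0        # (n_items, bands * rows)

    band_tables = []
    for b in range(n_bands):
        table = defaultdict(list)
        band_bits = signatures[:, b * rows_per_band:(b + 1) * rows_per_band]
        for item_idx, bits in enumerate(band_bits):
            table[bits.tobytes()].append(item_idx)         # bucket key = band signature
        band_tables.append(table)
    return band_tables, hyperplanes

def lsh_candidates(query_vec, band_tables, hyperplanes, rows_per_band=8):
    """Return candidates: items sharing at least one band bucket with the query."""
    bits = (hyperplanes @ query_vec) > 0
    candidates = set()
    for b, table in enumerate(band_tables):
        key = bits[b * rows_per_band:(b + 1) * rows_per_band].tobytes()
        candidates.update(table.get(key, []))
    return candidates

# Example: 100K random 64-dim item vectors
items = np.random.randn(100_000, 64).astype(np.float32)
tables, planes = build_lsh_buckets(items)
print(len(lsh_candidates(items[0], tables, planes)), "candidates for item 0")
```

Only the candidates returned by the buckets need exact similarity computation, which is what turns an O(n) scan into a near-constant-time lookup.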
2. Hierarchical Navigable Small Worlds (HNSW)
Build a multi-layer graph in which sparse upper layers provide long-range shortcuts and the dense bottom layer contains every item; a query greedily descends from the top layer, refining its nearest-neighbor candidates at each level.
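A minimal sketch using the open-source `hnswlib` library (assuming it is installed); the `M`, `ef_construction`, and `ef` values are illustrative defaults rather than tuned settings.

```python
import numpy as np
import hnswlib

dim, n_items = 128, 100_000
item_vectors = np.random.randn(n_items, dim).astype(np.float32)

# Build the HNSW graph; M controls graph degree, ef_construction build-time quality
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n_items, M=16, ef_construction=200)
index.add_items(item_vectors, np.arange(n_items))

# ef trades query speed for recall at search time
index.set_ef(50)

# Approximate top-10 neighbors for item 0 (distance = 1 - cosine similarity)
labels, distances = index.knn_query(item_vectors[0:1], k=10)
print(labels[0], 1.0 - distances[0])
```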
3. Product Quantization (PQ)
Compress vectors and compute approximate distances on compressed representations:
Compression: 128-dim float32 (512 bytes) → 16 bytes (m=16 subvectors)
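A small NumPy/scikit-learn sketch of the encode-and-compare idea, assuming 16 subvectors of 8 dimensions with 256 centroids each (so each 128-dim float32 vector compresses to 16 bytes); the training data here is synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans

dim, n_sub, n_centroids = 128, 16, 256          # 16 subvectors x 8 dims, 1 byte each
sub_dim = dim // n_sub
train = np.random.randn(20_000, dim).astype(np.float32)

# Train one codebook (256 centroids) per subvector
codebooks = []
for s in range(n_sub):
    km = KMeans(n_clusters=n_centroids, n_init=1, max_iter=25, random_state=0)
    km.fit(train[:, s * sub_dim:(s + 1) * sub_dim])
    codebooks.append(km.cluster_centers_.astype(np.float32))

def pq_encode(x):
    """Compress (n, 128) float32 vectors to (n, 16) uint8 codes (512 bytes -> 16 bytes)."""
    codes = np.empty((len(x), n_sub), dtype=np.uint8)
    for s in range(n_sub):
        sub = x[:, s * sub_dim:(s + 1) * sub_dim]
        dists = ((sub[:, None, :] - codebooks[s][None, :, :]) ** 2).sum(-1)
        codes[:, s] = dists.argmin(1)
    return codes

def pq_distances(query, codes):
    """Approximate squared L2 distance from one query to all encoded vectors (ADC)."""
    # Lookup table: distance from each query subvector to every centroid
    table = np.stack([
        ((query[s * sub_dim:(s + 1) * sub_dim] - codebooks[s]) ** 2).sum(-1)
        for s in range(n_sub)
    ])                                           # (n_sub, n_centroids)
    return table[np.arange(n_sub), codes].sum(1) # sum per-subvector distances per item

codes = pq_encode(train[:1000])
print(pq_distances(train[0], codes)[:5])
```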
4. Tree-based Methods (Annoy, Ball Trees)
Recursively partition the space with random hyperplanes (Annoy) or nested balls (ball trees); a query descends each tree and merges the leaves it reaches into a candidate set.
| Method | Index Build | Query Time | Memory | Recall@k | Best For |
|---|---|---|---|---|---|
| LSH | O(n · r · b) | O(1) avg | O(n · r · b) | 75-90% | Very high-dimensional, streaming |
| HNSW | O(n · log n) | O(log n) | O(n · M) | 95-99% | Best quality/speed tradeoff |
| PQ/IVF-PQ | O(n · d) (k-means + encoding) | O(d + n_probe · n / n_lists) | O(n · bytes_per_code) | 85-95% | Memory-constrained, billions of vectors |
| Annoy | O(n · log n · trees) | O(log n · trees) | O(n · trees) | 90-95% | Static data, simple deployment |
"""Approximate Nearest Neighbors for Fast Similarity Lookup This example shows how to use FAISS for item-based CF at scale.""" import numpy as npimport faissfrom typing import List, Tupleimport time class FastItemNeighbors: """ Fast approximate nearest neighbor lookup using FAISS. Suitable for: - 1M+ items - Real-time queries (<10ms) - 95%+ recall on true nearest neighbors """ def __init__( self, embedding_dim: int = 128, n_lists: int = 1000, # IVF clusters n_probe: int = 50, # Clusters to search at query time use_gpu: bool = False, ): self.dim = embedding_dim self.n_lists = n_lists self.n_probe = n_probe self.use_gpu = use_gpu self.index = None self.item_ids = None def build_index( self, item_embeddings: np.ndarray, item_ids: np.ndarray, ): """ Build FAISS index from item embeddings. Args: item_embeddings: (n_items, embedding_dim) float32 array item_ids: (n_items,) array of item IDs """ n_items = len(item_embeddings) self.item_ids = item_ids print(f"Building index for {n_items:,} items...") start = time.time() # Normalize for cosine similarity embeddings = item_embeddings.copy() faiss.normalize_L2(embeddings) # Create IVF index with flat (exact) quantizer quantizer = faiss.IndexFlatIP(self.dim) # Inner product = cosine on normalized self.index = faiss.IndexIVFFlat( quantizer, self.dim, self.n_lists, faiss.METRIC_INNER_PRODUCT ) # Train on subset if large if n_items > 100000: train_sample = embeddings[np.random.choice(n_items, 100000, replace=False)] else: train_sample = embeddings print("Training index...") self.index.train(train_sample) print("Adding vectors...") self.index.add(embeddings) # Set search parameters self.index.nprobe = self.n_probe # Move to GPU if available if self.use_gpu: res = faiss.StandardGpuResources() self.index = faiss.index_cpu_to_gpu(res, 0, self.index) print(f"Index built in {time.time() - start:.1f}s") def find_neighbors( self, item_ids: List[int], k: int = 50, ) -> List[List[Tuple[int, float]]]: """ Find k nearest neighbors for given items. Returns: List of [(neighbor_id, similarity), ...] 
for each input item """ # Get indices for requested items id_to_idx = {iid: idx for idx, iid in enumerate(self.item_ids)} indices = [id_to_idx[iid] for iid in item_ids if iid in id_to_idx] if not indices: return [[] for _ in item_ids] # Get embeddings and search # Note: In production, you'd store embeddings separately # Here we assume they're retrievable from the index # For IVF indices, we can reconstruct embeddings = np.zeros((len(indices), self.dim), dtype=np.float32) for i, idx in enumerate(indices): self.index.reconstruct(int(idx), embeddings[i]) # Search faiss.normalize_L2(embeddings) similarities, neighbor_indices = self.index.search(embeddings, k + 1) # Convert to item IDs and exclude self results = [] for i, (sims, nbrs) in enumerate(zip(similarities, neighbor_indices)): self_id = item_ids[i] neighbors = [ (int(self.item_ids[nbr_idx]), float(sim)) for nbr_idx, sim in zip(nbrs, sims) if nbr_idx >= 0 and self.item_ids[nbr_idx] != self_id ][:k] results.append(neighbors) return results def benchmark(self, n_queries: int = 1000): """Benchmark query performance.""" if self.item_ids is None: print("Build index first!") return # Random query items query_ids = np.random.choice(self.item_ids, n_queries, replace=False) start = time.time() results = self.find_neighbors(list(query_ids), k=50) elapsed = time.time() - start print(f"\nBenchmark Results:") print(f" Queries: {n_queries}") print(f" Total time: {elapsed:.2f}s") print(f" Per query: {1000 * elapsed / n_queries:.2f}ms") print(f" QPS: {n_queries / elapsed:.0f}") # Demo with synthetic dataif __name__ == "__main__": np.random.seed(42) # Simulate 1M items with 128-dim embeddings n_items = 100000 # Use 100K for demo dim = 128 print(f"Generating {n_items:,} item embeddings...") embeddings = np.random.randn(n_items, dim).astype(np.float32) item_ids = np.arange(n_items) # Build index ann = FastItemNeighbors( embedding_dim=dim, n_lists=1000, n_probe=50, ) ann.build_index(embeddings, item_ids) # Benchmark ann.benchmark(n_queries=1000) # Example query print("\nExample: Neighbors for item 0:") neighbors = ann.find_neighbors([0], k=5)[0] for item_id, sim in neighbors: print(f" Item {item_id}: similarity = {sim:.3f}")FAISS powers similarity search at Facebook (2B+ users), Spotify (100M+ tracks), and many others. With GPU support, FAISS can search 1 billion 128-dim vectors in <10ms. For CF, this means real-time neighbor lookup is no longer the bottleneck.
Rather than computing similarities in the original high-dimensional (and sparse) rating space, we can first reduce dimensionality. This transforms O(n²·m) similarity computation to O(n²·d) where d << m.
Truncated SVD:
The classic approach. Decompose the rating matrix R ≈ U·Σ·V^T, where U (m × d) holds user factors, Σ (d × d) holds the top d singular values, and V (n × d) holds item factors.
Item similarities in reduced space:
sim(i, j) = cosine(V[i], V[j])
With d=100 and n=10M items, we compute similarities from 100-dimensional vectors instead of 100M-dimensional sparse vectors.
Advantages: item vectors become short and dense, discarding small singular values removes much of the noise, and each pairwise similarity costs O(d) instead of an intersection over millions of sparse user ratings.
"""Scalable Item Similarity via Dimensionality Reduction This approach can handle 10M+ items by:1. Reducing to d=100-500 dimensions2. Computing similarities in reduced space3. Using batch operations for efficiency""" import numpy as npfrom scipy.sparse import csr_matrixfrom scipy.sparse.linalg import svdsfrom sklearn.metrics.pairwise import cosine_similarityimport time class ScalableItemSimilarity: """ Two-phase similarity computation: 1. Offline: SVD reduction 2. Similarity: Compute in reduced space """ def __init__( self, n_factors: int = 100, chunk_size: int = 10000, ): self.d = n_factors self.chunk_size = chunk_size self.item_factors = None self.item_ids = None def fit_transform( self, ratings_matrix: csr_matrix, item_ids: np.ndarray, ): """ Learn item factors via SVD. Args: ratings_matrix: n_items × n_users sparse matrix item_ids: Array of item IDs """ self.item_ids = item_ids n_items = ratings_matrix.shape[0] print(f"Computing SVD with d={self.d} on {n_items:,} items...") start = time.time() # Truncated SVD U, sigma, Vt = svds(ratings_matrix.astype(float), k=self.d) # Item factors are U * sqrt(Sigma) for balanced representation self.item_factors = U * np.sqrt(sigma) # Normalize for cosine similarity norms = np.linalg.norm(self.item_factors, axis=1, keepdims=True) norms = np.where(norms > 0, norms, 1) # Avoid division by zero self.item_factors /= norms print(f"SVD completed in {time.time() - start:.1f}s") print(f"Item factors shape: {self.item_factors.shape}") return self def compute_all_neighbors( self, k: int = 50, ) -> dict: """ Compute top-k neighbors for all items using blocked computation. Memory-efficient: processes items in chunks rather than computing full n×n similarity matrix. """ n_items = len(self.item_ids) neighbors = {} print(f"Computing {k}-nearest neighbors for {n_items:,} items...") start = time.time() # Process in chunks for chunk_start in range(0, n_items, self.chunk_size): chunk_end = min(chunk_start + self.chunk_size, n_items) # Compute similarities of this chunk against all items chunk_factors = self.item_factors[chunk_start:chunk_end] # Chunk × All similarity matrix sims = cosine_similarity(chunk_factors, self.item_factors) # Extract top-k for each item in chunk for i, row_sims in enumerate(sims): item_idx = chunk_start + i item_id = self.item_ids[item_idx] # Get top-k+1 (excluding self) top_indices = np.argpartition(row_sims, -k-1)[-k-1:] top_sims = row_sims[top_indices] # Sort by similarity sorted_order = np.argsort(-top_sims) top_indices = top_indices[sorted_order] top_sims = top_sims[sorted_order] # Exclude self and store neighbors[item_id] = [ (int(self.item_ids[idx]), float(sim)) for idx, sim in zip(top_indices, top_sims) if idx != item_idx ][:k] if (chunk_start // self.chunk_size) % 10 == 0: print(f" Processed {chunk_end:,} / {n_items:,} items") print(f"Neighbors computed in {time.time() - start:.1f}s") return neighbors def compute_similarity_matrix_gpu( item_factors: np.ndarray, top_k: int = 50,): """ GPU-accelerated similarity computation using CuPy. Can process 1M+ items in seconds on modern GPUs. 
""" try: import cupy as cp # Move to GPU factors_gpu = cp.asarray(item_factors) # Compute full similarity matrix in batches n_items = len(item_factors) batch_size = 10000 all_neighbors = [] for i in range(0, n_items, batch_size): batch = factors_gpu[i:i+batch_size] # Similarity: batch × all_items sims = cp.dot(batch, factors_gpu.T) # Already normalized # Top-k per row top_k_indices = cp.argpartition(sims, -top_k, axis=1)[:, -top_k:] all_neighbors.append(cp.asnumpy(top_k_indices)) return np.vstack(all_neighbors) except ImportError: print("CuPy not installed. Using CPU fallback.") return None # Complexity comparisonprint("Complexity Comparison:")print("=" * 60)print()print("Direct similarity computation (n=10M items, m=100M users):")print(" Time: O(n² · m) = 10^22 operations (infeasible)")print()print("With SVD reduction (d=100):")print(" SVD: O(n · m · d) = 10^14 (hours)")print(" Similarity: O(n² · d) = 10^16 (hours)")print(" Total: Still challenging at this scale")print()print("With SVD + chunked computation + GPU:")print(" SVD: Distributed (minutes)")print(" Similarity: GPU-accelerated chunks (minutes)")print(" Total: Practical for 10M+ items")Scaling CF requires more than algorithmic optimizations—it requires thoughtful system architecture. Here are the patterns used by major recommendation systems.
Pattern 1: Offline/Online Separation
┌─────────────────────────────────────────────────────────────┐
│ OFFLINE PIPELINE │
│ (runs hourly/daily on Spark/Flink) │
├─────────────────────────────────────────────────────────────┤
│ Raw Ratings → ETL → Model Training → Similarity Compute │
│ ↓ │
│ Precomputed Tables │
│ (user_factors, item_neighbors, etc.) │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ ONLINE SERVING │
│ (real-time, <50ms latency) │
├─────────────────────────────────────────────────────────────┤
│ Request → Lookup User → Lookup Neighbors → Score → Rank │
│ ↓ │
│ Recommendation Response │
└─────────────────────────────────────────────────────────────┘
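A minimal sketch of the online serving path under this pattern, assuming the offline pipeline has already written precomputed tables into Redis; the key names (`user_recent:{id}`, `item_neighbors:{id}`) and JSON value formats are illustrative, not a standard schema.

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def recommend(user_id: str, n: int = 20) -> list:
    """Serve recommendations from precomputed tables only; no model runs online."""
    # 1. Look up the user's recent items, written by the offline/streaming pipeline
    recent_items = json.loads(r.get(f"user_recent:{user_id}") or "[]")

    # 2. Look up precomputed top-k neighbors for each recent item
    scores = {}
    for item_id, rating in recent_items:
        neighbors = json.loads(r.get(f"item_neighbors:{item_id}") or "[]")
        for neighbor_id, similarity in neighbors:
            # 3. Score candidates: similarity-weighted sum over the user's items
            scores[neighbor_id] = scores.get(neighbor_id, 0.0) + similarity * rating

    # 4. Drop already-seen items, rank, and return the top n
    seen = {item_id for item_id, _ in recent_items}
    ranked = sorted(
        (item for item in scores.items() if item[0] not in seen),
        key=lambda x: -x[1],
    )
    return ranked[:n]
```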
Pattern 2: Two-Stage Retrieval + Ranking
Stage 1: Candidate Generation (broad, fast)
- Use ANN to find ~1000 candidates
- Low precision OK, high recall critical
- 5-10ms latency budget
Stage 2: Ranking (narrow, precise)
- Score 1000 candidates with full model
- Consider user context, business rules
- 20-30ms latency budget
Stage 3: Post-processing
- Diversity filters
- Business rule enforcement
- Freshness boosts
- 5-10ms
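A compact sketch of the three stages above, with a brute-force dot product standing in for the ANN retrieval step and a callable placeholder for the full ranking model; the function names and latency comments are illustrative assumptions.

```python
import numpy as np

def recommend_two_stage(user_embedding, item_embeddings, full_scorer,
                        k_candidates=1000, n_final=20):
    """Two-stage retrieval + ranking sketch.

    user_embedding: (d,) array; item_embeddings: (n_items, d) normalized array;
    full_scorer: callable(candidate_ids) -> precise scores. All are stand-ins.
    """
    # Stage 1: candidate generation (broad, fast). A real system uses an ANN index;
    # here a dot product against all items plays that role.
    coarse_scores = item_embeddings @ user_embedding
    candidates = np.argpartition(coarse_scores, -k_candidates)[-k_candidates:]

    # Stage 2: ranking (narrow, precise). Score only the ~1000 candidates.
    precise_scores = full_scorer(candidates)
    ranked = candidates[np.argsort(-precise_scores)]

    # Stage 3: post-processing. A real system applies diversity filters,
    # business rules, and freshness boosts here; we just truncate.
    return ranked[:n_final]

# Usage with synthetic stand-ins
items = np.random.randn(100_000, 64).astype(np.float32)
items /= np.linalg.norm(items, axis=1, keepdims=True)
user = np.random.randn(64).astype(np.float32)
scorer = lambda ids: (items[ids] @ user) + 0.01 * np.random.randn(len(ids))
print(recommend_two_stage(user, items, scorer)[:10])
```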
Pattern 3: Tiered Storage
| Data | Storage | Latency | Capacity |
|---|---|---|---|
| User's recent items | In-memory (Redis) | <1ms | 10M users |
| Item neighbors (top-k) | In-memory (Redis) | <1ms | 10M items |
| Full item embeddings | SSD cache | 1-5ms | 100M items |
| Historical ratings | Distributed DB | 10-100ms | Billions |
| Raw logs | Object storage (S3) | 100ms+ | Petabytes |
Netflix generates and caches personalized recommendation lists for every user nightly. When you open the app, you see precomputed results; real-time updates happen in the background. This 'precompute everything' approach enables sub-second page loads even at 250M-subscriber scale.
Recomputing all similarities from scratch is wasteful when most interactions are unchanged. Incremental updates enable fresh recommendations without full recomputation.
What Triggers Updates: new ratings and interactions, new items entering the catalog, new users, and scheduled batch refreshes.
Incremental Similarity Update:
When user u rates item i with rating r:
For each item j that user u has also rated:
# Update similarity(i, j) incrementally
old_dot = stored_dot_product[i][j]
old_norm_i = stored_norm[i]
old_norm_j = stored_norm[j]
# Add contribution from this user
dev_i = r - user_mean[u]
dev_j = ratings[u][j] - user_mean[u]
new_dot = old_dot + dev_i * dev_j
new_norm_i = sqrt(old_norm_i^2 + dev_i^2)
# new_norm_j unchanged (j's ratings didn't change)
new_similarity = new_dot / (new_norm_i * old_norm_j)
# Update neighbor lists if rank changed significantly
update_neighbor_list(i, j, new_similarity)
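A small Python sketch of this bookkeeping, maintaining per-pair dot products and per-item norms so a new rating touches only the item pairs it participates in; like the pseudocode above, it treats the user mean as fixed between full rebuilds, which is an approximation.

```python
import math
from collections import defaultdict

class IncrementalItemSimilarity:
    """Maintain adjusted-cosine sufficient statistics for incremental updates."""

    def __init__(self):
        self.dot = defaultdict(float)           # (i, j) -> sum of dev_i * dev_j
        self.norm_sq = defaultdict(float)       # i -> sum of dev_i^2
        self.user_ratings = defaultdict(dict)   # u -> {item: rating}
        self.user_mean = {}                     # u -> mean rating (refreshed in batch)

    def add_rating(self, user, item_i, rating):
        mean = self.user_mean.get(user, rating)  # fall back to the rating itself
        dev_i = rating - mean

        # Update pairwise statistics against every item this user already rated
        for item_j, r_j in self.user_ratings[user].items():
            dev_j = r_j - mean
            key = (min(item_i, item_j), max(item_i, item_j))
            self.dot[key] += dev_i * dev_j

        self.norm_sq[item_i] += dev_i ** 2
        self.user_ratings[user][item_i] = rating

    def similarity(self, i, j):
        key = (min(i, j), max(i, j))
        denom = math.sqrt(self.norm_sq[i]) * math.sqrt(self.norm_sq[j])
        return self.dot[key] / denom if denom > 0 else 0.0

# Usage
sim = IncrementalItemSimilarity()
sim.user_mean["u1"] = 3.0
sim.add_rating("u1", "A", 5.0)
sim.add_rating("u1", "B", 4.0)
print(sim.similarity("A", "B"))
```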
Stream Processing Architecture:
User Actions (Kafka) → Stream Processor (Flink) → Updates (Redis)
↓
Aggregate to batch updates
↓
Batch processor (hourly, Spark)
↓
Full model refresh
Real-time Update Strategies:
Additive updates: For embeddings, add weighted recent activity:
user_embedding = α · old_embedding + (1-α) · recent_activity_embedding
Session-based boosting: Boost items related to current session without retraining
Bandit overlays: Use multi-armed bandits to explore new items, update exploitation scores with each interaction
More frequent updates mean fresher recommendations but also more noise (a single accidental click shouldn't reshape recommendations). Most systems update similarities daily but adjust user preference signals more frequently. The right balance depends on how fast preferences change in your domain.
Scalable systems require scalable monitoring. Here's what to track for production CF systems:
Latency Metrics:
| Metric | Target | Alert Threshold |
|---|---|---|
| p50 prediction latency | <5ms | >20ms |
| p99 prediction latency | <50ms | >200ms |
| p99 recommendation list | <100ms | >500ms |
| Cache hit rate | >95% | <80% |
| ANN recall | >95% | <90% |
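A tiny sketch of how these latency targets might be checked from a window of logged prediction latencies; the alert thresholds mirror the table above, and the print-based alerting is a placeholder for a real pager or dashboard.

```python
import numpy as np

def check_latency_slo(latencies_ms, p50_alert=20.0, p99_alert=200.0):
    """Compute p50/p99 over a window of latencies (ms) and flag SLO violations."""
    p50 = np.percentile(latencies_ms, 50)
    p99 = np.percentile(latencies_ms, 99)
    if p50 > p50_alert:
        print(f"ALERT: p50 latency {p50:.1f}ms exceeds {p50_alert}ms threshold")
    if p99 > p99_alert:
        print(f"ALERT: p99 latency {p99:.1f}ms exceeds {p99_alert}ms threshold")
    return p50, p99

# Example: simulated latency samples with a heavy tail
samples = np.random.lognormal(mean=1.0, sigma=0.8, size=10_000)
print(check_latency_slo(samples))
```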
Quality Metrics:
| Metric | What It Measures | Computation |
|---|---|---|
| RMSE/MAE | Rating prediction accuracy | Offline on test set |
| Precision@k | Relevant items in top-k | Requires ground truth |
| Coverage | Fraction of items recommended | Track over time window |
| Diversity | Intra-list similarity | 1 - avg similarity in lists |
| Novelty | Avg popularity of recommendations | Lower = more novel |
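A short sketch computing coverage, intra-list diversity, and the popularity-based novelty proxy from the table above; `rec_lists`, `item_embeddings`, and `item_popularity` are assumed inputs with illustrative formats.

```python
import numpy as np

def recommendation_quality(rec_lists, item_embeddings, item_popularity, catalog_size):
    """rec_lists: list of item-id lists; item_embeddings: id -> normalized vector;
    item_popularity: id -> fraction of users who interacted with the item."""
    # Coverage: fraction of the catalog appearing in at least one list
    recommended = {i for recs in rec_lists for i in recs}
    coverage = len(recommended) / catalog_size

    # Diversity: 1 - average pairwise cosine similarity within each list
    diversities = []
    for recs in rec_lists:
        vecs = np.stack([item_embeddings[i] for i in recs])
        sims = vecs @ vecs.T
        n = len(recs)
        off_diag = (sims.sum() - np.trace(sims)) / (n * (n - 1))
        diversities.append(1.0 - off_diag)

    # Novelty proxy from the table: average popularity of recommended items
    # (lower = more novel)
    avg_popularity = float(np.mean(
        [item_popularity[i] for recs in rec_lists for i in recs]
    ))

    return coverage, float(np.mean(diversities)), avg_popularity

# Tiny synthetic usage
emb = {i: v / np.linalg.norm(v) for i, v in enumerate(np.random.randn(100, 16))}
pop = {i: (i + 1) / 200 for i in range(100)}
print(recommendation_quality([[1, 5, 9], [2, 5, 40]], emb, pop, catalog_size=100))
```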
Always A/B test algorithm changes. Offline metrics (RMSE) often don't correlate with online metrics (click-through rate, conversions). Run tests for at least 1-2 weeks to account for day-of-week effects. Use proper statistical significance testing—at scale, even tiny differences appear 'significant' with enough data.
Module Complete:
You have now completed the Collaborative Filtering module, from foundational concepts and neighborhood methods through the scaling and production techniques covered on this page.
Collaborative filtering remains a cornerstone of modern recommendation systems. While deep learning approaches are increasingly popular, CF methods provide interpretable baselines, handle cold-start through hybrid approaches, and—when properly optimized—scale to hundreds of millions of users.
You have mastered collaborative filtering—from foundational concepts through production-scale deployment. This knowledge forms the basis for understanding matrix factorization, neural collaborative filtering, and hybrid recommendation systems covered in subsequent modules.