Your recommendation model achieves impressive offline metrics. It ranks items accurately, captures user preferences with nuance, and generalizes well to held-out data. Now comes the real challenge: serving those recommendations to 100 million users, across 50 million items, in under 100 milliseconds, thousands of times per second.
This is the scalability problem, and it fundamentally changes how you think about recommendation systems. Algorithms that work brilliantly in a Jupyter notebook can be completely impractical at production scale: a nearest-neighbor search that takes 500ms per query demands 500 seconds of compute for every single second of traffic at 1,000 requests per second.
Scalability isn't an afterthought; it's a first-class design constraint that shapes every architectural decision. The companies that dominate their markets—Netflix, Amazon, Spotify, TikTok—have all built sophisticated infrastructure specifically to serve recommendations at massive scale.
By the end of this page, you will understand the architectural patterns for scalable recommendation systems, master approximate nearest neighbor techniques for similarity search, learn distributed computing strategies for training and serving, and design two-stage retrieval-ranking pipelines that balance quality with latency.
Let's quantify what "scale" means for recommendation systems with concrete numbers:
The Netflix Scale:
The Amazon Scale:
The TikTok Scale:
| Operation | Naive Complexity | At 100M Users × 10M Items | Practical? |
|---|---|---|---|
| User-Item Matrix Storage | O(U × I) | 10^15 entries (1 petabyte) | ❌ Impossible |
| Brute-force k-NN | O(I) per query | 10^7 comparisons per request | ❌ Too slow |
| Full Matrix Factorization | O(U × I × K) | 10^18 operations | ❌ Days of compute |
| Sparse Matrix (1% density) | O(nnz) | 10^13 entries | ⚠️ Still massive |
| Embedding Lookup | O(1) | Constant time | ✅ Feasible |
The Fundamental Tension:
Recommendation quality generally improves with:
- Scoring more candidate items per request
- Larger, more expressive models and richer features
- Fresher data and more frequent updates
But latency and cost improve with:
- Scoring fewer candidates
- Smaller, simpler models and cheaper features
- Aggressive caching and precomputation
The art of scalable recommendations is navigating this tension—achieving maximum quality within strict latency and cost budgets. This requires architectural innovation, not just algorithmic optimization.
Every 100ms of latency costs Amazon 1% of sales. Netflix found that a 1-second delay in video recommendations significantly impacts engagement. Latency isn't just a technical metric—it directly impacts business outcomes. This is why scalability is a first-class concern.
The most important architectural pattern for scalable recommendations is the two-stage retrieval-ranking pipeline. This pattern is used by virtually every large-scale recommendation system.
Stage 1: Candidate Generation (Retrieval)
Purpose: Quickly narrow millions of items to hundreds of candidates
Stage 2: Ranking
Purpose: Precisely score and order the candidates
Why Two Stages?
The key insight is that ranking all items is computationally infeasible, but ranking a small subset is tractable.
| Approach | Items Scored | Per-Item Cost | Total Cost |
|---|---|---|---|
| Score all items | 10,000,000 | 10ms | 100,000 seconds |
| Two-stage (1000 candidates) | 1,000 | 10ms | 10 seconds |
| Two-stage with batching | 1,000 | 0.1ms | 100ms |
The two-stage approach reduces computation by 10,000x while maintaining quality—assuming the retrieval stage has high recall for relevant items.
Multiple Retrieval Sources:
Production systems often combine multiple retrieval methods:
- Embedding-based ANN search over learned user/item representations
- Collaborative signals (items liked by similar users)
- Content-based matches on item attributes
- Popularity and trending items as a fallback
Each source contributes candidates, which are deduplicated and sent to ranking.
```python
from abc import ABC, abstractmethod
from typing import List, Dict, Tuple
from dataclasses import dataclass
import numpy as np
from concurrent.futures import (
    ThreadPoolExecutor,
    as_completed,
    TimeoutError as FuturesTimeout,
)
import time


@dataclass
class Item:
    """Represents an item that can be recommended."""
    item_id: str
    embedding: np.ndarray = None
    features: Dict = None


@dataclass
class UserContext:
    """User context for personalization."""
    user_id: str
    user_embedding: np.ndarray
    history: List[str]
    context_features: Dict


@dataclass
class ScoredItem:
    """Item with retrieval and ranking scores."""
    item: Item
    retrieval_score: float
    retrieval_source: str
    ranking_score: float = None


class CandidateGenerator(ABC):
    """Abstract base class for candidate generation sources."""

    @abstractmethod
    def retrieve(self, user_context: UserContext, k: int) -> List[Tuple[str, float]]:
        """Return (item_id, score) tuples."""
        pass

    @property
    @abstractmethod
    def source_name(self) -> str:
        pass


class EmbeddingRetriever(CandidateGenerator):
    """ANN-based retrieval using learned embeddings."""

    def __init__(self, ann_index, item_embeddings: Dict[str, np.ndarray]):
        self.ann_index = ann_index  # FAISS, ScaNN, etc.
        self.item_embeddings = item_embeddings
        # Map ANN index positions back to item IDs (assumes the index was
        # built by adding embeddings in this dict's iteration order).
        self.index_to_item_id = list(item_embeddings.keys())

    def retrieve(self, user_context: UserContext, k: int) -> List[Tuple[str, float]]:
        # Query the ANN index with the user embedding
        distances, indices = self.ann_index.search(
            user_context.user_embedding.reshape(1, -1), k
        )

        # Convert to (item_id, score) tuples
        results = []
        for dist, idx in zip(distances[0], indices[0]):
            if idx >= 0:  # Valid index
                item_id = self.index_to_item_id[idx]
                score = 1.0 / (1.0 + dist)  # Convert distance to similarity
                results.append((item_id, score))
        return results

    @property
    def source_name(self) -> str:
        return "embedding_ann"


class CollaborativeRetriever(CandidateGenerator):
    """Retrieve items liked by similar users."""

    def __init__(self, user_item_matrix, user_similarity_index):
        self.user_item_matrix = user_item_matrix
        self.user_similarity_index = user_similarity_index

    def retrieve(self, user_context: UserContext, k: int) -> List[Tuple[str, float]]:
        # Find similar users
        similar_users = self.user_similarity_index.get_neighbors(
            user_context.user_id, k=50
        )

        # Aggregate items from similar users, skipping items already seen
        item_scores = {}
        for sim_user, similarity in similar_users:
            user_items = self.user_item_matrix.get_user_items(sim_user)
            for item_id, rating in user_items:
                if item_id not in user_context.history:
                    if item_id not in item_scores:
                        item_scores[item_id] = 0
                    item_scores[item_id] += similarity * rating

        # Return top-k
        sorted_items = sorted(
            item_scores.items(), key=lambda x: x[1], reverse=True
        )[:k]
        return sorted_items

    @property
    def source_name(self) -> str:
        return "collaborative"


class TwoStagePipeline:
    """
    Production two-stage recommendation pipeline.

    Combines multiple retrieval sources and applies a ranking model.
    """

    def __init__(
        self,
        retrievers: List[CandidateGenerator],
        ranker,                                 # Ranking model
        retrieval_k: int = 200,                 # Candidates per source
        final_k: int = 20,                      # Final recommendations
        max_retrieval_latency_ms: float = 50,
        max_ranking_latency_ms: float = 100,
    ):
        self.retrievers = retrievers
        self.ranker = ranker
        self.retrieval_k = retrieval_k
        self.final_k = final_k
        self.max_retrieval_latency_ms = max_retrieval_latency_ms
        self.max_ranking_latency_ms = max_ranking_latency_ms

        # Thread pool for parallel retrieval
        self.executor = ThreadPoolExecutor(max_workers=len(retrievers))

    def recommend(self, user_context: UserContext) -> List[ScoredItem]:
        """
        Generate recommendations using the two-stage pipeline.

        Returns:
            List of ScoredItem with final ranking scores
        """
        start_time = time.time()

        # STAGE 1: Parallel candidate generation
        candidates = self._retrieve_candidates(user_context)
        retrieval_latency = (time.time() - start_time) * 1000
        if retrieval_latency > self.max_retrieval_latency_ms:
            print(f"⚠️ Retrieval latency {retrieval_latency:.1f}ms exceeds budget")

        # STAGE 2: Ranking
        ranking_start = time.time()
        ranked_items = self._rank_candidates(user_context, candidates)
        ranking_latency = (time.time() - ranking_start) * 1000
        if ranking_latency > self.max_ranking_latency_ms:
            print(f"⚠️ Ranking latency {ranking_latency:.1f}ms exceeds budget")

        total_latency = (time.time() - start_time) * 1000
        print(f"Pipeline: {len(candidates)} candidates → {len(ranked_items)} results")
        print(f"Latency: retrieval={retrieval_latency:.1f}ms, "
              f"ranking={ranking_latency:.1f}ms, total={total_latency:.1f}ms")

        return ranked_items[:self.final_k]

    def _retrieve_candidates(self, user_context: UserContext) -> List[ScoredItem]:
        """Run all retrievers in parallel and merge results."""
        # Submit all retrievers in parallel
        futures = {}
        for retriever in self.retrievers:
            future = self.executor.submit(
                retriever.retrieve, user_context, self.retrieval_k
            )
            futures[future] = retriever.source_name

        # Collect results within the retrieval latency budget
        all_candidates = {}  # item_id -> ScoredItem
        timeout = self.max_retrieval_latency_ms / 1000
        try:
            for future in as_completed(futures, timeout=timeout):
                source = futures[future]
                try:
                    results = future.result()
                    for item_id, score in results:
                        if item_id not in all_candidates:
                            all_candidates[item_id] = ScoredItem(
                                item=Item(item_id=item_id),
                                retrieval_score=score,
                                retrieval_source=source,
                            )
                        else:
                            # Item returned by multiple sources - keep the max score
                            existing = all_candidates[item_id]
                            if score > existing.retrieval_score:
                                existing.retrieval_score = score
                                existing.retrieval_source = source
                except Exception as e:
                    print(f"Retriever {source} failed: {e}")
        except FuturesTimeout:
            # Budget exhausted - proceed with whatever finished in time
            print("⚠️ Retrieval timeout: using partial candidate set")

        return list(all_candidates.values())

    def _rank_candidates(
        self, user_context: UserContext, candidates: List[ScoredItem]
    ) -> List[ScoredItem]:
        """Score and rank candidates using the ranking model."""
        if not candidates:
            return []

        # Batch scoring for efficiency
        item_ids = [c.item.item_id for c in candidates]
        scores = self.ranker.score_batch(user_context, item_ids)

        for candidate, score in zip(candidates, scores):
            candidate.ranking_score = score

        # Sort by ranking score
        candidates.sort(key=lambda x: x.ranking_score, reverse=True)
        return candidates
```

The retrieval stage often relies on embedding-based similarity search: given a user embedding, find the most similar item embeddings. With millions of items, exact nearest neighbor search is too slow. The solution is Approximate Nearest Neighbor (ANN) algorithms.
The ANN Trade-off:
ANN algorithms trade exactness for speed. Instead of guaranteeing the true k-nearest neighbors, they find approximately the k-nearest neighbors with high probability.
$$\text{Recall@k} = \frac{|\text{ANN Results} \cap \text{True k-NN}|}{k}$$
Typical production systems achieve 95-99% recall while being 100-1000x faster than exact search.
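To make this metric concrete, here is a tiny sketch that computes recall@k for a batch of ANN results against the exact nearest neighbors (the arrays are made-up toy data, not from the text):

```python
import numpy as np

def recall_at_k(ann_ids: np.ndarray, true_ids: np.ndarray, k: int) -> float:
    """Fraction of the true k nearest neighbors that the ANN search returned."""
    hits = sum(len(set(a[:k]) & set(t[:k])) for a, t in zip(ann_ids, true_ids))
    return hits / (len(ann_ids) * k)

# Toy example: 2 queries, k=3; the ANN misses one true neighbor on the first query
ann_results = np.array([[1, 2, 9], [4, 5, 6]])
true_knn = np.array([[1, 2, 3], [4, 5, 6]])
print(recall_at_k(ann_results, true_knn, k=3))  # (2 + 3) / 6 ≈ 0.83
```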
| Algorithm | Index Type | Search Time | Memory | Best For |
|---|---|---|---|---|
| Brute Force | None | O(N × d) | O(N × d) | < 10K items |
| IVF (Inverted File) | Clustering | O(√N × d) | O(N × d) | 1M-100M items |
| HNSW (Hierarchical NSW) | Graph | O(log N × d) | O(N × d × M) | High recall needs |
| LSH (Locality Sensitive) | Hash tables | O(L × K) | O(N × L × K) | Very high-dim |
| Product Quantization | Compression | O(N/C × d) | O(N × d/C) | Memory constrained |
| ScaNN | Hybrid | O(√N × d) | O(N × d/C) | Google-scale |
Inverted File Index (IVF)
Partitions the item embeddings into C clusters (typically with k-means); at query time, only the nprobe clusters whose centroids are closest to the query are searched exhaustively, so roughly:
$$\text{Speedup} \approx \frac{C}{\text{nprobe}}$$
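A minimal sketch of this idea with FAISS's `IndexIVFFlat` (the dataset size, dimension, and parameter values below are illustrative assumptions, not taken from the text):

```python
import numpy as np
import faiss

d, n = 64, 100_000
item_vectors = np.random.rand(n, d).astype(np.float32)  # stand-in item embeddings

n_clusters = 1024                                 # C: number of coarse clusters
quantizer = faiss.IndexFlatL2(d)                  # assigns vectors to clusters
index = faiss.IndexIVFFlat(quantizer, d, n_clusters)
index.train(item_vectors)                         # learn cluster centroids (k-means)
index.add(item_vectors)

index.nprobe = 8                                  # search only 8 of the 1024 clusters
distances, ids = index.search(item_vectors[:5], 10)  # ≈ C / nprobe = 128x fewer comparisons
```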
HNSW (Hierarchical Navigable Small World)
Builds a multi-layer graph where:
- The top layers contain few nodes with long-range links for coarse navigation
- The bottom layer contains every item, linked to its near neighbors
- Search starts at the top layer and greedily descends toward the query
HNSW typically achieves the highest recall for a given latency but uses more memory due to graph structure.
Product Quantization (PQ)
Compresses embeddings by:
- Splitting each vector into m subvectors
- Learning a small codebook (typically 256 centroids) for each subspace via k-means
- Storing each subvector as its 1-byte codebook index instead of raw floats
Reduces memory by 4-32x while maintaining reasonable recall. For example, a 128-dimensional float32 vector (512 bytes) encoded as 16 one-byte subvector codes takes just 16 bytes, a 32x reduction (ignoring the small shared codebooks).
Hybrid Approaches (ScaNN, FAISS):
Modern libraries combine multiple techniques:
- Coarse partitioning (IVF) to prune the search space
- Quantization (PQ) to keep the full index in memory
- Exact re-ranking of a short list of surviving candidates
```python
import numpy as np
import faiss
from typing import Tuple, List
import time


class ANNIndex:
    """
    Production ANN index using FAISS with IVF-PQ.

    Combines clustering (IVF) with compression (PQ) for
    memory-efficient, fast approximate search.
    """

    def __init__(
        self,
        dimension: int,
        n_items: int,
        n_clusters: int = None,
        pq_subvectors: int = 8,
        pq_bits: int = 8,
        use_gpu: bool = False,
    ):
        self.dimension = dimension
        self.n_items = n_items

        # Auto-configure clusters based on dataset size
        if n_clusters is None:
            n_clusters = int(4 * np.sqrt(n_items))
        self.n_clusters = n_clusters

        # Build index: IVF for clustering, PQ for compression
        quantizer = faiss.IndexFlatL2(dimension)
        self.index = faiss.IndexIVFPQ(
            quantizer, dimension, n_clusters, pq_subvectors, pq_bits
        )

        if use_gpu:
            # Move to GPU for faster training/search
            res = faiss.StandardGpuResources()
            self.index = faiss.index_cpu_to_gpu(res, 0, self.index)

        self.is_trained = False
        self.item_ids: List[str] = []

    def train_and_add(
        self,
        embeddings: np.ndarray,
        item_ids: List[str],
        train_sample_size: int = 100000,
    ):
        """
        Train the index and add all embeddings.

        Training learns the cluster centroids and PQ codebooks.
        """
        n = len(embeddings)

        # Sample for training if the dataset is large
        if n > train_sample_size:
            train_indices = np.random.choice(n, train_sample_size, replace=False)
            train_data = embeddings[train_indices]
        else:
            train_data = embeddings

        # Train quantizer
        print(f"Training on {len(train_data)} samples...")
        start = time.time()
        self.index.train(train_data.astype(np.float32))
        print(f"Training took {time.time() - start:.1f}s")

        # Add all embeddings
        print(f"Adding {n} embeddings...")
        start = time.time()
        self.index.add(embeddings.astype(np.float32))
        print(f"Adding took {time.time() - start:.1f}s")

        self.item_ids = item_ids
        self.is_trained = True

    def search(
        self,
        query: np.ndarray,
        k: int,
        nprobe: int = 32,
    ) -> Tuple[List[List[str]], np.ndarray]:
        """
        Search for the k nearest neighbors.

        Args:
            query: Query embedding (d,) or (n_queries, d)
            k: Number of neighbors
            nprobe: Number of clusters to search (higher = slower but better recall)

        Returns:
            (item_ids, distances)
        """
        if not self.is_trained:
            raise RuntimeError("Index not trained. Call train_and_add first.")

        # Set search parameters
        self.index.nprobe = nprobe

        # Reshape query if needed
        if query.ndim == 1:
            query = query.reshape(1, -1)

        # Search
        distances, indices = self.index.search(query.astype(np.float32), k)

        # Convert indices to item IDs (FAISS returns -1 for missing neighbors)
        result_ids = []
        for idx_row in indices:
            row_ids = [
                self.item_ids[i] if 0 <= i < len(self.item_ids) else None
                for i in idx_row
            ]
            result_ids.append(row_ids)

        return result_ids, distances

    def benchmark(
        self,
        queries: np.ndarray,
        ground_truth: np.ndarray,
        k: int,
        nprobe_values: List[int] = [1, 4, 16, 64, 256],
    ) -> dict:
        """
        Benchmark the recall vs latency trade-off.

        Args:
            queries: Test query embeddings
            ground_truth: True k-NN indices for each query
            k: Number of neighbors to retrieve
            nprobe_values: Different nprobe settings to test
        """
        results = []
        for nprobe in nprobe_values:
            self.index.nprobe = nprobe

            # Measure latency
            start = time.time()
            distances, indices = self.index.search(queries.astype(np.float32), k)
            elapsed = time.time() - start
            latency_ms = (elapsed / len(queries)) * 1000

            # Compute recall
            recall = self._compute_recall(indices, ground_truth, k)

            results.append({
                'nprobe': nprobe,
                'recall': recall,
                'latency_ms': latency_ms,
                'qps': len(queries) / elapsed,
            })
            print(f"nprobe={nprobe}: recall={recall:.3f}, latency={latency_ms:.2f}ms")

        return results

    def _compute_recall(
        self,
        predictions: np.ndarray,
        ground_truth: np.ndarray,
        k: int,
    ) -> float:
        """Compute recall@k."""
        n_queries = len(predictions)
        hits = 0
        for pred, truth in zip(predictions, ground_truth):
            pred_set = set(pred[:k])
            truth_set = set(truth[:k])
            hits += len(pred_set & truth_set)
        return hits / (n_queries * k)


def create_hnsw_index(
    dimension: int,
    M: int = 32,
    ef_construction: int = 200,
) -> faiss.IndexHNSWFlat:
    """
    Create an HNSW index for the highest-recall scenarios.

    Args:
        dimension: Embedding dimension
        M: Number of connections per layer (higher = better recall, more memory)
        ef_construction: Search depth during construction
    """
    index = faiss.IndexHNSWFlat(dimension, M)
    index.hnsw.efConstruction = ef_construction
    return index
```

At massive scale, both training and serving must be distributed across many machines. This introduces new challenges around data partitioning, synchronization, and fault tolerance.
Distributed Training Strategies:
1. Data Parallelism
The most common approach:
- Each of N workers holds a complete copy of the model
- Each worker processes a different shard of the training data
- Gradients are averaged (all-reduced) across workers after every step (a minimal sketch follows the formula below)
$$\text{Effective Batch Size} = N \times \text{Local Batch Size}$$
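As a rough sketch of what data parallelism looks like in code, here is a training worker using PyTorch's `DistributedDataParallel`; the two-tower interface `model(users, items)`, a dataset yielding `(users, items, labels)` batches, and launching via `torchrun` are all assumptions for illustration:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def train_worker(model, dataset, epochs: int = 1):
    dist.init_process_group(backend="nccl")        # rank/world size come from the launcher
    local_rank = dist.get_rank() % torch.cuda.device_count()
    model = DDP(model.cuda(local_rank), device_ids=[local_rank])

    sampler = DistributedSampler(dataset)          # each worker sees a disjoint data shard
    loader = DataLoader(dataset, batch_size=256, sampler=sampler)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)                   # reshuffle shards every epoch
        for users, items, labels in loader:
            optimizer.zero_grad()
            logits = model(users.cuda(local_rank), items.cuda(local_rank))
            loss = torch.nn.functional.binary_cross_entropy_with_logits(
                logits, labels.float().cuda(local_rank)
            )
            loss.backward()                        # DDP averages gradients across workers here
            optimizer.step()
```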
2. Model Parallelism
For models too large for a single GPU:
- The model's layers (or large parameter tensors) are split across devices
- Activations, rather than the full model, move between devices during the forward and backward passes
- Throughput is bounded by inter-device bandwidth, so the split points matter (see the toy sketch below)
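A toy sketch of the idea, assuming two GPUs and a simple MLP-style ranker (production systems would use a pipeline- or tensor-parallel framework rather than hand-placing layers):

```python
import torch
import torch.nn as nn

class TwoDeviceRanker(nn.Module):
    """Toy model-parallel ranker: lower layers live on cuda:0, upper layers on cuda:1."""
    def __init__(self, d_in: int = 1024, d_hidden: int = 4096):
        super().__init__()
        self.lower = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU()).to("cuda:0")
        self.upper = nn.Linear(d_hidden, 1).to("cuda:1")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.lower(x.to("cuda:0"))
        return self.upper(h.to("cuda:1"))  # only activations cross devices, never the full model
```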
3. Embedding Parallelism
Critical for recommendation models with huge embedding tables:
- User and item embedding tables can reach tens or hundreds of gigabytes, far larger than the dense layers
- Tables are sharded row-wise across workers, each holding only a slice of the IDs
- Lookups are routed to the worker that owns the row and results are gathered back, while the dense layers typically remain data-parallel (a minimal sketch follows)
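Below is a minimal, single-process sketch of row-wise sharding; in a real system each shard would live on a separate parameter server or GPU, and the hashing scheme and lazy initialization here are illustrative assumptions:

```python
import numpy as np

class ShardedEmbeddingTable:
    """Toy embedding-parallel table: rows are assigned to shards by hashing the ID."""

    def __init__(self, n_shards: int, dim: int):
        self.n_shards = n_shards
        self.dim = dim
        # Each shard holds only its own slice of the table.
        self.shards = [dict() for _ in range(n_shards)]

    def _shard_of(self, item_id: str) -> int:
        # A real system would use a stable hash so routing survives restarts
        return hash(item_id) % self.n_shards

    def lookup(self, item_id: str) -> np.ndarray:
        shard = self.shards[self._shard_of(item_id)]
        if item_id not in shard:               # lazily initialize unseen rows
            shard[item_id] = np.random.normal(0, 0.01, self.dim).astype(np.float32)
        return shard[item_id]

# Usage: each lookup is routed to the shard that owns the row
table = ShardedEmbeddingTable(n_shards=8, dim=64)
vec = table.lookup("item_12345")
```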
```python
from typing import List, Dict, Optional
from dataclasses import dataclass
import asyncio
import aiohttp
import bisect
import hashlib
import json


@dataclass
class ShardConfig:
    """Configuration for a retrieval shard."""
    shard_id: int
    host: str
    port: int
    item_range: tuple  # (start_idx, end_idx)


class DistributedRetrievalClient:
    """
    Client for distributed retrieval across a sharded index.

    Broadcasts queries to all shards and merges results.
    """

    def __init__(
        self,
        shard_configs: List[ShardConfig],
        timeout_ms: float = 50,
        max_retries: int = 2,
    ):
        self.shards = shard_configs
        self.timeout = timeout_ms / 1000
        self.max_retries = max_retries
        self.session: Optional[aiohttp.ClientSession] = None

    async def initialize(self):
        """Create an HTTP session for async requests."""
        timeout = aiohttp.ClientTimeout(total=self.timeout)
        self.session = aiohttp.ClientSession(timeout=timeout)

    async def retrieve(
        self,
        user_embedding: List[float],
        k_per_shard: int = 100,
        total_k: int = 500,
    ) -> List[Dict]:
        """
        Retrieve candidates from all shards in parallel.

        Args:
            user_embedding: User's embedding vector
            k_per_shard: Candidates to retrieve per shard
            total_k: Total candidates to return after merging

        Returns:
            Merged and sorted candidate list
        """
        # Fan-out: query all shards in parallel
        tasks = [
            self._query_shard(shard, user_embedding, k_per_shard)
            for shard in self.shards
        ]

        # Gather results with timeout handling
        results = await asyncio.gather(*tasks, return_exceptions=True)

        # Fan-in: merge results from all shards
        all_candidates = []
        for shard, result in zip(self.shards, results):
            if isinstance(result, Exception):
                print(f"Shard {shard.shard_id} failed: {result}")
                continue
            all_candidates.extend(result)

        # Sort by score and take the top-k
        all_candidates.sort(key=lambda x: x['score'], reverse=True)
        return all_candidates[:total_k]

    async def _query_shard(
        self,
        shard: ShardConfig,
        embedding: List[float],
        k: int,
    ) -> List[Dict]:
        """Query a single shard with retry logic."""
        url = f"http://{shard.host}:{shard.port}/retrieve"
        payload = {"embedding": embedding, "k": k}

        for attempt in range(self.max_retries):
            try:
                async with self.session.post(url, json=payload) as resp:
                    if resp.status == 200:
                        return await resp.json()
            except asyncio.TimeoutError:
                if attempt < self.max_retries - 1:
                    await asyncio.sleep(0.01)  # Brief backoff

        return []  # Return empty on failure

    async def close(self):
        if self.session:
            await self.session.close()


class ConsistentHashRouter:
    """
    Consistent hashing for user-to-shard routing.

    Ensures the same user always hits the same cache/shard for consistency.
    """

    def __init__(self, shards: List[str], virtual_nodes: int = 100):
        self.ring = {}
        self.sorted_keys = []

        # Add virtual nodes for better distribution
        for shard in shards:
            for i in range(virtual_nodes):
                key = self._hash(f"{shard}:{i}")
                self.ring[key] = shard
                self.sorted_keys.append(key)
        self.sorted_keys.sort()

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_shard(self, user_id: str) -> str:
        """Get the shard for a user ID."""
        if not self.ring:
            return None
        key = self._hash(user_id)
        # Binary search for the first node >= key, wrapping around the ring
        pos = bisect.bisect_left(self.sorted_keys, key)
        if pos == len(self.sorted_keys):
            pos = 0
        return self.ring[self.sorted_keys[pos]]


class CachingRecommendationService:
    """
    Caching layer for recommendation serving.

    Caches user recommendations to reduce compute.
    """

    def __init__(
        self,
        redis_client,
        retrieval_client: DistributedRetrievalClient,
        cache_ttl_seconds: int = 300,  # 5 minute cache
    ):
        self.redis = redis_client
        self.retrieval = retrieval_client
        self.cache_ttl = cache_ttl_seconds

        # Metrics
        self.cache_hits = 0
        self.cache_misses = 0

    async def get_recommendations(
        self,
        user_id: str,
        user_embedding: List[float],
        k: int = 20,
    ) -> List[Dict]:
        """Get recommendations with caching."""
        cache_key = f"recs:{user_id}:{k}"

        # Try the cache first
        cached = await self.redis.get(cache_key)
        if cached:
            self.cache_hits += 1
            return json.loads(cached)

        self.cache_misses += 1

        # Cache miss - compute recommendations
        candidates = await self.retrieval.retrieve(
            user_embedding, k_per_shard=100, total_k=k * 5
        )

        # Apply ranking (simplified)
        recommendations = candidates[:k]

        # Cache the results
        await self.redis.setex(cache_key, self.cache_ttl, json.dumps(recommendations))

        return recommendations

    @property
    def cache_hit_rate(self) -> float:
        total = self.cache_hits + self.cache_misses
        return self.cache_hits / total if total > 0 else 0
```

We've covered the essential principles and techniques for building scalable recommendation systems. Let's consolidate the key takeaways:
- Scale changes the problem: anything that enumerates all users × items is infeasible at 10^15 interactions, so architecture matters as much as the model.
- The two-stage retrieval-ranking pipeline is the standard pattern: fast candidate generation over millions of items, precise ranking over hundreds.
- Approximate nearest neighbor search (IVF, HNSW, product quantization) trades a few points of recall for orders-of-magnitude speedups in embedding retrieval.
- Distributed serving relies on sharded indexes with parallel fan-out and fan-in, consistent hashing for routing, and caching to absorb repeated requests.
- Latency budgets are business constraints, not just engineering metrics, and every stage must fit within one.
You now understand how to architect recommendation systems for massive scale. Next, we'll explore real-time serving requirements—how to incorporate fresh signals and update recommendations in milliseconds.