In 1992, researchers at Xerox PARC faced a peculiar challenge: how could they help employees navigate an overwhelming flood of digital documents? Their solution—a system called Tapestry—introduced an idea that would reshape the internet: what if we could predict what you'd like based on what similar people liked?
This fundamental insight—that human preferences cluster in meaningful patterns—gave birth to collaborative filtering, arguably the most influential paradigm in recommendation systems. When Netflix suggests movies, Spotify creates personalized playlists, or Amazon recommends products, collaborative filtering algorithms often power these predictions.
User-based collaborative filtering (User-based CF) is the most intuitive form of this approach. Its core premise is elegantly simple: users who agreed in the past will agree in the future. If you and I both loved the same five movies and I loved a sixth you haven't seen, there's a good chance you'll love it too.
By the end of this page, you will understand the complete theoretical foundation of user-based collaborative filtering: the intuition behind user similarity, the mathematical formulations for computing recommendations, the algorithmic workflow from data to predictions, and the critical challenges that led to alternative approaches. You'll be equipped to implement, analyze, and critique user-based CF systems.
Before diving into mathematics, let's build deep intuition for why user-based CF works and when it should work.
The Homophily Principle:
Human preferences aren't random—they cluster around taste profiles. A person who loves indie rock probably also appreciates alternative music. Someone who enjoys hard science fiction likely appreciates space operas. These correlations exist because preferences emerge from personality traits, cultural background, life experiences, and social influences that naturally group people together.
User-based CF exploits this clustering. It doesn't need to understand why you like something—only who else likes similar things. This is both its greatest strength (domain-agnostic, no feature engineering required) and its fundamental limitation (no interpretability).
The Memory-based Paradigm:
User-based CF is a memory-based approach, meaning it stores all user-item interactions and computes predictions on-demand by directly consulting this data. There's no model training in the traditional sense—instead, the system "remembers" all historical behavior and reasons about it at query time.
| User | Movie A | Movie B | Movie C | Movie D | Movie E |
|---|---|---|---|---|---|
| User 1 (Alice) | 5 | 3 | 4 | ? | 1 |
| User 2 (Bob) | 4 | ? | 5 | 4 | 2 |
| User 3 (Carol) | ? | 1 | 2 | 4 | 5 |
| User 4 (Dave) | 5 | 4 | 4 | 3 | ? |
| User 5 (Eve) | 1 | 2 | ? | 5 | 4 |
The table above represents a user-item rating matrix R where R[u][i] is user u's rating for item i. The ? marks indicate missing ratings—the values we want to predict.
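For concreteness, here is one way to hold this toy matrix in code: a minimal sketch using NumPy, with `np.nan` standing in for the missing ratings (the layout and variable names are just for illustration).

```python
import numpy as np

# Rows: Alice, Bob, Carol, Dave, Eve; columns: Movies A-E. np.nan marks a missing rating.
R = np.array([
    [5,      3,      4,      np.nan, 1],
    [4,      np.nan, 5,      4,      2],
    [np.nan, 1,      2,      4,      5],
    [5,      4,      4,      3,      np.nan],
    [1,      2,      np.nan, 5,      4],
], dtype=float)

user_means = np.nanmean(R, axis=1)  # each user's average over the movies they actually rated
print(user_means)
```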
The Fundamental Question:
If we want to predict Alice's rating for Movie D, user-based CF asks: Which users rate movies similarly to Alice, and how did they rate Movie D?
Looking at the matrix:
- Bob and Dave rate the movies they share with Alice much as she does: all three give Movies A and C high marks, and Bob, like Alice, dislikes Movie E.
- Carol and Eve diverge sharply from Alice: both love Movie E, which Alice rated 1, and Eve dislikes Movie A, which Alice rated 5.
Intuitively, we'd weight Bob's and Dave's opinions more heavily than Carol's or Eve's when predicting Alice's rating for Movie D.
User-based CF formalizes the intuition that 'birds of a feather flock together.' Instead of understanding items or users deeply, it leverages the statistical regularity that similar users exhibit similar preferences. The entire system reduces to two questions: (1) How do we measure user similarity? (2) How do we aggregate similar users' ratings to make predictions?
Let's formalize user-based CF with precise mathematical notation. This framework underpins all practical implementations.
Notation:
- U is the set of all users and I the set of all items.
- R is the user-item rating matrix; R_ui (equivalently R[u][i]) is user u's rating of item i, and ∅ marks a missing rating.
- I_u is the set of items rated by user u, and I_uv = I_u ∩ I_v is the set of items co-rated by users u and v.
- r̄_u is the mean of user u's ratings, and σ_u their standard deviation.
- sim(u,v) is the similarity between users u and v.
- N_k(u,i) is the neighborhood: the k users most similar to u who have rated item i.
- r̂_ui is the predicted rating of user u for item i.
The Prediction Formula:
The predicted rating r̂<sub>ui</sub> for user u on item i is computed as a weighted average of ratings from similar users:
$$\hat{r}_{ui} = \bar{r}_u + \frac{\sum_{v \in N_k(u,i)} \text{sim}(u,v)\,(R_{vi} - \bar{r}_v)}{\sum_{v \in N_k(u,i)} |\text{sim}(u,v)|}$$
Let's unpack each component of this formula:
- r̄_u anchors the prediction at the target user's own average, so the neighbors only adjust it up or down.
- N_k(u,i) restricts the sum to the k most similar users who have actually rated item i.
- sim(u,v) weights each neighbor's contribution by how similar they are to u.
- (R_vi - r̄_v) expresses the neighbor's rating as a deviation from their own mean, removing individual rating bias.
- The denominator Σ |sim(u,v)| normalizes the weights so the prediction stays on the original rating scale.
Using (R_vi - r̄_v) instead of raw R_vi is crucial for handling rating bias. User A might rate all movies 4-5 stars, while User B rates 1-3 stars. If both rate a movie 1 star above their average, that's equivalent enthusiasm—the deviation captures this. This is called 'mean-centering' and dramatically improves prediction accuracy.
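To see why the deviation form matters, consider a hypothetical generous rater and a harsh rater: the raw ratings differ by a full two stars, yet the mean-centered deviations are identical. A small illustrative sketch:

```python
import numpy as np

# Hypothetical users with different personal scales rating the same three movies
generous = np.array([5.0, 4.0, 5.0])   # personal mean ~4.67
harsh    = np.array([3.0, 2.0, 3.0])   # personal mean ~2.67

print(generous - generous.mean())  # [ 0.33 -0.67  0.33]
print(harsh - harsh.mean())        # [ 0.33 -0.67  0.33] -> same preference pattern after mean-centering
```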
Alternative Prediction Formulas:
The mean-centered weighted average above is the most common, but alternatives exist:
Simple Weighted Average (no mean-centering):
$$\hat{r}_{ui} = \frac{\sum_{v \in N_k(u,i)} \text{sim}(u,v)\,R_{vi}}{\sum_{v \in N_k(u,i)} |\text{sim}(u,v)|}$$
Simpler but problematic when users have different rating scales.
Z-score Normalization:
$$\hat{r}_{ui} = \bar{r}_u + \sigma_u \cdot \frac{\sum_{v \in N_k(u,i)} \text{sim}(u,v)\,(R_{vi} - \bar{r}_v)/\sigma_v}{\sum_{v \in N_k(u,i)} |\text{sim}(u,v)|}$$
Accounts for both mean and variance differences between users. More robust but requires computing per-user standard deviations.
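The sketch below computes all three variants side by side for a single target item, assuming the neighbors' similarities, ratings, means, and standard deviations are already known; the helper name and tuple layout are purely illustrative.

```python
import numpy as np

def predict_variants(u_mean, u_std, neighbors):
    """neighbors: list of (similarity, rating, neighbor_mean, neighbor_std) tuples."""
    sims  = np.array([n[0] for n in neighbors])
    rates = np.array([n[1] for n in neighbors])
    means = np.array([n[2] for n in neighbors])
    stds  = np.array([n[3] for n in neighbors])
    norm  = np.abs(sims).sum()

    simple        = (sims * rates).sum() / norm
    mean_centered = u_mean + (sims * (rates - means)).sum() / norm
    z_score       = u_mean + u_std * (sims * (rates - means) / stds).sum() / norm
    return simple, mean_centered, z_score

# Two hypothetical neighbors who rated the target item
print(predict_variants(u_mean=3.5, u_std=1.0,
                       neighbors=[(0.9, 4.0, 3.0, 0.8), (0.4, 5.0, 4.5, 1.2)]))
```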
The choice of similarity function is perhaps the most critical design decision in user-based CF. It determines which users are considered "neighbors" and how their opinions are weighted.
The Co-rated Item Requirement:
We can only compare users based on items they've both rated. Let I<sub>uv</sub> = I<sub>u</sub> ∩ I<sub>v</sub> be the set of co-rated items. If |I<sub>uv</sub>| = 0, we cannot compute similarity—users with no overlap are incomparable.
Pearson Correlation Coefficient:
The most widely used similarity measure:
$$\text{sim}_{\text{pearson}}(u,v) = \frac{\sum_{i \in I_{uv}} (R_{ui} - \bar{r}_u)(R_{vi} - \bar{r}_v)}{\sqrt{\sum_{i \in I_{uv}} (R_{ui} - \bar{r}_u)^2} \cdot \sqrt{\sum_{i \in I_{uv}} (R_{vi} - \bar{r}_v)^2}}$$
Pearson correlation ranges from -1 (perfectly opposite preferences) to +1 (identical preferences), with 0 indicating no linear relationship.
Cosine Similarity:
Treats rating vectors as points in n-dimensional space and measures the angle between them:
$$\text{sim}_{\text{cosine}}(u,v) = \frac{\sum_{i \in I_{uv}} R_{ui} \cdot R_{vi}}{\sqrt{\sum_{i \in I_{uv}} R_{ui}^2} \cdot \sqrt{\sum_{i \in I_{uv}} R_{vi}^2}}$$
Ranges from 0 to 1 (for non-negative ratings). Doesn't account for rating bias directly.
Adjusted Cosine Similarity:
Modifies cosine similarity to account for rating bias by mean-centering:
$$\text{sim}_{\text{adj\_cosine}}(u,v) = \frac{\sum_{i \in I_{uv}} (R_{ui} - \bar{r}_u)(R_{vi} - \bar{r}_v)}{\sqrt{\sum_{i \in I_u} (R_{ui} - \bar{r}_u)^2} \cdot \sqrt{\sum_{i \in I_v} (R_{vi} - \bar{r}_v)^2}}$$
Note: The denominator sums over all items each user rated (not just co-rated), which differentiates it from Pearson.
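Here is a small sketch of Pearson correlation and raw cosine similarity restricted to co-rated items, operating on per-user rating dictionaries; the dict-based layout and function names are just for illustration.

```python
import numpy as np

def pearson(ratings_u: dict, ratings_v: dict) -> float:
    """Pearson correlation over co-rated items (0.0 if degenerate)."""
    common = ratings_u.keys() & ratings_v.keys()
    if len(common) < 2:
        return 0.0
    u = np.array([ratings_u[i] for i in common], dtype=float)
    v = np.array([ratings_v[i] for i in common], dtype=float)
    du = u - np.mean(list(ratings_u.values()))  # deviations from the user's overall mean
    dv = v - np.mean(list(ratings_v.values()))
    denom = np.sqrt((du ** 2).sum()) * np.sqrt((dv ** 2).sum())
    return float(du @ dv / denom) if denom else 0.0

def cosine(ratings_u: dict, ratings_v: dict) -> float:
    """Raw cosine similarity over co-rated items (ignores rating bias)."""
    common = ratings_u.keys() & ratings_v.keys()
    if not common:
        return 0.0
    u = np.array([ratings_u[i] for i in common], dtype=float)
    v = np.array([ratings_v[i] for i in common], dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

alice = {"A": 5, "B": 3, "C": 4, "E": 1}
bob   = {"A": 4, "C": 5, "D": 4, "E": 2}
print(pearson(alice, bob), cosine(alice, bob))
```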
| Measure | Range | Handles Bias? | Negative Similarity? | Best Use Case |
|---|---|---|---|---|
| Pearson | [-1, +1] | Yes (mean-centered) | Yes | Dense rating data with varied user scales |
| Cosine | [0, +1] | No | No | Sparse binary/implicit feedback |
| Adjusted Cosine | [-1, +1] | Yes | Yes | Item-based CF (discussed next page) |
| Euclidean | [0, ∞) (distance) | No | No (distance is inverted into a similarity) | When absolute rating differences matter |
| Jaccard | [0, +1] | N/A (binary) | No | Binary data (liked/not liked) |
In real systems, the rating matrix is extremely sparse (often <1% filled). Two users might share only 2-3 co-rated items, making similarity estimates unreliable. This is a fundamental challenge for user-based CF and motivates significance weighting, shrinkage penalties, and alternative approaches like matrix factorization.
Let's walk through the complete algorithmic workflow from raw data to recommendations.
Algorithm: User-based CF Prediction
```text
Algorithm: UserBasedCF_Predict(u, i, R, k)
# Input:  target user u, target item i, rating matrix R, neighborhood size k
# Output: predicted rating r̂_ui

1. Identify candidate neighbors:
   candidates = {v ∈ U : v ≠ u AND R[v][i] ≠ ∅}    # users who have rated item i

2. Compute similarities:
   For each v in candidates:
       sim[v] = PearsonCorrelation(u, v, R)

3. Select top-k neighborhood:
   N_k = top k users from candidates by |sim[v]|

4. Compute prediction:
   numerator   = Σ_{v ∈ N_k} sim[v] · (R[v][i] - mean(R[v]))
   denominator = Σ_{v ∈ N_k} |sim[v]|
   If denominator == 0:
       return mean(R[u])                            # fallback to user's mean
   return mean(R[u]) + numerator / denominator
```
```python
import numpy as np
from collections import defaultdict
from typing import Dict, List, Tuple, Optional


class UserBasedCF:
    """
    Production-quality User-based Collaborative Filtering implementation.

    Supports:
    - Pearson correlation and cosine similarity
    - Configurable neighborhood size
    - Significance weighting for sparse overlaps
    - Efficient similarity caching
    """

    def __init__(
        self,
        k_neighbors: int = 50,
        similarity_metric: str = "pearson",
        min_common_items: int = 3,
        significance_weight: bool = True,
        significance_threshold: int = 50
    ):
        self.k = k_neighbors
        self.similarity_metric = similarity_metric
        self.min_common_items = min_common_items
        self.use_significance_weight = significance_weight
        self.significance_threshold = significance_threshold

        # Data structures populated on fit()
        self.ratings: Dict[int, Dict[int, float]] = {}       # user -> {item: rating}
        self.user_means: Dict[int, float] = {}
        self.item_users: Dict[int, set] = defaultdict(set)   # item -> set of users
        self.similarity_cache: Dict[Tuple[int, int], float] = {}

    def fit(self, ratings_data: List[Tuple[int, int, float]]) -> 'UserBasedCF':
        """
        Fit the model with rating data.

        Args:
            ratings_data: List of (user_id, item_id, rating) tuples
        """
        # Build user-item rating dict
        for user_id, item_id, rating in ratings_data:
            if user_id not in self.ratings:
                self.ratings[user_id] = {}
            self.ratings[user_id][item_id] = rating
            self.item_users[item_id].add(user_id)

        # Compute user means
        for user_id, user_ratings in self.ratings.items():
            self.user_means[user_id] = np.mean(list(user_ratings.values()))

        return self

    def _pearson_correlation(self, user_u: int, user_v: int) -> Optional[float]:
        """Compute Pearson correlation between two users."""
        # Find co-rated items
        items_u = set(self.ratings[user_u].keys())
        items_v = set(self.ratings[user_v].keys())
        common_items = items_u & items_v

        if len(common_items) < self.min_common_items:
            return None

        mean_u = self.user_means[user_u]
        mean_v = self.user_means[user_v]

        # Compute correlation components
        numerator = 0.0
        sum_sq_u = 0.0
        sum_sq_v = 0.0

        for item in common_items:
            dev_u = self.ratings[user_u][item] - mean_u
            dev_v = self.ratings[user_v][item] - mean_v
            numerator += dev_u * dev_v
            sum_sq_u += dev_u ** 2
            sum_sq_v += dev_v ** 2

        denominator = np.sqrt(sum_sq_u) * np.sqrt(sum_sq_v)
        if denominator == 0:
            return 0.0

        correlation = numerator / denominator

        # Apply significance weighting
        if self.use_significance_weight:
            n = len(common_items)
            if n < self.significance_threshold:
                correlation *= (n / self.significance_threshold)

        return correlation

    def _get_similarity(self, user_u: int, user_v: int) -> Optional[float]:
        """Get similarity with caching."""
        cache_key = (min(user_u, user_v), max(user_u, user_v))
        if cache_key not in self.similarity_cache:
            if self.similarity_metric == "pearson":
                sim = self._pearson_correlation(user_u, user_v)
            else:
                raise ValueError(f"Unknown similarity metric: {self.similarity_metric}")
            self.similarity_cache[cache_key] = sim
        return self.similarity_cache[cache_key]

    def predict(self, user_id: int, item_id: int) -> float:
        """
        Predict rating for a user-item pair.

        Returns the predicted rating or user's mean if prediction is not possible.
        """
        if user_id not in self.ratings:
            return 3.0  # Global default for unknown users

        # Get users who rated this item
        candidate_neighbors = self.item_users.get(item_id, set()) - {user_id}
        if not candidate_neighbors:
            return self.user_means[user_id]

        # Compute similarities to all candidates
        similarities = []
        for neighbor_id in candidate_neighbors:
            sim = self._get_similarity(user_id, neighbor_id)
            if sim is not None:
                similarities.append((neighbor_id, sim))

        if not similarities:
            return self.user_means[user_id]

        # Select top-k neighbors by absolute similarity
        similarities.sort(key=lambda x: abs(x[1]), reverse=True)
        top_k = similarities[:self.k]

        # Compute weighted prediction
        numerator = 0.0
        denominator = 0.0
        for neighbor_id, sim in top_k:
            neighbor_rating = self.ratings[neighbor_id][item_id]
            neighbor_mean = self.user_means[neighbor_id]
            deviation = neighbor_rating - neighbor_mean
            numerator += sim * deviation
            denominator += abs(sim)

        if denominator == 0:
            return self.user_means[user_id]

        prediction = self.user_means[user_id] + (numerator / denominator)

        # Clamp to valid rating range
        return max(1.0, min(5.0, prediction))

    def recommend(
        self,
        user_id: int,
        n_recommendations: int = 10,
        exclude_rated: bool = True
    ) -> List[Tuple[int, float]]:
        """
        Generate top-N recommendations for a user.

        Returns list of (item_id, predicted_rating) tuples, sorted by predicted rating.
        """
        if user_id not in self.ratings:
            return []

        rated_items = set(self.ratings[user_id].keys()) if exclude_rated else set()
        candidate_items = set(self.item_users.keys()) - rated_items

        # Predict ratings for all candidates
        predictions = []
        for item_id in candidate_items:
            pred = self.predict(user_id, item_id)
            predictions.append((item_id, pred))

        # Sort by predicted rating descending
        predictions.sort(key=lambda x: x[1], reverse=True)
        return predictions[:n_recommendations]


# Example usage
if __name__ == "__main__":
    # Sample rating data: (user_id, item_id, rating)
    ratings = [
        (1, 1, 5), (1, 2, 3), (1, 3, 4), (1, 5, 1),
        (2, 1, 4), (2, 3, 5), (2, 4, 4), (2, 5, 2),
        (3, 2, 1), (3, 3, 2), (3, 4, 4), (3, 5, 5),
        (4, 1, 5), (4, 2, 4), (4, 3, 4), (4, 4, 3),
        (5, 1, 1), (5, 2, 2), (5, 4, 5), (5, 5, 4),
    ]

    cf = UserBasedCF(k_neighbors=3, min_common_items=2)
    cf.fit(ratings)

    # Predict Alice's (user 1) rating for item 4
    prediction = cf.predict(user_id=1, item_id=4)
    print(f"Predicted rating for User 1, Item 4: {prediction:.2f}")

    # Get recommendations for Alice
    recommendations = cf.recommend(user_id=1, n_recommendations=3)
    print(f"Top recommendations for User 1: {recommendations}")
```

The implementation includes significance weighting (shrinking correlations with few co-rated items), similarity caching (computing each pair only once), and rating clamping (ensuring predictions stay within valid bounds). These aren't textbook embellishments—they're essential for production systems.
The choice of how to form neighborhoods significantly impacts recommendation quality. Several strategies exist:
Fixed-size Neighborhoods (Top-k):
Select exactly the k most similar users. This is the most common approach: every prediction draws on the same number of neighbors, but the weakest of them may be only loosely similar to the target user.
Threshold-based Neighborhoods:
Include all users with similarity above a threshold τ. This guarantees a minimum neighbor quality, but unusual users may end up with very small or even empty neighborhoods.
Hybrid Approach:
Top-k from users exceeding threshold τ—combines benefits of both methods.
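As a sketch of how the three selection strategies differ, assuming a precomputed mapping of user to similarity (the function and variable names below are illustrative):

```python
def top_k(sims: dict, k: int) -> list:
    """Fixed-size: the k users with the largest |similarity|."""
    return sorted(sims, key=lambda v: abs(sims[v]), reverse=True)[:k]

def above_threshold(sims: dict, tau: float) -> list:
    """Threshold-based: every user whose similarity exceeds tau."""
    return [v for v, s in sims.items() if s > tau]

def hybrid(sims: dict, k: int, tau: float) -> list:
    """Hybrid: top-k drawn only from users who already exceed the threshold."""
    qualified = {v: s for v, s in sims.items() if s > tau}
    return top_k(qualified, k)

sims = {"bob": 0.92, "dave": 0.71, "carol": -0.85, "eve": 0.15}
print(top_k(sims, 2), above_threshold(sims, 0.5), hybrid(sims, 2, 0.5))
```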
| k Value | Coverage | Accuracy | Diversity | Compute Cost |
|---|---|---|---|---|
| Very small (5-10) | Low (many unpredictable items) | Variable | Low | Lowest |
| Small (20-30) | Moderate | Often optimal | Moderate | Low |
| Medium (50-100) | High | Good | Higher | Moderate |
| Large (200+) | Highest | May degrade (noise) | Highest | High |
Significance Weighting:
A crucial refinement addresses the reliability of similarity estimates. When two users share only 2-3 co-rated items, their computed correlation is statistically unreliable. Significance weighting shrinks similarity estimates based on sample size:
$$\text{sim}_{\text{weighted}}(u,v) = \frac{|I_{uv}|}{\max(|I_{uv}|, \text{threshold})} \cdot \text{sim}(u,v)$$
If users share 10 items and threshold is 50, similarity is weighted by 10/50 = 0.2, substantially reducing the influence of unreliable estimates.
Inverse User Frequency (IUF):
Another refinement is to weight items by their discriminative power. Items rated by everyone (blockbuster movies) reveal less about taste than niche items. IUF weighs agreement on rare items more heavily:
$$\text{iuf}(i) = \log\frac{|U|}{|U_i|}$$
where U_i is the set of users who have rated item i.
Similar to TF-IDF in text, this emphasizes distinctive preferences.
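A minimal sketch of the IUF weight itself; the numbers are illustrative, and folding the weight into a similarity (for example, multiplying each co-rated item's contribution by iuf(i)) is one common variant.

```python
import math

def iuf(n_users_total: int, n_users_rated_item: int) -> float:
    """Inverse user frequency: rare items get larger weights, blockbusters near zero."""
    return math.log(n_users_total / n_users_rated_item)

# With 1,000 users: an item everyone rated carries no signal, a niche item carries a lot
print(iuf(1000, 1000))   # 0.0
print(iuf(1000, 10))     # ~4.6
```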
Neighborhood size control is a classic bias-variance tradeoff. Small k (few neighbors) has high variance—predictions depend sensitively on which few neighbors are selected. Large k has high bias—averaging many loosely similar users washes out meaningful patterns. Cross-validation on a held-out set is essential for tuning k.
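One way to tune k is a simple hold-out sweep, sketched below using the UserBasedCF class from the implementation above, a random train/holdout split, and RMSE as the selection criterion; the helper names and candidate grid are assumptions for illustration.

```python
import random
import numpy as np

def rmse_on_holdout(train, holdout, k):
    """Fit UserBasedCF (defined earlier) with a given k and measure RMSE on held-out ratings."""
    model = UserBasedCF(k_neighbors=k, min_common_items=2).fit(train)
    errors = [(model.predict(u, i) - r) ** 2 for u, i, r in holdout]
    return float(np.sqrt(np.mean(errors)))

def tune_k(ratings, candidate_ks=(5, 10, 20, 50, 100), holdout_frac=0.2, seed=42):
    """Random split over (user, item, rating) tuples; return the k with the lowest holdout RMSE."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_frac))
    train, holdout = shuffled[:cut], shuffled[cut:]
    scores = {k: rmse_on_holdout(train, holdout, k) for k in candidate_ks}
    return min(scores, key=scores.get), scores
```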
User-based CF, despite its elegance, faces severe challenges that limit its applicability in modern systems. Understanding these is crucial for knowing when to apply (or avoid) this approach.
| User Count | Naive Pair Comparisons | Storage (float32) | Feasibility |
|---|---|---|---|
| 1,000 | 500,000 | 2 MB | Easy |
| 10,000 | 50,000,000 | 200 MB | Moderate |
| 100,000 | 5,000,000,000 | 20 GB | Challenging |
| 1,000,000 | 500,000,000,000 | 2 TB | Impractical |
| 10,000,000 | 50 trillion | 200 TB | Impossible |
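The storage figures follow directly from the pair count (roughly m²/2) at 4 bytes per float32 similarity; a few lines reproduce the table:

```python
# Reproduce the scaling table: ~m²/2 user pairs, 4 bytes per float32 similarity
for m in (1_000, 10_000, 100_000, 1_000_000, 10_000_000):
    pairs = m * m // 2
    print(f"{m:>12,} users: {pairs:,} pair comparisons, {pairs * 4 / 1e12:.6f} TB")
```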
These limitations explain why major recommender systems evolved beyond pure user-based CF. Netflix's famous Prize competition (2006-2009) was dominated by matrix factorization approaches, which handle scale and sparsity far better. Modern systems use hybrid approaches combining CF with deep learning, knowledge graphs, and contextual signals.
Despite its limitations, user-based CF can be made practical for moderate-scale systems through several optimizations:
Dimensionality Reduction:
Project users into a lower-dimensional space (for example, via truncated SVD) before computing similarities, so each user is represented by a short dense vector instead of a sparse row of the rating matrix.
Locality-Sensitive Hashing (LSH):
Hash similar users into the same buckets with high probability, so candidate neighbors can be retrieved without comparing against every other user.
Clustering-based Neighborhoods:
Cluster users offline, then search for neighbors only within the target user's cluster at query time; a small sketch of this idea follows below.
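A rough sketch of the clustering idea using scikit-learn's MiniBatchKMeans; the random user vectors, cluster count, and helper name are illustrative stand-ins for real (for example, SVD-reduced) user representations.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Illustrative: dense user vectors, e.g. SVD-reduced rating rows
rng = np.random.default_rng(0)
user_vectors = rng.random((1_000, 20))

# Cluster users offline
kmeans = MiniBatchKMeans(n_clusters=20, n_init=3, random_state=0).fit(user_vectors)
labels = kmeans.labels_

def candidate_neighbors(user_idx: int) -> np.ndarray:
    """Return indices of users in the same cluster (excluding the user themselves)."""
    same_cluster = np.where(labels == labels[user_idx])[0]
    return same_cluster[same_cluster != user_idx]

print(len(candidate_neighbors(0)), "candidates instead of", len(user_vectors) - 1)
```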
```python
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestNeighbors
import numpy as np
from scipy.sparse import csr_matrix


class ScalableUserCF:
    """
    Scalable user-based CF using dimensionality reduction
    and approximate nearest neighbor search.
    """

    def __init__(
        self,
        n_neighbors: int = 50,
        n_components: int = 100,
        metric: str = "cosine"
    ):
        self.k = n_neighbors
        self.n_components = n_components
        self.metric = metric

        self.svd = TruncatedSVD(n_components=n_components, random_state=42)
        self.nn_model = NearestNeighbors(
            n_neighbors=n_neighbors + 1,  # +1 because a user is their own nearest neighbor
            metric=metric,
            algorithm='brute'  # cosine distance requires brute-force search in scikit-learn
        )

        self.user_factors = None
        self.user_ids = None
        self.user_id_to_idx = {}
        self.ratings_matrix = None
        self.user_means = None

    def fit(self, ratings_matrix: csr_matrix, user_ids: np.ndarray):
        """
        Fit the model using a sparse ratings matrix.

        Args:
            ratings_matrix: Sparse m x n matrix of user-item ratings
            user_ids: Array of user IDs corresponding to matrix rows
        """
        self.ratings_matrix = ratings_matrix
        self.user_ids = user_ids
        self.user_id_to_idx = {uid: idx for idx, uid in enumerate(user_ids)}

        # Compute user means (over non-zero entries only)
        row_sums = np.array(ratings_matrix.sum(axis=1)).flatten()
        row_counts = np.array((ratings_matrix != 0).sum(axis=1)).flatten()
        self.user_means = np.divide(
            row_sums, row_counts,
            out=np.full_like(row_sums, 3.0),  # Default mean where a user has no ratings
            where=row_counts != 0
        )

        # Reduce dimensionality
        # Note: full mean-centering of a sparse matrix requires care;
        # here we apply SVD directly to the raw ratings for simplicity
        self.user_factors = self.svd.fit_transform(ratings_matrix)

        # Build approximate nearest neighbor index
        self.nn_model.fit(self.user_factors)

        return self

    def find_neighbors(self, user_idx: int) -> tuple:
        """Find k nearest neighbors for a user."""
        user_vector = self.user_factors[user_idx].reshape(1, -1)
        distances, indices = self.nn_model.kneighbors(user_vector)

        # Remove self from neighbors
        mask = indices[0] != user_idx
        neighbor_indices = indices[0][mask][:self.k]
        neighbor_distances = distances[0][mask][:self.k]

        # Convert distances to similarities
        # For cosine distance: similarity = 1 - distance
        similarities = 1 - neighbor_distances

        return neighbor_indices, similarities

    def predict(self, user_id: int, item_idx: int) -> float:
        """Predict rating for a user-item pair."""
        if user_id not in self.user_id_to_idx:
            return 3.0  # Unknown user

        user_idx = self.user_id_to_idx[user_id]
        neighbor_indices, similarities = self.find_neighbors(user_idx)

        # Get neighbors who rated this item
        neighbor_ratings = self.ratings_matrix[neighbor_indices, item_idx].toarray().flatten()
        rated_mask = neighbor_ratings != 0

        if not np.any(rated_mask):
            return self.user_means[user_idx]

        # Filter to neighbors who rated the item
        valid_similarities = similarities[rated_mask]
        valid_ratings = neighbor_ratings[rated_mask]
        valid_indices = neighbor_indices[rated_mask]
        valid_means = self.user_means[valid_indices]

        # Compute prediction with mean-centering
        deviations = valid_ratings - valid_means
        numerator = np.dot(valid_similarities, deviations)
        denominator = np.sum(np.abs(valid_similarities))

        if denominator == 0:
            return self.user_means[user_idx]

        prediction = self.user_means[user_idx] + numerator / denominator
        return np.clip(prediction, 1.0, 5.0)


# Complexity comparison
print("Complexity Analysis:")
print("─" * 50)
print("Naive User-based CF:")
print("  - Similarity computation: O(m² · n) precompute")
print("  - Prediction: O(m) per query")
print("  - Space: O(m²) for similarity matrix")
print()
print("Scalable User-based CF (this implementation):")
print("  - SVD: O(m · n · k) one-time")
print("  - NN Index: O(m · k · log m) one-time")
print("  - Prediction: O(k · log m) per query")
print("  - Space: O(m · k) for user factors")
```

With dimensionality reduction and approximate nearest neighbors, user-based CF can scale to millions of users. The trade-off is approximate rather than exact neighbors—but in practice, recommendation quality often improves because dominant latent factors matter more than noisy raw ratings.
We've comprehensively explored user-based collaborative filtering—from its intuitive foundation to mathematical formulation, algorithmic implementation, and practical considerations.
What's Next:
The scalability challenges of user-based CF motivated the development of item-based collaborative filtering. Since the number of items often grows more slowly than users, and item-item similarities are more stable over time, item-based CF offers compelling advantages for production systems. We'll explore this approach in depth on the next page.
You now have a deep understanding of user-based collaborative filtering—its mathematical foundation, algorithmic workflow, practical implementation, and critical limitations. This foundation prepares you for the complementary item-based approach and more advanced techniques like matrix factorization.