If item representations answer "What is this item?", then user profiles answer the equally critical question: "What does this user want?"
A user profile is a mathematical summary of a user's preferences, constructed from their interaction history. In content-based recommendation, user profiles are expressed in the same feature space as items—enabling direct comparison between what users like and what items offer.
The challenge is profound: users don't explicitly describe their preferences. We must infer them from noisy, sparse signals—a handful of clicks, purchases, or ratings scattered across a catalog of millions. From this incomplete picture, we must construct a profile that accurately predicts preferences for items the user has never seen.
By the end of this page, you will understand how to construct user profiles from interaction history, aggregate preferences across multiple items, handle temporal dynamics and evolving tastes, and balance exploitation of known preferences with exploration of new interests.
A user profile in content-based filtering is a vector representation that captures the user's preferences in the item feature space.
Formal Definition:
Given:
- A set of users $U$ and a set of items $I$
- An item representation $\phi: I \rightarrow \mathbb{R}^d$ mapping each item to a feature vector
- For each user $u \in U$, an interaction history $H_u \subseteq I$
The user profile is a function: $$\psi: U \rightarrow \mathbb{R}^d$$
Where $\psi(u)$ is a $d$-dimensional vector representing user $u$'s preferences.
The Recommendation Score:
With both user profiles and item representations in the same space, recommendation becomes a similarity computation:
$$\text{score}(u, i) = \text{sim}(\psi(u), \phi(i))$$
Common similarity functions include cosine similarity, the dot product, and negative Euclidean distance.

Building a useful profile involves several design decisions:
| Aspect | Question | Impact |
|---|---|---|
| Aggregation | How to combine multiple item interactions? | Affects profile accuracy and bias |
| Weighting | Are all interactions equally important? | Recency, engagement depth matter |
| Temporal Dynamics | How to handle evolving preferences? | Old vs new tastes balance |
| Multi-Interest | Does user have multiple taste clusters? | Single vs multi-vector profiles |
| Sparsity | How to profile users with few interactions? | Cold-start handling |
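To make the scoring step concrete, here is a small sketch in a made-up 3-dimensional feature space, comparing a user profile against two hypothetical items with cosine similarity:

```python
import numpy as np

def cosine_sim(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity: direction match, ignoring magnitude."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-d feature space: [action, romance, documentary]
profile = np.array([0.8, 0.1, 0.4])   # psi(u): leans toward action
item_a = np.array([0.9, 0.0, 0.3])    # phi(a): action-heavy item
item_b = np.array([0.1, 0.9, 0.0])    # phi(b): romance-heavy item

score_a = cosine_sim(profile, item_a)
score_b = cosine_sim(profile, item_b)
assert score_a > score_b  # the profile aligns with the action item
```

Because both vectors live in the same feature space, the comparison needs no learned model: scoring is a single dot product once vectors are normalized.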
Simple Averaging:
The most straightforward approach—average the representations of items the user has interacted with:
$$\psi(u) = \frac{1}{|H_u|} \sum_{i \in H_u} \phi(i)$$
Pros: Simple, interpretable, computationally cheap.
Cons: Ignores interaction strength, recency, and negative signals.
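A quick sketch of simple averaging, using a hypothetical four-item history in a 3-dimensional feature space:

```python
import numpy as np

# Hypothetical history H_u: two action-ish items, two documentary-ish items
history_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.0, 1.0],
    [0.2, 0.0, 0.8],
])

# psi(u) = (1/|H_u|) * sum over i in H_u of phi(i)
profile = history_embeddings.mean(axis=0)
assert np.allclose(profile, [0.5, 0.05, 0.45])
```

Note how the average sits between the two taste clusters, which previews the multi-interest problem discussed later on this page.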
Weighted Averaging:
Weight items by interaction strength (ratings, time spent, purchase amount):
$$\psi(u) = \frac{\sum_{i \in H_u} w_{ui} \cdot \phi(i)}{\sum_{i \in H_u} w_{ui}}$$
Weighting strategies:
- Rating-based: weight by explicit rating
- Recency: exponential decay with interaction age
- Engagement: weight by time spent or depth of interaction
TF-IDF Inspired:
Weight by item popularity (rare items are more informative):
$$w_{ui} = r_{ui} \cdot \log\frac{|U|}{|\{u' : i \in H_{u'}\}|}$$
Items consumed by everyone reveal little about individual preference.
```python
import numpy as np
from typing import List, Optional
from dataclasses import dataclass
from enum import Enum


class WeightingStrategy(Enum):
    UNIFORM = "uniform"
    RATING = "rating"
    RECENCY = "recency"
    ENGAGEMENT = "engagement"
    TFIDF = "tfidf"


@dataclass
class Interaction:
    item_id: int
    rating: float
    timestamp: float
    engagement_time: float = 0.0


class UserProfileBuilder:
    """
    Builds user profiles from interaction history and item embeddings.
    """

    def __init__(
        self,
        item_embeddings: np.ndarray,
        weighting: WeightingStrategy = WeightingStrategy.RATING,
        recency_decay: float = 0.01,
        normalize: bool = True
    ):
        self.item_embeddings = item_embeddings
        self.weighting = weighting
        self.recency_decay = recency_decay
        self.normalize = normalize
        self.item_popularity = None

    def set_item_popularity(self, interaction_counts: np.ndarray):
        """Set item popularity for TF-IDF weighting.

        Uses the total interaction count as a proxy for |U| in the IDF term.
        """
        total = interaction_counts.sum()
        self.item_popularity = np.log(total / (interaction_counts + 1))

    def _compute_weights(
        self,
        interactions: List[Interaction],
        current_time: Optional[float] = None
    ) -> np.ndarray:
        """Compute interaction weights based on strategy."""
        n = len(interactions)
        weights = np.ones(n)

        if self.weighting == WeightingStrategy.RATING:
            ratings = np.array([i.rating for i in interactions])
            mean_rating = ratings.mean()
            weights = ratings - mean_rating + 1  # Shift to positive

        elif self.weighting == WeightingStrategy.RECENCY:
            if current_time is None:
                current_time = max(i.timestamp for i in interactions)
            for idx, interaction in enumerate(interactions):
                age = current_time - interaction.timestamp
                weights[idx] = np.exp(-self.recency_decay * age)

        elif self.weighting == WeightingStrategy.ENGAGEMENT:
            weights = np.array([
                np.log1p(i.engagement_time) for i in interactions
            ])

        elif self.weighting == WeightingStrategy.TFIDF:
            if self.item_popularity is None:
                raise ValueError("Item popularity required for TF-IDF")
            ratings = np.array([i.rating for i in interactions])
            idf = np.array([
                self.item_popularity[i.item_id] for i in interactions
            ])
            weights = ratings * idf

        # Ensure positive weights
        weights = np.maximum(weights, 0.01)
        return weights

    def build_profile(
        self,
        interactions: List[Interaction],
        current_time: Optional[float] = None
    ) -> np.ndarray:
        """
        Build user profile from interaction history.

        Returns:
            User profile vector (same dimensionality as item embeddings)
        """
        if not interactions:
            return np.zeros(self.item_embeddings.shape[1])

        weights = self._compute_weights(interactions, current_time)

        # Gather item embeddings
        item_ids = [i.item_id for i in interactions]
        embeddings = self.item_embeddings[item_ids]

        # Weighted average
        profile = np.average(embeddings, axis=0, weights=weights)

        if self.normalize:
            norm = np.linalg.norm(profile)
            if norm > 0:
                profile = profile / norm

        return profile

    def compute_scores(
        self,
        user_profile: np.ndarray,
        candidate_items: Optional[List[int]] = None
    ) -> np.ndarray:
        """Compute recommendation scores for items."""
        if candidate_items is None:
            candidates = self.item_embeddings
        else:
            candidates = self.item_embeddings[candidate_items]

        # Cosine similarity: the profile is normalized in build_profile,
        # so candidates must be normalized here as well
        norms = np.linalg.norm(candidates, axis=1, keepdims=True)
        norms[norms == 0] = 1.0
        return (candidates / norms) @ user_profile
```

User preferences are not static—they evolve over time. A user's music taste at 20 differs from their taste at 40. Seasonal patterns emerge: holiday movies in December, fitness content in January.
Types of Temporal Effects:
1. Long-term Drift: Gradual evolution of preferences (e.g., maturing taste in wine)
2. Short-term Context: Recent items influence immediate preferences
3. Periodic Patterns: Recurring preferences based on time cycles
4. Life Events: Discrete changes (new baby → parenting content)
```python
import numpy as np
from collections import deque


class TemporalUserProfile:
    """
    User profile with temporal decay and session awareness.

    Maintains both long-term and short-term preference signals.
    """

    def __init__(
        self,
        embedding_dim: int,
        long_term_decay: float = 0.001,
        short_term_window: int = 10,
        long_short_ratio: float = 0.7
    ):
        self.embedding_dim = embedding_dim
        self.long_term_decay = long_term_decay
        self.short_term_window = short_term_window
        self.long_short_ratio = long_short_ratio

        # Long-term profile (exponential moving average)
        self.long_term_profile = np.zeros(embedding_dim)
        self.long_term_weight = 0.0

        # Short-term profile (recent window)
        self.recent_embeddings = deque(maxlen=short_term_window)

        self.last_update_time = 0.0

    def update(
        self,
        item_embedding: np.ndarray,
        timestamp: float,
        interaction_weight: float = 1.0
    ):
        """Update profile with new interaction."""
        # Apply decay to long-term profile
        if self.long_term_weight > 0:
            time_delta = timestamp - self.last_update_time
            decay = np.exp(-self.long_term_decay * time_delta)
            self.long_term_profile *= decay
            self.long_term_weight *= decay

        # Add new item to long-term
        self.long_term_profile += interaction_weight * item_embedding
        self.long_term_weight += interaction_weight

        # Add to short-term window
        self.recent_embeddings.append(
            (item_embedding, interaction_weight)
        )

        self.last_update_time = timestamp

    def get_profile(self) -> np.ndarray:
        """Get combined long-term and short-term profile."""
        # Long-term component
        if self.long_term_weight > 0:
            lt_profile = self.long_term_profile / self.long_term_weight
        else:
            lt_profile = np.zeros(self.embedding_dim)

        # Short-term component
        if self.recent_embeddings:
            embeddings = [e for e, w in self.recent_embeddings]
            weights = [w for e, w in self.recent_embeddings]
            st_profile = np.average(embeddings, axis=0, weights=weights)
        else:
            st_profile = np.zeros(self.embedding_dim)

        # Combine with ratio
        alpha = self.long_short_ratio
        combined = alpha * lt_profile + (1 - alpha) * st_profile

        # Normalize
        norm = np.linalg.norm(combined)
        return combined / norm if norm > 0 else combined
```

Many systems maintain both: a persistent long-term profile reflecting overall taste, and a session profile capturing current context. The final profile blends both—using session signals for immediate relevance while anchoring to long-term preferences.
A single vector often fails to capture the diversity of user interests. Someone who enjoys both classical music and heavy metal has preferences that a single averaged vector poorly represents.
The Averaging Problem:
If a user loves items at opposite ends of a dimension, the averaged profile lands in the middle, pointing at items the user actually has no interest in.
Multi-Interest Solutions:
1. Multiple Profile Vectors: Maintain $K$ separate interest vectors per user: $$\Psi(u) = \{\psi_1(u), \psi_2(u), \ldots, \psi_K(u)\}$$
Score via maximum: $\text{score}(u, i) = \max_k \text{sim}(\psi_k(u), \phi(i))$
2. Clustering-Based: Cluster the items in the user's history and build one profile vector per cluster, weighted by cluster size.
3. Attention-Based: Learn attention weights over history items conditioned on the candidate item, so the relevant interest dominates the score.
```python
import numpy as np
from sklearn.cluster import KMeans
from typing import List


class MultiInterestProfile:
    """
    Represents user with multiple interest clusters.
    """

    def __init__(
        self,
        max_interests: int = 5,
        min_items_per_interest: int = 3
    ):
        self.max_interests = max_interests
        self.min_items_per_interest = min_items_per_interest
        self.interest_vectors: List[np.ndarray] = []
        self.interest_weights: List[float] = []

    def _single_profile(
        self,
        item_embeddings: np.ndarray,
        interaction_weights: np.ndarray
    ):
        """Fall back to one averaged profile vector."""
        profile = np.average(
            item_embeddings, axis=0, weights=interaction_weights
        )
        self.interest_vectors = [profile / np.linalg.norm(profile)]
        self.interest_weights = [1.0]

    def build_from_history(
        self,
        item_embeddings: np.ndarray,
        interaction_weights: np.ndarray
    ):
        """Build multi-interest profile from item history."""
        n_items = len(item_embeddings)

        if n_items < self.min_items_per_interest:
            # Too few items - use single profile
            self._single_profile(item_embeddings, interaction_weights)
            return

        # Determine number of clusters
        n_clusters = min(
            self.max_interests,
            n_items // self.min_items_per_interest
        )

        # Cluster items
        kmeans = KMeans(n_clusters=n_clusters, random_state=42)
        clusters = kmeans.fit_predict(item_embeddings)

        # Build profile per cluster
        self.interest_vectors = []
        self.interest_weights = []

        for c in range(n_clusters):
            mask = clusters == c
            if mask.sum() < self.min_items_per_interest:
                continue

            cluster_embeddings = item_embeddings[mask]
            cluster_weights = interaction_weights[mask]

            profile = np.average(
                cluster_embeddings, axis=0, weights=cluster_weights
            )
            profile = profile / np.linalg.norm(profile)

            self.interest_vectors.append(profile)
            self.interest_weights.append(cluster_weights.sum())

        # Guard: if every cluster was below the size threshold,
        # fall back to a single averaged profile
        if not self.interest_vectors:
            self._single_profile(item_embeddings, interaction_weights)
            return

        # Normalize weights
        total = sum(self.interest_weights)
        self.interest_weights = [w / total for w in self.interest_weights]

    def score_item(self, item_embedding: np.ndarray) -> float:
        """Score item using max-similarity across interests."""
        if not self.interest_vectors:
            return 0.0

        item_norm = item_embedding / np.linalg.norm(item_embedding)
        similarities = [
            np.dot(interest, item_norm)
            for interest in self.interest_vectors
        ]
        return max(similarities)

    def score_item_weighted(self, item_embedding: np.ndarray) -> float:
        """Score using weighted combination of interest matches."""
        if not self.interest_vectors:
            return 0.0

        item_norm = item_embedding / np.linalg.norm(item_embedding)
        score = sum(
            weight * np.dot(interest, item_norm)
            for interest, weight in zip(
                self.interest_vectors, self.interest_weights
            )
        )
        return score
```

Users express preferences through both positive and negative signals. Knowing what users dislike is as valuable as knowing what they like.
Types of Negative Signals:
| Signal | Interpretation | Strength |
|---|---|---|
| Low rating (1-2 stars) | Explicit dislike | Strong |
| Skip/scroll past | Likely not interested | Weak |
| Short engagement | Content didn't resonate | Moderate |
| Explicit 'not interested' | Direct feedback | Strong |
| Return/refund | Post-purchase regret | Strong |
Incorporating Negative Signals:
Approach 1: Subtraction $$\psi(u) = \frac{\sum_{i \in H^+_u} \phi(i) - \beta \sum_{i \in H^-_u} \phi(i)}{|H^+_u| + \beta |H^-_u|}$$
Approach 2: Separate Profiles Maintain positive profile $\psi^+(u)$ and negative profile $\psi^-(u)$: $$\text{score}(u, i) = \text{sim}(\psi^+(u), \phi(i)) - \gamma \cdot \text{sim}(\psi^-(u), \phi(i))$$
Approach 3: Contrastive Learning Learn to push positive items closer and negative items farther in embedding space.
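As an illustration, here is a minimal sketch of Approach 2 with separate positive and negative profiles; the vectors and the $\gamma$ value are made up:

```python
import numpy as np

def score_with_negatives(pos_profile: np.ndarray,
                         neg_profile: np.ndarray,
                         item: np.ndarray,
                         gamma: float = 0.5) -> float:
    """score(u, i) = sim(psi+, phi(i)) - gamma * sim(psi-, phi(i))."""
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    return cos(pos_profile, item) - gamma * cos(neg_profile, item)

# Toy 2-d example: user likes direction [1, 0], dislikes direction [0, 1]
pos = np.array([1.0, 0.0])
neg = np.array([0.0, 1.0])

liked_score = score_with_negatives(pos, neg, np.array([0.9, 0.1]))
disliked_score = score_with_negatives(pos, neg, np.array([0.1, 0.9]))
assert liked_score > disliked_score
```

The $\gamma$ knob controls how strongly dislikes repel: $\gamma = 0$ ignores them entirely, while large values can suppress anything resembling a disliked item.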
Absence of interaction is not neutral. Users don't interact with items they're unaware of or items they've already consumed. Treating non-interactions as negative signals biases against long-tail items. Advanced approaches model the selection process explicitly.
New users have no interaction history—the cold-start problem. Content-based systems can still function using alternative signals.
Cold-Start Strategies:
1. Demographic Defaults: Initialize profiles based on demographic features: $$\psi_0(u) = f(\text{age}, \text{location}, \text{device}, ...)$$
2. Onboarding Preferences: Ask users to indicate interests during signup; selected genres or topics seed the initial profile.
3. Contextual Signals: Use available context for initial recommendations, such as time of day, device, location, or referral source.
4. Popularity Fallback: Recommend popular items until the profile develops; popularity is a safe default while interactions accumulate.
5. Exploration: Strategically show diverse items to learn preferences quickly; each interaction with a diverse slate is highly informative.
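One way to combine the popularity fallback with a growing personal profile is a confidence-weighted blend. This sketch is one such heuristic (the smoothing constant `k` and all vectors are illustrative, not a prescribed method):

```python
import numpy as np

def cold_start_profile(interaction_embeddings: list,
                       popularity_profile: np.ndarray,
                       k: float = 10.0) -> np.ndarray:
    """Blend a popularity-based prior with the user's own sparse history.

    With zero interactions the profile is the pure prior; confidence in
    the personal average grows as the history length approaches k.
    """
    n = len(interaction_embeddings)
    if n == 0:
        return popularity_profile
    personal = np.mean(interaction_embeddings, axis=0)
    alpha = n / (n + k)  # confidence in the personal signal
    return alpha * personal + (1 - alpha) * popularity_profile

prior = np.array([0.5, 0.5])
# Brand-new user: falls back to the popularity prior
assert np.allclose(cold_start_profile([], prior), prior)
# After a few interactions, the profile shifts toward personal taste
few = [np.array([1.0, 0.0])] * 5
blended = cold_start_profile(few, prior, k=10.0)
assert blended[0] > prior[0]
```

This makes the transition out of cold start gradual rather than a hard cutover at some interaction count.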
Production systems must efficiently store and update millions of user profiles.
Storage Considerations: A profile is a dense $d$-dimensional vector, typically a few hundred floats per user, so even millions of profiles fit comfortably in a low-latency key-value store or in memory.
Update Strategies:
Batch Updates: Recompute all profiles periodically (e.g., nightly) from full histories; simple and consistent, but profiles go stale between runs.
Real-Time Updates: Update each profile incrementally as interactions arrive; profiles stay fresh, at the cost of an always-on update path.
Hybrid: Recompute in batch periodically and apply incremental deltas in between, combining consistency with freshness.
For weighted averages, maintain two running quantities: the weighted sum $\sum_i w_i \, \phi(i)$ and the total weight $\sum_i w_i$. On each new interaction, add the new term to both; the profile is the running sum divided by the total weight. This enables $O(d)$ updates without reprocessing history.
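A minimal sketch of this incremental scheme (class and variable names are illustrative):

```python
import numpy as np

class IncrementalProfile:
    """O(d) incremental weighted-average profile via running sums."""

    def __init__(self, dim: int):
        self.running_sum = np.zeros(dim)  # sum of w * e over interactions
        self.total_weight = 0.0           # sum of w

    def add(self, embedding: np.ndarray, weight: float = 1.0):
        """Fold one new interaction into the running sums."""
        self.running_sum += weight * embedding
        self.total_weight += weight

    def profile(self) -> np.ndarray:
        """Current profile = running_sum / total_weight."""
        if self.total_weight == 0:
            return np.zeros_like(self.running_sum)
        return self.running_sum / self.total_weight

p = IncrementalProfile(dim=2)
p.add(np.array([1.0, 0.0]), weight=2.0)
p.add(np.array([0.0, 1.0]), weight=1.0)
# (2*[1,0] + 1*[0,1]) / 3 = [2/3, 1/3]
assert np.allclose(p.profile(), [2/3, 1/3])
```

Note this scheme does not handle recency decay on its own; combining it with exponential decay (as in `TemporalUserProfile` above) requires decaying both running quantities before each update.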
What's Next:
With item representations and user profiles established, we'll dive deep into TF-IDF and embeddings—the core techniques for representing textual content that powers many content-based recommendation systems.
You now understand how to construct, maintain, and evolve user profiles for content-based recommendation. You can handle diverse preferences, temporal dynamics, and cold-start scenarios while building systems that scale to millions of users.