Before a recommendation system can suggest that you might enjoy a particular movie, book, or product, it must first understand what that item is. This understanding is not human intuition—it's a mathematical representation that captures the essential characteristics of items in a form that algorithms can process and compare.
Item representation is the foundation of content-based recommendation. While collaborative filtering learns from user behavior patterns alone, content-based methods require a principled way to describe what items are—their features, attributes, semantics, and relationships. The quality of recommendations is directly bounded by the quality of these representations.
Consider the challenge: How do you mathematically represent a movie like Inception? Is it the genre (science fiction)? The director (Christopher Nolan)? The actors? The visual style? The narrative complexity? The emotional arc? A great item representation captures the dimensions that matter for user preference—and this is both an art and a science.
By the end of this page, you will understand how to construct effective item representations from structured metadata, unstructured content, and learned embeddings. You'll master feature engineering for different content types, understand the trade-offs between hand-crafted and learned representations, and be equipped to build robust item understanding systems for production recommendations.
In content-based recommendation, items are represented as feature vectors in a mathematical space. The fundamental assumption is that similar items should have similar representations, and users who liked certain items will prefer other items with similar features.
Formal Definition:
An item representation is a function that maps each item to a fixed-dimensional vector:
$$\phi: I \rightarrow \mathbb{R}^d$$
Where $I$ is the set of items in the catalog and $d$ is the dimensionality of the feature space.
The quality of $\phi$ determines the system's ability to distinguish items that users perceive as different, compute meaningful similarities between items, and ultimately match items to user preferences.
Why Representation Quality Matters:
Imagine recommending movies using only the decade of release as a feature. Two movies from the 1990s would be considered identical, regardless of genre, tone, or quality. Clearly, this representation fails to capture what makes movies similar from a preference perspective.
| Quality Dimension | Good Representation | Poor Representation |
|---|---|---|
| Preference Relevance | Captures dimensions users care about | Captures irrelevant metadata |
| Discriminative Power | Distinguishes items users perceive as different | Groups dissimilar items together |
| Semantic Coherence | Similar vectors = similar items | Similar items may have distant vectors |
| Completeness | Represents all relevant aspects | Missing important dimensions |
| Computational Tractability | Efficient similarity computation | Prohibitively expensive to compare |
No algorithm can overcome a poor item representation. If two items that users perceive as vastly different have identical representations, the system cannot distinguish them. Investing in high-quality item representations often yields larger improvements than sophisticated model architectures built on weak features.
The most straightforward source of item features is structured metadata—categorical and numerical attributes stored in databases. This data is typically well-organized, easy to access, and interpretable.
Common Structured Metadata Types:
Categorical Attributes: single-valued labels such as genre, brand, director, language, or content rating.
Numerical Attributes: continuous values such as price, runtime, release year, or average rating.
Hierarchical Attributes: taxonomy paths such as Electronics > Audio > Headphones, where parent categories carry information about their children.
```python
import numpy as np
from typing import Dict, List, Any
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MultiLabelBinarizer
from collections import defaultdict


class StructuredItemFeatureExtractor:
    """
    Extracts feature vectors from structured item metadata.

    This class handles the conversion of heterogeneous structured attributes
    into a unified numerical representation suitable for content-based
    recommendation.
    """

    def __init__(self):
        self.categorical_encoders: Dict[str, OneHotEncoder] = {}
        self.multilabel_encoders: Dict[str, MultiLabelBinarizer] = {}
        self.numerical_scalers: Dict[str, StandardScaler] = {}
        self.feature_dims: Dict[str, int] = {}
        self.is_fitted = False

    def fit(
        self,
        items: List[Dict[str, Any]],
        categorical_fields: List[str],
        multilabel_fields: List[str],
        numerical_fields: List[str]
    ) -> 'StructuredItemFeatureExtractor':
        """
        Learn encoding parameters from training data.

        Args:
            items: List of item dictionaries with metadata
            categorical_fields: Single-value categorical attributes
            multilabel_fields: Multi-value categorical attributes (e.g., genres)
            numerical_fields: Continuous numerical attributes
        """
        # Fit categorical encoders (one-hot encoding)
        for field in categorical_fields:
            values = [[item.get(field, 'UNKNOWN')] for item in items]
            encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
            encoder.fit(values)
            self.categorical_encoders[field] = encoder
            self.feature_dims[field] = len(encoder.categories_[0])

        # Fit multi-label encoders (multi-hot encoding for tags, genres, etc.)
        for field in multilabel_fields:
            values = [item.get(field, []) for item in items]
            encoder = MultiLabelBinarizer()
            encoder.fit(values)
            self.multilabel_encoders[field] = encoder
            self.feature_dims[field] = len(encoder.classes_)

        # Fit numerical scalers (z-score normalization)
        for field in numerical_fields:
            values = np.array([
                [item.get(field, 0.0)] for item in items
            ], dtype=float)
            # Handle missing values with median imputation
            median_val = np.nanmedian(values)
            values = np.where(np.isnan(values), median_val, values)
            scaler = StandardScaler()
            scaler.fit(values)
            self.numerical_scalers[field] = scaler
            self.feature_dims[field] = 1

        self.is_fitted = True
        return self

    def transform(self, item: Dict[str, Any]) -> np.ndarray:
        """
        Transform a single item into its feature vector.

        Args:
            item: Dictionary containing item metadata

        Returns:
            Concatenated feature vector as numpy array
        """
        if not self.is_fitted:
            raise ValueError("Extractor must be fitted before transform")

        feature_parts = []

        # Encode categorical fields
        for field, encoder in self.categorical_encoders.items():
            value = [[item.get(field, 'UNKNOWN')]]
            encoded = encoder.transform(value)[0]
            feature_parts.append(encoded)

        # Encode multi-label fields
        for field, encoder in self.multilabel_encoders.items():
            values = item.get(field, [])
            # Handle unknown labels gracefully
            known_values = [v for v in values if v in encoder.classes_]
            encoded = encoder.transform([known_values])[0]
            feature_parts.append(encoded)

        # Scale numerical fields
        for field, scaler in self.numerical_scalers.items():
            value = item.get(field, 0.0)
            if value is None or np.isnan(value):
                value = 0.0  # Use the mean (scaled to 0)
            scaled = scaler.transform([[value]])[0]
            feature_parts.append(scaled)

        return np.concatenate(feature_parts)

    def get_feature_names(self) -> List[str]:
        """Return descriptive names for all feature dimensions."""
        names = []
        for field, encoder in self.categorical_encoders.items():
            names.extend([f"{field}_{cat}" for cat in encoder.categories_[0]])
        for field, encoder in self.multilabel_encoders.items():
            names.extend([f"{field}_{cls}" for cls in encoder.classes_])
        for field in self.numerical_scalers.keys():
            names.append(field)
        return names

    @property
    def total_dimensions(self) -> int:
        """Total dimensionality of the feature vector."""
        return sum(self.feature_dims.values())


# Example usage demonstrating feature extraction for movies
if __name__ == "__main__":
    movies = [
        {
            "id": "m1",
            "title": "Inception",
            "genres": ["Sci-Fi", "Action", "Thriller"],
            "director": "Christopher Nolan",
            "rating": "PG-13",
            "runtime_minutes": 148,
            "budget_millions": 160,
            "release_year": 2010
        },
        {
            "id": "m2",
            "title": "The Notebook",
            "genres": ["Romance", "Drama"],
            "director": "Nick Cassavetes",
            "rating": "PG-13",
            "runtime_minutes": 123,
            "budget_millions": 29,
            "release_year": 2004
        },
        {
            "id": "m3",
            "title": "Interstellar",
            "genres": ["Sci-Fi", "Drama", "Adventure"],
            "director": "Christopher Nolan",
            "rating": "PG-13",
            "runtime_minutes": 169,
            "budget_millions": 165,
            "release_year": 2014
        }
    ]

    extractor = StructuredItemFeatureExtractor()
    extractor.fit(
        items=movies,
        categorical_fields=["director", "rating"],
        multilabel_fields=["genres"],
        numerical_fields=["runtime_minutes", "budget_millions", "release_year"]
    )

    print(f"Total feature dimensions: {extractor.total_dimensions}")
    print(f"Feature names: {extractor.get_feature_names()}")

    for movie in movies:
        features = extractor.transform(movie)
        print(f"\n{movie['title']}: {features.shape}")
        print(f"  Vector: {features[:10]}...")  # First 10 dims
```

Feature Engineering Considerations:
1. Cardinality Management: High-cardinality categorical fields (e.g., thousands of brands) can explode feature dimensionality. Solutions include grouping rare values into an "other" bucket, hashing values into a fixed number of dimensions (see the sketch after the note below), or replacing one-hot columns with learned embeddings.
2. Missing Value Handling: Missing metadata is ubiquitous in real systems. Strategies include imputing a median or default value (as the extractor above does), reserving an explicit UNKNOWN category, and adding a binary "is missing" indicator so absence itself can act as a signal.
3. Feature Interactions: Some information emerges only from combinations of fields; for example, genre together with release decade distinguishes 1990s sci-fi from 2010s sci-fi in a way neither field captures alone.
One-hot encoding high-cardinality fields creates extremely sparse vectors. A catalog with 50,000 brands creates a 50,000-dimensional feature where each item has exactly one non-zero entry. This causes storage inefficiency, computational overhead, and difficulty learning meaningful similarities. Always consider dimensionality reduction or embeddings for high-cardinality fields.
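As a rough illustration of the hashing option mentioned above, the sketch below uses scikit-learn's FeatureHasher to map an arbitrary number of brand values into a fixed 256-dimensional space. The field name, brand strings, and dimension budget are illustrative placeholders, not values from this page.

```python
from sklearn.feature_extraction import FeatureHasher

# Hash a high-cardinality categorical field (e.g., brand) into a fixed,
# modest number of dimensions instead of one column per distinct value.
hasher = FeatureHasher(n_features=256, input_type="string")

items = [
    {"brand": "acme_audio"},   # hypothetical brand names
    {"brand": "nolanwear"},
    {"brand": "acme_audio"},
]

# FeatureHasher with input_type="string" expects an iterable of iterables of
# strings; prefixing with the field name keeps different fields distinguishable.
hashed = hasher.transform([[f"brand={item['brand']}"] for item in items])

print(hashed.shape)  # (3, 256) sparse matrix
print(hashed.nnz)    # a handful of non-zero entries per row
```

The dimensionality stays fixed no matter how many brands appear later, at the cost of occasional hash collisions.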
Much of the most informative content about items is unstructured—text descriptions, images, audio, and video. These rich modalities contain nuanced information that structured metadata cannot capture.
Unstructured Content Types:
Textual Content: titles, descriptions, synopses, reviews, and editorial tags.
Visual Content: product photos, cover art, posters, thumbnails, and video frames.
Audio Content: music tracks, podcast episodes, and audio previews.
The Challenge: Compared to structured metadata, unstructured content requires sophisticated processing to convert into useful features. But the payoff is significant—a product description often reveals attributes not captured in any structured field.
Text Feature Extraction Approaches:
1. Bag-of-Words (BoW) and TF-IDF: Classic approaches that represent documents as term frequency vectors. Simple, interpretable, and effective for many applications. (Covered in depth on Page 2.)
2. Topic Models (LDA, NMF): Discover latent topics in document collections. Each item is represented by its topic distribution—a dense, interpretable, lower-dimensional representation.
3. Word Embeddings (Word2Vec, GloVe, FastText): Represent words as dense vectors capturing semantic relationships. Document representation via averaging or TF-IDF-weighted aggregation of word vectors.
4. Sentence/Document Embeddings (BERT, Sentence-BERT): Modern transformer-based models that encode entire passages into contextual embeddings. State-of-the-art for semantic similarity but computationally expensive. (A minimal usage sketch follows this list.)
5. Named Entity Recognition (NER) + Knowledge Linking: Extract structured entities from text (people, places, organizations) and link to knowledge bases for enriched features.
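Building on approach 4 above, the sketch below shows sentence-level embeddings with the third-party sentence-transformers package, assuming it is installed; the all-MiniLM-L6-v2 checkpoint and the example descriptions are illustrative choices, not prescribed by this page.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumes `pip install sentence-transformers`; model choice is illustrative
model = SentenceTransformer("all-MiniLM-L6-v2")

descriptions = [
    "Wireless noise-cancelling headphones for travel.",
    "Studio monitor headphones with flat frequency response.",
]

# encode() returns one dense vector per input text (384 dims for this model)
embeddings = model.encode(descriptions, normalize_embeddings=True)

# With unit-normalized vectors, the dot product equals cosine similarity
similarity = float(np.dot(embeddings[0], embeddings[1]))
print(embeddings.shape, round(similarity, 3))
```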
```python
import numpy as np
from typing import List, Dict, Optional
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import re


class TextFeatureExtractor:
    """
    Extracts feature vectors from textual content using multiple strategies.

    Supports TF-IDF, dimensionality reduction via LSA, and integration
    with pre-trained embeddings for semantic understanding.
    """

    def __init__(
        self,
        max_features: int = 5000,
        ngram_range: tuple = (1, 2),
        use_lsa: bool = True,
        lsa_components: int = 100,
        min_df: int = 2,
        max_df: float = 0.95
    ):
        """
        Initialize text feature extractor.

        Args:
            max_features: Maximum vocabulary size for TF-IDF
            ngram_range: N-gram range (e.g., (1,2) for unigrams and bigrams)
            use_lsa: Whether to apply LSA dimensionality reduction
            lsa_components: Number of LSA components (final dimensionality)
            min_df: Minimum document frequency for terms
            max_df: Maximum document frequency (filter common terms)
        """
        self.tfidf = TfidfVectorizer(
            max_features=max_features,
            ngram_range=ngram_range,
            min_df=min_df,
            max_df=max_df,
            stop_words='english',
            lowercase=True,
            strip_accents='unicode'
        )
        self.use_lsa = use_lsa
        self.lsa = TruncatedSVD(n_components=lsa_components) if use_lsa else None
        self.is_fitted = False

    @staticmethod
    def preprocess_text(text: str) -> str:
        """Clean and normalize text for feature extraction."""
        if not text:
            return ""
        # Convert to lowercase
        text = text.lower()
        # Remove special characters, keep alphanumeric and spaces
        text = re.sub(r'[^a-z0-9\s]', ' ', text)
        # Normalize whitespace
        text = ' '.join(text.split())
        return text

    def fit(self, texts: List[str]) -> 'TextFeatureExtractor':
        """
        Learn vocabulary and LSA transformation from corpus.

        Args:
            texts: List of text documents (one per item)
        """
        # Preprocess all texts
        processed_texts = [self.preprocess_text(t) for t in texts]

        # Fit TF-IDF vectorizer
        tfidf_matrix = self.tfidf.fit_transform(processed_texts)

        # Fit LSA if enabled
        if self.use_lsa:
            self.lsa.fit(tfidf_matrix)

        self.is_fitted = True
        return self

    def transform(self, text: str) -> np.ndarray:
        """
        Transform a single text into its feature vector.

        Args:
            text: Text content of an item

        Returns:
            Feature vector (TF-IDF or LSA-reduced)
        """
        if not self.is_fitted:
            raise ValueError("Extractor must be fitted before transform")

        processed = self.preprocess_text(text)
        tfidf_vec = self.tfidf.transform([processed])

        if self.use_lsa:
            return self.lsa.transform(tfidf_vec)[0]
        return tfidf_vec.toarray()[0]

    def transform_batch(self, texts: List[str]) -> np.ndarray:
        """Transform multiple texts efficiently."""
        if not self.is_fitted:
            raise ValueError("Extractor must be fitted before transform")

        processed = [self.preprocess_text(t) for t in texts]
        tfidf_matrix = self.tfidf.transform(processed)

        if self.use_lsa:
            return self.lsa.transform(tfidf_matrix)
        return tfidf_matrix.toarray()

    @property
    def output_dimension(self) -> int:
        """Dimensionality of output feature vectors."""
        if self.use_lsa:
            return self.lsa.n_components
        return len(self.tfidf.vocabulary_)

    def get_top_terms(self, feature_vector: np.ndarray, n: int = 10) -> List[str]:
        """
        Get most important terms for a feature vector (interpretability).

        Only works when LSA is disabled.
        """
        if self.use_lsa:
            raise ValueError("Term lookup not available with LSA reduction")

        vocab_array = np.array(self.tfidf.get_feature_names_out())
        top_indices = np.argsort(feature_vector)[-n:][::-1]
        return list(vocab_array[top_indices])


# Example: Extracting features from product descriptions
if __name__ == "__main__":
    product_descriptions = [
        "Premium wireless bluetooth headphones with active noise cancellation. "
        "40-hour battery life, comfortable memory foam ear cushions. Perfect for travel.",

        "Professional studio monitor headphones with flat frequency response. "
        "Detachable cable, over-ear design. Industry standard for mixing and mastering.",

        "Kids-friendly wireless headphones with volume limiter at 85dB. "
        "Colorful designs, durable construction, built-in microphone for calls.",

        "Gaming headset with 7.1 surround sound and RGB lighting. "
        "Retractable boom microphone, compatibility with PC, Xbox, PlayStation."
    ]

    extractor = TextFeatureExtractor(
        max_features=1000,
        ngram_range=(1, 2),
        use_lsa=True,
        lsa_components=3,  # small rank: the demo corpus has only 4 documents
        min_df=1           # keep terms that appear in a single document
    )
    extractor.fit(product_descriptions)

    print(f"Output dimension: {extractor.output_dimension}")

    # Transform and compare similarity
    features = extractor.transform_batch(product_descriptions)

    # Compute pairwise cosine similarities
    from sklearn.metrics.pairwise import cosine_similarity
    similarities = cosine_similarity(features)

    print("\nPairwise Similarities:")
    labels = ["Travel NC", "Studio", "Kids", "Gaming"]
    for i, label_i in enumerate(labels):
        for j, label_j in enumerate(labels):
            if i < j:
                print(f"  {label_i} vs {label_j}: {similarities[i, j]:.3f}")
```

For image-based features, convolutional neural networks (CNNs) pretrained on ImageNet are the standard. Using models like ResNet, VGG, or EfficientNet as feature extractors, we obtain dense embeddings (typically 512-2048 dimensions) that capture visual attributes like color, texture, shape, and style. These embeddings enable 'visual similarity' recommendations crucial for fashion, furniture, and art applications.
Embeddings have become the dominant paradigm for item representation in modern systems. Rather than hand-engineering features, we learn dense, low-dimensional vectors that capture semantic relationships directly from data.
What Are Embeddings?
An embedding is a mapping from discrete entities (words, items, users) to continuous vectors in $\mathbb{R}^d$, where $d$ is typically 64-512:
$$e: \text{Entity} \rightarrow \mathbb{R}^d$$
The key property is that embeddings are learned such that entities with similar properties or behaviors have similar vectors (close in the embedding space).
Why Embeddings Work:

- They are dense and low-dimensional, so storing and comparing millions of items stays tractable.
- They are learned from data, so the dimensions reflect the co-occurrence and preference patterns that actually matter rather than hand-picked attributes.
- Entities with similar properties or behaviors end up close together, which is exactly the property similarity-based recommendation relies on.
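To make "close in the embedding space" concrete, here is a tiny nearest-neighbor sketch over item vectors; the random embedding matrix is a placeholder standing in for learned embeddings such as those produced by the Item2Vec model below.

```python
import numpy as np

rng = np.random.default_rng(42)
item_embeddings = rng.normal(size=(1000, 64))  # placeholder learned item vectors

# Normalize rows so a dot product equals cosine similarity
item_embeddings /= np.linalg.norm(item_embeddings, axis=1, keepdims=True)

def most_similar(item_id: int, k: int = 5) -> np.ndarray:
    """Return the ids of the k items closest to item_id in embedding space."""
    scores = item_embeddings @ item_embeddings[item_id]
    scores[item_id] = -np.inf  # exclude the query item itself
    return np.argsort(scores)[-k:][::-1]

print(most_similar(0))
```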
```python
import numpy as np
import torch
import torch.nn as nn
from typing import Dict, List, Optional, Tuple
from collections import defaultdict


class Item2VecModel(nn.Module):
    """
    Item2Vec: Learn item embeddings from co-occurrence in user sessions.

    Inspired by Word2Vec, but applied to items. Items that appear in the
    same user session (or basket, playlist) learn similar embeddings.
    Uses the Skip-gram architecture with negative sampling.
    """

    def __init__(
        self,
        n_items: int,
        embedding_dim: int = 128,
        n_negative_samples: int = 5
    ):
        """
        Initialize Item2Vec model.

        Args:
            n_items: Total number of unique items
            embedding_dim: Dimensionality of item embeddings
            n_negative_samples: Negative samples per positive pair
        """
        super().__init__()
        self.n_items = n_items
        self.embedding_dim = embedding_dim
        self.n_negative_samples = n_negative_samples

        # Target item embeddings (the ones we ultimately use)
        self.target_embeddings = nn.Embedding(n_items, embedding_dim)
        # Context item embeddings (for training only)
        self.context_embeddings = nn.Embedding(n_items, embedding_dim)

        # Initialize embeddings
        nn.init.xavier_uniform_(self.target_embeddings.weight)
        nn.init.xavier_uniform_(self.context_embeddings.weight)

    def forward(
        self,
        target_items: torch.Tensor,
        context_items: torch.Tensor,
        negative_items: torch.Tensor
    ) -> torch.Tensor:
        """
        Compute skip-gram loss with negative sampling.

        Args:
            target_items: (batch_size,) target item indices
            context_items: (batch_size,) context item indices
            negative_items: (batch_size, n_negative) negative sample indices

        Returns:
            Scalar loss tensor
        """
        # Get embeddings
        target_emb = self.target_embeddings(target_items)       # (batch, dim)
        context_emb = self.context_embeddings(context_items)    # (batch, dim)
        negative_emb = self.context_embeddings(negative_items)  # (batch, n_neg, dim)

        # Positive pair score: dot product
        pos_score = torch.sum(target_emb * context_emb, dim=1)  # (batch,)
        pos_loss = -torch.log(torch.sigmoid(pos_score) + 1e-10)

        # Negative pair scores
        neg_scores = torch.bmm(
            negative_emb, target_emb.unsqueeze(2)
        ).squeeze(2)  # (batch, n_neg)
        neg_loss = -torch.log(torch.sigmoid(-neg_scores) + 1e-10).sum(dim=1)

        return (pos_loss + neg_loss).mean()

    def get_item_embedding(self, item_id: int) -> np.ndarray:
        """Get the learned embedding for a single item."""
        with torch.no_grad():
            return self.target_embeddings.weight[item_id].detach().cpu().numpy()

    def get_all_embeddings(self) -> np.ndarray:
        """Get all item embeddings as numpy array."""
        with torch.no_grad():
            return self.target_embeddings.weight.detach().cpu().numpy()


class Item2VecTrainer:
    """Training pipeline for Item2Vec model."""

    def __init__(
        self,
        model: Item2VecModel,
        learning_rate: float = 0.001,
        window_size: int = 5
    ):
        self.model = model
        self.optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
        self.window_size = window_size
        self.item_frequencies = None

    def prepare_training_data(
        self,
        sessions: List[List[int]]
    ) -> List[Tuple[int, int]]:
        """
        Generate (target, context) pairs from user sessions.

        For each item in a session, creates pairs with items within
        the context window on both sides.
        """
        pairs = []
        item_counts = defaultdict(int)

        for session in sessions:
            for i, target in enumerate(session):
                item_counts[target] += 1
                # Context window
                start = max(0, i - self.window_size)
                end = min(len(session), i + self.window_size + 1)
                for j in range(start, end):
                    if i != j:
                        pairs.append((target, session[j]))

        # Store frequencies for negative sampling
        total = sum(item_counts.values())
        self.item_frequencies = np.array([
            (item_counts.get(i, 0) / total) ** 0.75  # Smoothed frequency
            for i in range(self.model.n_items)
        ])
        self.item_frequencies /= self.item_frequencies.sum()

        return pairs

    def sample_negatives(self, batch_size: int) -> torch.Tensor:
        """Sample negative items based on smoothed unigram distribution."""
        negatives = np.random.choice(
            self.model.n_items,
            size=(batch_size, self.model.n_negative_samples),
            p=self.item_frequencies
        )
        return torch.tensor(negatives, dtype=torch.long)

    def train_epoch(
        self,
        pairs: List[Tuple[int, int]],
        batch_size: int = 512
    ) -> float:
        """Train for one epoch, return average loss."""
        np.random.shuffle(pairs)
        total_loss = 0.0
        n_batches = 0

        for i in range(0, len(pairs), batch_size):
            batch = pairs[i:i + batch_size]
            targets = torch.tensor([p[0] for p in batch])
            contexts = torch.tensor([p[1] for p in batch])
            negatives = self.sample_negatives(len(batch))

            self.optimizer.zero_grad()
            loss = self.model(targets, contexts, negatives)
            loss.backward()
            self.optimizer.step()

            total_loss += loss.item()
            n_batches += 1

        return total_loss / n_batches


# Example usage
if __name__ == "__main__":
    # Simulated user sessions (lists of item IDs)
    sessions = [
        [0, 1, 2, 3, 4],   # User 1's session
        [1, 2, 5, 6],      # User 2's session
        [0, 2, 3, 7, 8],   # User 3's session
        [5, 6, 9, 10],     # User 4's session
        [1, 3, 4, 7],      # User 5's session
        # ... many more sessions in practice
    ]

    n_items = 100  # Total items in catalog
    model = Item2VecModel(n_items=n_items, embedding_dim=64)
    trainer = Item2VecTrainer(model, learning_rate=0.01)

    pairs = trainer.prepare_training_data(sessions)
    print(f"Generated {len(pairs)} training pairs")

    # Train for a few epochs
    for epoch in range(5):
        loss = trainer.train_epoch(pairs, batch_size=32)
        print(f"Epoch {epoch + 1}: Loss = {loss:.4f}")

    # Get embeddings
    embeddings = model.get_all_embeddings()
    print(f"\nLearned embeddings shape: {embeddings.shape}")
```

The most powerful systems combine both. Content embeddings handle cold-start (new items with no interactions), while collaborative embeddings capture preference patterns that content alone cannot reveal. Common strategies include concatenation, weighted averaging, or learning a projection that aligns both embedding spaces.
Real-world items are inherently multimodal—a product has text descriptions, images, structured attributes, and behavioral signals. State-of-the-art systems learn unified representations that capture information from all available modalities.
The Multimodal Challenge:
Each modality has its own dimensionality, feature extraction pipeline, statistical scale, and noise characteristics, so representations must be aligned before they can be usefully combined.
Multimodal Fusion Strategies:
Early Fusion: Concatenate features from all modalities before learning: $$\phi_{\text{early}}(i) = [\phi_{\text{text}}(i) | \phi_{\text{image}}(i) | \phi_{\text{meta}}(i)]$$
Pros: Simple, allows cross-modal interactions. Cons: High dimensionality, modality imbalance.
Late Fusion: Learn separate models per modality, combine predictions: $$s_{\text{late}}(u, i) = \alpha \cdot s_{\text{text}}(u, i) + \beta \cdot s_{\text{image}}(u, i) + \gamma \cdot s_{\text{meta}}(u, i)$$
Pros: Modular, handles missing modalities gracefully. Cons: No cross-modal feature interactions.
Cross-Modal Attention: Use attention mechanisms to learn how modalities should interact: $$\phi_{\text{fused}}(i) = \text{Attention}(\phi_{\text{text}}(i), \phi_{\text{image}}(i))$$
Pros: Learns dynamic, context-dependent fusion. Cons: Higher computational cost, more training data needed.
```python
import torch
import torch.nn as nn
from typing import Dict, Optional


class MultimodalItemEncoder(nn.Module):
    """
    Encodes items using multiple modalities with learnable fusion.

    This architecture takes pre-extracted features from each modality
    and learns to combine them into a unified item representation.
    """

    def __init__(
        self,
        text_dim: int = 768,          # e.g., BERT output
        image_dim: int = 2048,        # e.g., ResNet output
        metadata_dim: int = 128,      # Structured features
        output_dim: int = 256,        # Final embedding dimension
        fusion_method: str = 'gated'  # 'concat', 'attention', 'gated'
    ):
        super().__init__()
        self.fusion_method = fusion_method
        self.output_dim = output_dim

        # Modality-specific projection layers
        self.text_proj = nn.Sequential(
            nn.Linear(text_dim, output_dim),
            nn.LayerNorm(output_dim),
            nn.ReLU(),
            nn.Dropout(0.1)
        )
        self.image_proj = nn.Sequential(
            nn.Linear(image_dim, output_dim),
            nn.LayerNorm(output_dim),
            nn.ReLU(),
            nn.Dropout(0.1)
        )
        self.meta_proj = nn.Sequential(
            nn.Linear(metadata_dim, output_dim),
            nn.LayerNorm(output_dim),
            nn.ReLU(),
            nn.Dropout(0.1)
        )

        # Fusion-specific layers
        if fusion_method == 'concat':
            self.fusion = nn.Sequential(
                nn.Linear(output_dim * 3, output_dim),
                nn.LayerNorm(output_dim),
                nn.ReLU()
            )
        elif fusion_method == 'attention':
            # Cross-modal attention
            self.attention = nn.MultiheadAttention(
                embed_dim=output_dim,
                num_heads=4,
                batch_first=True
            )
            self.fusion_norm = nn.LayerNorm(output_dim)
        elif fusion_method == 'gated':
            # Gated fusion: learn which modalities to emphasize
            self.gate_net = nn.Sequential(
                nn.Linear(output_dim * 3, 3),
                nn.Softmax(dim=-1)
            )

        # Final projection
        self.final_proj = nn.Linear(output_dim, output_dim)

    def forward(
        self,
        text_features: torch.Tensor,      # (batch, text_dim)
        image_features: torch.Tensor,     # (batch, image_dim)
        metadata_features: torch.Tensor,  # (batch, metadata_dim)
        modality_mask: Optional[Dict[str, torch.Tensor]] = None
    ) -> torch.Tensor:
        """
        Forward pass combining all modalities.

        Args:
            text_features: Text embeddings (e.g., from BERT)
            image_features: Image embeddings (e.g., from ResNet)
            metadata_features: Structured metadata features
            modality_mask: Optional dict indicating available modalities

        Returns:
            Unified item embedding (batch, output_dim)
        """
        # Project each modality to common dimension
        text_emb = self.text_proj(text_features)      # (batch, output_dim)
        image_emb = self.image_proj(image_features)   # (batch, output_dim)
        meta_emb = self.meta_proj(metadata_features)  # (batch, output_dim)

        # Handle missing modalities by zeroing
        if modality_mask is not None:
            if 'text' in modality_mask:
                text_emb = text_emb * modality_mask['text'].unsqueeze(-1)
            if 'image' in modality_mask:
                image_emb = image_emb * modality_mask['image'].unsqueeze(-1)
            if 'metadata' in modality_mask:
                meta_emb = meta_emb * modality_mask['metadata'].unsqueeze(-1)

        # Fuse modalities
        if self.fusion_method == 'concat':
            combined = torch.cat([text_emb, image_emb, meta_emb], dim=-1)
            fused = self.fusion(combined)
        elif self.fusion_method == 'attention':
            # Stack modalities as sequence: (batch, 3, output_dim)
            modality_stack = torch.stack([text_emb, image_emb, meta_emb], dim=1)
            # Self-attention across modalities
            attended, _ = self.attention(
                modality_stack, modality_stack, modality_stack
            )
            # Mean pooling over modalities
            fused = self.fusion_norm(attended.mean(dim=1))
        elif self.fusion_method == 'gated':
            # Learn per-sample modality weights
            combined = torch.cat([text_emb, image_emb, meta_emb], dim=-1)
            gates = self.gate_net(combined)  # (batch, 3)
            # Weighted combination
            fused = (
                gates[:, 0:1] * text_emb
                + gates[:, 1:2] * image_emb
                + gates[:, 2:3] * meta_emb
            )
        else:
            raise ValueError(f"Unknown fusion method: {self.fusion_method}")

        return self.final_proj(fused)


class MultimodalItemRepresentationSystem:
    """
    Complete pipeline for multimodal item representation.

    Integrates pre-trained feature extractors with learnable fusion.
    """

    def __init__(
        self,
        text_encoder,        # e.g., SentenceTransformer
        image_encoder,       # e.g., CLIP or ResNet wrapper
        metadata_extractor,  # StructuredItemFeatureExtractor
        fusion_dim: int = 256
    ):
        self.text_encoder = text_encoder
        self.image_encoder = image_encoder
        self.metadata_extractor = metadata_extractor

        # Determine input dimensions from encoders
        text_dim = 768   # Placeholder - get from encoder
        image_dim = 512  # Placeholder - get from encoder
        meta_dim = metadata_extractor.total_dimensions

        self.fusion_model = MultimodalItemEncoder(
            text_dim=text_dim,
            image_dim=image_dim,
            metadata_dim=meta_dim,
            output_dim=fusion_dim,
            fusion_method='gated'
        )

    def encode_item(
        self,
        item_text: str,
        item_image_path: str,
        item_metadata: dict
    ) -> torch.Tensor:
        """
        Encode a single item from all its modalities.
        """
        # Extract features from each modality
        text_features = self.text_encoder.encode(item_text)
        image_features = self.image_encoder.encode(item_image_path)
        meta_features = self.metadata_extractor.transform(item_metadata)

        # Convert to tensors
        text_tensor = torch.tensor(text_features).unsqueeze(0)
        image_tensor = torch.tensor(image_features).unsqueeze(0)
        meta_tensor = torch.tensor(meta_features, dtype=torch.float32).unsqueeze(0)

        # Fuse and return
        with torch.no_grad():
            return self.fusion_model(text_tensor, image_tensor, meta_tensor)
```

Modern foundation models like CLIP learn aligned text-image embeddings from web-scale data. A product's text description and image map to nearby points in the same embedding space. This enables zero-shot recommendations: find items whose images are similar to a text query, or vice versa—powerful for search and discovery.
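A minimal sketch of that idea using the Hugging Face transformers CLIP wrappers, assuming the package and the openai/clip-vit-base-patch32 checkpoint are available; the image path and query text are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("product_photo.jpg").convert("RGB")  # placeholder image path
query = "a mid-century green velvet sofa"               # free-text query

inputs = processor(text=[query], images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    text_emb = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])

# Cosine similarity between the text query and the product image
score = torch.nn.functional.cosine_similarity(text_emb, image_emb).item()
print(round(score, 3))
```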
How do we know if our item representations are good? Before deploying to production, we need principled ways to evaluate representation quality.
Intrinsic Evaluation (Representation Quality):
1. Similarity Coherence: Do items that humans consider similar have close embeddings? (A small evaluation sketch follows the table below.)
2. Clustering Quality: Do meaningful item categories emerge as clusters?
3. Nearest Neighbor Inspection: Manual qualitative evaluation of nearest neighbors
4. Dimensionality Analysis: Check how variance is distributed across embedding dimensions and watch for collapse, where most items crowd into a small region of the space.
| Evaluation Type | What It Measures | Methods | Good Signal |
|---|---|---|---|
| Similarity Coherence | Alignment with human similarity | Annotated pairs, rank correlation | High correlation (>0.7) |
| Clustering Quality | Category structure preservation | ARI, NMI vs known labels | Significant cluster separation |
| Retrieval Performance | Downstream recommendation quality | Recall@K, NDCG | Better than baselines |
| Coverage | Representation of all item types | Analyze embedding variance | Even distribution, no collapse |
| Interpretability | Human understanding of dimensions | Top items per dimension analysis | Meaningful semantic axes |
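A minimal sketch of the first two rows of the table above: rank correlation against human-annotated similarity pairs and Adjusted Rand Index against known category labels. The embeddings, annotated pairs, and labels below are synthetic placeholders.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
item_embeddings = rng.normal(size=(100, 64))    # placeholder item vectors
category_labels = rng.integers(0, 5, size=100)  # placeholder known categories

# --- Similarity coherence: compare model similarity with human judgments ---
# Each tuple: (item_i, item_j, human similarity score); placeholder annotations
annotated_pairs = [(0, 1, 0.9), (0, 50, 0.2), (3, 4, 0.7), (10, 90, 0.1)]
sims = cosine_similarity(item_embeddings)
model_scores = [sims[i, j] for i, j, _ in annotated_pairs]
human_scores = [h for _, _, h in annotated_pairs]
rho, _ = spearmanr(model_scores, human_scores)
print(f"Spearman rank correlation vs. human pairs: {rho:.3f}")

# --- Clustering quality: do known categories emerge as clusters? ---
cluster_ids = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(item_embeddings)
ari = adjusted_rand_score(category_labels, cluster_ids)
print(f"Adjusted Rand Index vs. known categories: {ari:.3f}")
```

With random placeholder vectors both scores hover near zero; on real embeddings, higher values indicate better alignment with human judgments and known categories.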
Extrinsic Evaluation (Downstream Performance):
Ultimately, representations should be evaluated by how well they support the recommendation task:
1. Content-Based Retrieval: Using only item representations, how well do nearest-neighbor recommendations recover held-out user interactions (Recall@K, NDCG)? A minimal Recall@K sketch follows this list.
2. Cold-Start Performance: How good are recommendations for brand-new items with no interaction history, where content features are the only available signal?
3. Transfer Performance: Do the representations stay useful when reused for related tasks or adjacent domains, or were they overfit to a single training objective?
4. A/B Testing: The final arbiter: do recommendations built on the new representations improve online engagement and satisfaction?
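A minimal sketch of the first check: Recall@K for content-based retrieval against a held-out set of interactions, where the user profile is simply the mean of the embeddings in the user's history. All vectors and splits below are synthetic placeholders.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def recall_at_k(history, held_out, item_embeddings, k=10):
    """Fraction of held-out items retrieved among the top-k neighbors of the
    user's profile (here: the mean of the history item embeddings)."""
    profile = item_embeddings[history].mean(axis=0, keepdims=True)
    scores = cosine_similarity(profile, item_embeddings)[0]
    scores[history] = -np.inf  # do not re-recommend already-seen items
    top_k = np.argsort(scores)[-k:]
    hits = len(set(top_k.tolist()) & set(held_out))
    return hits / len(held_out)

rng = np.random.default_rng(1)
item_embeddings = rng.normal(size=(500, 64))  # placeholder item vectors

# Placeholder split: items the user interacted with vs. items held out
history = [3, 17, 42, 99]
held_out = [5, 250, 301]

print(f"Recall@10: {recall_at_k(history, held_out, item_embeddings, k=10):.3f}")
```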
Common Failure Modes: representation collapse (every item looks similar to every other), a single dominant feature drowning out the rest, missing metadata silently mapped to zeros, and near-identical vectors for items users perceive as very different.
Strong offline metrics don't guarantee production success. A representation that excels at predicting held-out ratings might produce homogeneous recommendations that bore users. Always validate with A/B tests measuring actual user satisfaction and engagement.
Deploying item representations at scale introduces engineering challenges beyond model quality.
Representation Serving:
1. Storage: Item embeddings are typically precomputed and stored in a feature store or vector database; at hundreds of dimensions per item, memory footprint and refresh cost matter at catalog scale.
2. Similarity Search: Exhaustive comparison against millions of items is too slow for online serving, so nearest-neighbor indexes are used (see the sketch after this list).
3. Update Latency: New and edited items must be re-encoded and re-indexed; decide how much staleness is acceptable and whether updates run in batch or streaming.
4. Versioning: Embeddings are only comparable within a single model version, so the encoder, the stored vectors, and the search index must be versioned and rolled out together.
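A minimal sketch of item-to-item similarity serving with the faiss package (assuming faiss-cpu is installed); the catalog size and dimensionality are placeholders. IndexFlatIP is exact search; production systems typically switch to an approximate index once catalogs grow large.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 128  # embedding dimensionality (placeholder)
item_embeddings = np.random.rand(10_000, d).astype("float32")

# Normalize so that inner product equals cosine similarity
faiss.normalize_L2(item_embeddings)

index = faiss.IndexFlatIP(d)   # exact inner-product search
index.add(item_embeddings)     # index the whole catalog

query = item_embeddings[:1]    # "items similar to item 0"
scores, neighbor_ids = index.search(query, 10)
print(neighbor_ids[0], scores[0])
```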
We've explored the foundation of content-based recommendation: how to represent items mathematically such that algorithms can understand, compare, and recommend them.
What's Next:
With item representations established, we'll explore the complementary challenge: user profiles. How do we represent what users want? A user profile aggregates and abstracts individual interactions into a persistent model of user preferences that can be matched against item representations.
You now understand how to construct item representations for content-based recommendation systems. You can engineer features from structured metadata, extract representations from unstructured content, leverage learned embeddings, fuse multiple modalities, and reason about evaluation and production deployment. Next, we'll build the user side of the equation: user profiles.