Text is the richest source of item information in many domains—product descriptions, movie plots, article content, user reviews. But algorithms cannot process words directly; they require numerical representations.
The evolution from TF-IDF (Term Frequency-Inverse Document Frequency) to modern neural embeddings represents one of the most significant advances in NLP and recommendation systems. Understanding both approaches—their strengths, limitations, and when to use each—is essential for building effective content-based recommenders.
By the end of this page, you will master TF-IDF computation and its theoretical foundations, understand word embeddings (Word2Vec, GloVe) and document embeddings (Doc2Vec, BERT), and know how to choose the right representation strategy for your recommendation task.
Bag of Words (BoW) is the simplest text representation: a document is represented as a vector of word counts, ignoring word order.
Formal Definition:
Given vocabulary $V = \{w_1, w_2, ..., w_{|V|}\}$, document $d$ is represented as:
$$\text{BoW}(d) = [c(w_1, d), c(w_2, d), ..., c(w_{|V|}, d)]$$
Where $c(w, d)$ is the count of word $w$ in document $d$.
Example:
Vocabulary: ["action", "comedy", "drama", "thriller", "romance"]
Movie A description: "An action-packed thriller with intense drama" → BoW vector [1, 0, 1, 1, 0]
Movie B description: "A romantic comedy with comedic drama" → BoW vector [0, 1, 1, 0, 0]
Note that with exact token matching, "romantic" and "comedic" contribute nothing to the "romance" and "comedy" counts; only verbatim tokens are counted.
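As a quick check, here is a minimal sketch using scikit-learn's `CountVectorizer` (an assumed dependency here, not part of the original example) that reproduces these vectors with the fixed vocabulary above:

```python
from sklearn.feature_extraction.text import CountVectorizer

vocabulary = ["action", "comedy", "drama", "thriller", "romance"]
docs = [
    "An action-packed thriller with intense drama",  # Movie A
    "A romantic comedy with comedic drama",          # Movie B
]

# Fixing the vocabulary restricts counting to these five terms, in this order
vectorizer = CountVectorizer(vocabulary=vocabulary)
bow = vectorizer.fit_transform(docs).toarray()
print(bow)
# [[1 0 1 1 0]
#  [0 1 1 0 0]]
```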
Limitations of BoW:
- Word order and context are discarded ("dog bites man" and "man bites dog" get identical vectors).
- Vectors are sparse and high-dimensional: one dimension per vocabulary term.
- All words are treated as equally important, so frequent but uninformative words dominate.
- No notion of synonymy: "film" and "movie" occupy unrelated dimensions.
Raw term counts are misleading—common words like "the" appear frequently but carry no discriminative information. TF-IDF addresses this by weighting terms by their importance.
Term Frequency (TF):
How often does the term appear in this document?
$$\text{TF}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}$$
Or, with logarithmic dampening to reduce the impact of very frequent terms: $$\text{TF}(t, d) = \begin{cases} 1 + \log(f_{t,d}) & \text{if } f_{t,d} > 0 \\ 0 & \text{otherwise} \end{cases}$$
Inverse Document Frequency (IDF):
How rare is this term across all documents?
$$\text{IDF}(t, D) = \log\frac{|D|}{|\{d \in D : t \in d\}|}$$
Where $|D|$ is total documents and the denominator counts documents containing $t$.
TF-IDF Score:
$$\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)$$
Intuition:
A term gets a high TF-IDF score when it is frequent within a document but rare across the corpus; such terms are the most discriminative for that document. Terms that appear in nearly every document (like "the") receive an IDF near zero and are effectively ignored.
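As a small worked example (using the natural log and the count-ratio TF from above, with hypothetical numbers): suppose the corpus has $|D| = 100$ item descriptions and "thriller" appears in 10 of them, so $\text{IDF} = \log(100/10) \approx 2.30$. If "thriller" occurs 3 times in a 100-word description, $\text{TF} = 3/100 = 0.03$ and $\text{TF-IDF} \approx 0.03 \times 2.30 \approx 0.069$. By contrast, "the" appears in all 100 documents, so $\text{IDF} = \log(100/100) = 0$ and its TF-IDF is zero no matter how often it occurs.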
```python
import numpy as np
from collections import Counter
from typing import List, Dict
import re


class TFIDFVectorizer:
    """
    Production-quality TF-IDF implementation with customization options.
    """

    def __init__(
        self,
        max_features: int = 10000,
        min_df: int = 2,
        max_df: float = 0.95,
        sublinear_tf: bool = True,
        use_idf: bool = True,
        norm: str = 'l2'
    ):
        self.max_features = max_features
        self.min_df = min_df
        self.max_df = max_df
        self.sublinear_tf = sublinear_tf
        self.use_idf = use_idf
        self.norm = norm
        self.vocabulary_: Dict[str, int] = {}
        self.idf_: np.ndarray = None
        self.n_docs_ = 0

    def _tokenize(self, text: str) -> List[str]:
        """Simple tokenization - lowercase and extract words."""
        text = text.lower()
        words = re.findall(r'\b[a-z]{2,}\b', text)
        return words

    def fit(self, documents: List[str]) -> 'TFIDFVectorizer':
        """Learn vocabulary and IDF weights from corpus."""
        self.n_docs_ = len(documents)

        # Count document frequency for each term
        doc_freq = Counter()
        term_counts = Counter()
        for doc in documents:
            tokens = set(self._tokenize(doc))
            for token in tokens:
                doc_freq[token] += 1
            term_counts.update(self._tokenize(doc))

        # Filter by document frequency
        max_doc_count = int(self.max_df * self.n_docs_)
        valid_terms = [
            term for term, freq in doc_freq.items()
            if self.min_df <= freq <= max_doc_count
        ]

        # Keep top max_features by total frequency
        valid_terms = sorted(
            valid_terms,
            key=lambda t: term_counts[t],
            reverse=True
        )[:self.max_features]

        self.vocabulary_ = {term: i for i, term in enumerate(valid_terms)}

        # Compute IDF
        if self.use_idf:
            self.idf_ = np.zeros(len(self.vocabulary_))
            for term, idx in self.vocabulary_.items():
                df = doc_freq[term]
                self.idf_[idx] = np.log(self.n_docs_ / df) + 1
        else:
            self.idf_ = np.ones(len(self.vocabulary_))

        return self

    def transform(self, documents: List[str]) -> np.ndarray:
        """Transform documents to TF-IDF matrix."""
        n_docs = len(documents)
        n_features = len(self.vocabulary_)
        matrix = np.zeros((n_docs, n_features))

        for i, doc in enumerate(documents):
            tokens = self._tokenize(doc)
            term_counts = Counter(tokens)

            for term, count in term_counts.items():
                if term in self.vocabulary_:
                    idx = self.vocabulary_[term]

                    # Term frequency
                    if self.sublinear_tf:
                        tf = 1 + np.log(count) if count > 0 else 0
                    else:
                        tf = count / len(tokens)

                    # TF-IDF
                    matrix[i, idx] = tf * self.idf_[idx]

            # Normalize
            if self.norm == 'l2':
                norm = np.linalg.norm(matrix[i])
                if norm > 0:
                    matrix[i] /= norm

        return matrix

    def fit_transform(self, documents: List[str]) -> np.ndarray:
        """Fit and transform in one call."""
        return self.fit(documents).transform(documents)


# Example usage
if __name__ == "__main__":
    docs = [
        "Sci-fi thriller about dreams within dreams",
        "Romantic comedy set in New York City",
        "Action-packed superhero adventure film",
        "Psychological thriller with twist ending",
        "Romantic drama about star-crossed lovers"
    ]

    vectorizer = TFIDFVectorizer(max_features=100, min_df=1)
    tfidf_matrix = vectorizer.fit_transform(docs)

    print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")
    print(f"Matrix shape: {tfidf_matrix.shape}")

    # Cosine similarity between documents
    similarities = tfidf_matrix @ tfidf_matrix.T
    print("\nDocument similarities:")
    print(similarities.round(3))
```

In content-based filtering, TF-IDF serves multiple purposes:
1. Item Representation: Items become TF-IDF vectors from their text content: $$\phi_{\text{tfidf}}(i) = [\text{TF-IDF}(t_1, d_i), ..., \text{TF-IDF}(t_{|V|}, d_i)]$$
2. User Profile Construction: Aggregate TF-IDF vectors of items user has engaged with: $$\psi_{\text{tfidf}}(u) = \text{normalize}\left(\sum_{i \in H_u} r_{ui} \cdot \phi_{\text{tfidf}}(i)\right)$$
3. Recommendation Scoring: Cosine similarity between user profile and item vectors: $$\text{score}(u, i) = \cos(\psi(u), \phi(i))$$
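Here is a minimal sketch that ties these three steps together, reusing the `TFIDFVectorizer` above (the helper names `build_user_profile` and `score_items` are illustrative, not a standard API):

```python
import numpy as np

def build_user_profile(item_vectors: np.ndarray, ratings: np.ndarray) -> np.ndarray:
    """Rating-weighted sum of the user's item TF-IDF vectors, L2-normalized."""
    profile = ratings @ item_vectors          # (n_hist,) @ (n_hist, n_terms) -> (n_terms,)
    norm = np.linalg.norm(profile)
    return profile / norm if norm > 0 else profile

def score_items(user_profile: np.ndarray, item_vectors: np.ndarray) -> np.ndarray:
    """Cosine scores; item rows are already L2-normalized by the vectorizer."""
    return item_vectors @ user_profile

# Sketched usage:
# tfidf_matrix = vectorizer.fit_transform(item_descriptions)   # rows = phi_tfidf(i)
# profile = build_user_profile(tfidf_matrix[history_idx], ratings)
# scores = score_items(profile, tfidf_matrix)                  # rank items by score
```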
Advantages for RecSys:
- No training required: a new item is representable as soon as it has text, so there is no item cold-start problem.
- Interpretable: you can inspect exactly which terms drive a recommendation.
- Cheap to compute and store (as sparse vectors), even for large catalogs.
- Exact keyword matches (names, niche genres, technical terms) are preserved.
Limitations:
- No semantics: synonyms and paraphrases ("film" vs. "movie") land on unrelated dimensions.
- Vocabulary mismatch between item descriptions and user language hurts matching.
- Vectors are high-dimensional and sparse, and word order and phrases are lost unless n-grams are added.
For recommendations: (1) Include item titles with a higher weight than descriptions, (2) Extract key phrases as additional terms, (3) Consider n-grams (bigrams, trigrams) to capture phrases like 'machine learning', (4) Filter domain-specific stopwords (e.g., 'movie' for film recommendations).
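A short sketch of points (3) and (4) using scikit-learn's `TfidfVectorizer` (the extra domain stopwords are illustrative for a film recommender):

```python
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

# Illustrative domain stopwords for a film recommender
domain_stopwords = list(ENGLISH_STOP_WORDS.union({"movie", "film"}))

vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),           # unigrams + bigrams, e.g. "machine learning"
    stop_words=domain_stopwords,  # English stopwords plus domain-specific ones
    sublinear_tf=True,            # 1 + log(tf) dampening
    min_df=2,
    max_df=0.95,
)
# item_matrix = vectorizer.fit_transform(item_descriptions)  # sparse (n_items, n_terms)
```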
Word embeddings learn dense, low-dimensional vectors where semantically similar words have similar representations.
The Key Insight:
"You shall know a word by the company it keeps" — J.R. Firth
Words appearing in similar contexts have similar meanings. Word embeddings learn to predict context, and similar predictions yield similar vectors.
Word2Vec Architectures:
Skip-gram: Predict context words given center word: $$P(w_{context} | w_{center}) = \frac{\exp(v_{context} \cdot v_{center})}{\sum_{w} \exp(v_w \cdot v_{center})}$$
CBOW (Continuous Bag of Words): Predict center word given context: $$P(w_{center} | w_{context_1}, ..., w_{context_k})$$
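In gensim, the two architectures are selected with the `sg` flag; a minimal sketch on a toy tokenized corpus (the sentences are illustrative):

```python
from gensim.models import Word2Vec

sentences = [
    ["action", "packed", "thriller", "with", "intense", "drama"],
    ["romantic", "comedy", "with", "comedic", "drama"],
    ["psychological", "thriller", "with", "twist", "ending"],
]

# sg=1 -> skip-gram, sg=0 (the default) -> CBOW
skipgram = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)
cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)

print(skipgram.wv["thriller"].shape)  # (100,)
```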
Properties of Word Embeddings:
- Dense and low-dimensional: typically 100-300 dimensions instead of a vocabulary-sized sparse vector.
- Semantically similar words are close in the vector space ("film" and "movie" have high cosine similarity).
- Vector arithmetic captures analogies: king - man + woman ≈ queen.
- Pretrained vectors (Word2Vec, GloVe trained on large corpora) transfer to new tasks without labeled data.
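A small sketch of these properties with gensim's `KeyedVectors` (the file path is a placeholder for any pre-trained vectors in word2vec text format, e.g. converted GloVe):

```python
from gensim.models import KeyedVectors

# Placeholder path: pre-trained vectors in word2vec text format
wv = KeyedVectors.load_word2vec_format("pretrained_vectors.txt", binary=False)

# Analogy via vector arithmetic: king - man + woman ~ queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Nearest neighbors cluster semantically similar words
print(wv.most_similar("thriller", topn=5))
```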
```python
import numpy as np
from typing import List, Dict, Optional
from gensim.models import Word2Vec, KeyedVectors


class WordEmbeddingDocumentEncoder:
    """
    Encode documents using pre-trained word embeddings.
    Aggregates word vectors into document vectors using various strategies.
    """

    def __init__(
        self,
        embedding_model: KeyedVectors,
        aggregation: str = 'mean',  # 'mean', 'tfidf_weighted', 'sif'
        tfidf_weights: Optional[Dict[str, float]] = None
    ):
        self.embedding_model = embedding_model
        self.aggregation = aggregation
        self.tfidf_weights = tfidf_weights or {}
        self.embedding_dim = embedding_model.vector_size

    def encode_document(self, tokens: List[str]) -> np.ndarray:
        """Encode a document (list of tokens) into a single vector."""
        # Get embeddings for words in vocabulary
        valid_embeddings = []
        weights = []
        for token in tokens:
            if token in self.embedding_model:
                valid_embeddings.append(self.embedding_model[token])
                if self.aggregation == 'tfidf_weighted':
                    weights.append(self.tfidf_weights.get(token, 1.0))
                else:
                    weights.append(1.0)

        if not valid_embeddings:
            return np.zeros(self.embedding_dim)

        embeddings = np.array(valid_embeddings)
        weights = np.array(weights)

        if self.aggregation == 'mean':
            doc_vector = embeddings.mean(axis=0)
        elif self.aggregation == 'tfidf_weighted':
            doc_vector = np.average(embeddings, axis=0, weights=weights)
        elif self.aggregation == 'sif':
            # Smooth Inverse Frequency weighting
            doc_vector = self._sif_embedding(embeddings)
        else:
            doc_vector = embeddings.mean(axis=0)

        # L2 normalize
        norm = np.linalg.norm(doc_vector)
        return doc_vector / norm if norm > 0 else doc_vector

    def _sif_embedding(
        self,
        embeddings: np.ndarray,
        a: float = 0.001
    ) -> np.ndarray:
        """
        Smooth Inverse Frequency (SIF) weighting, simplified.
        Uses uniform word frequencies; the full method weights by corpus
        frequencies and also removes the first principal component.
        """
        # One weight per in-vocabulary word vector: a / (a + p(w)), with p(w) fixed at 0.01
        weights = np.array([a / (a + 0.01) for _ in range(len(embeddings))])
        # Weighted average
        return np.average(embeddings, axis=0, weights=weights)

    def similarity(self, doc1_tokens: List[str], doc2_tokens: List[str]) -> float:
        """Compute cosine similarity between two documents."""
        vec1 = self.encode_document(doc1_tokens)
        vec2 = self.encode_document(doc2_tokens)
        return np.dot(vec1, vec2)


# Using pre-trained embeddings
def load_pretrained_embeddings():
    """Load pre-trained embeddings (GloVe or Word2Vec)."""
    # Example: Load GloVe embeddings
    # embeddings = KeyedVectors.load_word2vec_format(
    #     'glove.6B.300d.txt', binary=False
    # )

    # For demonstration, train on sample data
    sentences = [
        ["action", "thriller", "exciting", "adventure"],
        ["romantic", "comedy", "funny", "love"],
        ["drama", "emotional", "powerful", "moving"],
        ["horror", "scary", "thriller", "suspense"],
    ]
    model = Word2Vec(sentences, vector_size=50, window=3, min_count=1)
    return model.wv
```

Averaging word embeddings loses information about word order and document-level semantics. Document embeddings directly encode entire passages.
Doc2Vec (Paragraph Vectors):
Extends Word2Vec by adding a "document ID" as additional context:
- PV-DM (Distributed Memory): the document vector is combined with context-word vectors to predict the center word.
- PV-DBOW (Distributed Bag of Words): the document vector alone predicts words sampled from the document.
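A minimal gensim sketch, assuming item descriptions are already tokenized and tagged with their (illustrative) item IDs:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    TaggedDocument(words=["action", "packed", "thriller"], tags=["item_1"]),
    TaggedDocument(words=["romantic", "comedy", "in", "new", "york"], tags=["item_2"]),
]

# dm=1 -> PV-DM (doc vector + context words predict the center word)
# dm=0 -> PV-DBOW (doc vector alone predicts sampled words)
model = Doc2Vec(corpus, vector_size=100, window=5, min_count=1, epochs=40, dm=1)

item_vec = model.dv["item_1"]                                    # learned item vector
new_vec = model.infer_vector(["suspense", "thriller", "twist"])  # embed unseen text
```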
Sentence-BERT (SBERT):
Fine-tunes BERT to produce semantically meaningful sentence embeddings: $$\text{emb}(s) = \text{pool}(\text{BERT}(s))$$ where pooling uses either the $[CLS]$ token or the mean of the token embeddings (mean pooling is the usual default in Sentence-BERT models).
Siamese architecture trained on similarity/NLI datasets.
Universal Sentence Encoder:
Google's model specifically designed for semantic similarity tasks. Available in transformer and DAN (deep averaging network) variants.
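A short sketch of loading it from TensorFlow Hub (assumes `tensorflow` and `tensorflow_hub` are installed; the `/4` model is the DAN variant):

```python
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
embeddings = embed([
    "Sci-fi thriller about dreams within dreams",
    "Psychological thriller with twist ending",
])
print(embeddings.shape)  # (2, 512)
```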
Comparison:
| Method | Dimension | Speed | Quality | Training |
|---|---|---|---|---|
| TF-IDF | 10K-100K | Fast | Lexical | None |
| Word2Vec avg | 100-300 | Fast | Moderate | Pretrained |
| Doc2Vec | 50-400 | Medium | Good | On corpus |
| SBERT | 384-768 | Slow | Excellent | Pretrained |
```python
from sentence_transformers import SentenceTransformer
import numpy as np
from typing import List


class SemanticDocumentEncoder:
    """
    Encode documents using transformer-based models.
    Uses Sentence-BERT for high-quality semantic embeddings.
    """

    def __init__(self, model_name: str = 'all-MiniLM-L6-v2'):
        """
        Initialize with a sentence-transformers model.

        Popular options:
        - 'all-MiniLM-L6-v2': Fast, 384 dims, good quality
        - 'all-mpnet-base-v2': Slower, 768 dims, best quality
        - 'paraphrase-multilingual-MiniLM-L12-v2': Multilingual
        """
        self.model = SentenceTransformer(model_name)
        self.embedding_dim = self.model.get_sentence_embedding_dimension()

    def encode(self, texts: List[str], batch_size: int = 32) -> np.ndarray:
        """
        Encode texts into semantic embeddings.

        Args:
            texts: List of text strings
            batch_size: Batch size for encoding

        Returns:
            Matrix of shape (n_texts, embedding_dim)
        """
        embeddings = self.model.encode(
            texts,
            batch_size=batch_size,
            normalize_embeddings=True,
            show_progress_bar=False
        )
        return embeddings

    def similarity_matrix(self, texts: List[str]) -> np.ndarray:
        """Compute pairwise similarity matrix."""
        embeddings = self.encode(texts)
        return embeddings @ embeddings.T


class HybridDocumentEncoder:
    """
    Combines TF-IDF (lexical) with embeddings (semantic).
    """

    def __init__(
        self,
        tfidf_vectorizer,
        semantic_encoder: SemanticDocumentEncoder,
        tfidf_weight: float = 0.3
    ):
        self.tfidf_vectorizer = tfidf_vectorizer
        self.semantic_encoder = semantic_encoder
        self.tfidf_weight = tfidf_weight

    def encode(self, texts: List[str]) -> np.ndarray:
        """Get hybrid TF-IDF + semantic embeddings."""
        # TF-IDF features
        tfidf_features = self.tfidf_vectorizer.transform(texts)
        if hasattr(tfidf_features, 'toarray'):
            tfidf_features = tfidf_features.toarray()
        # Reduce TF-IDF dimensionality if needed
        # (Use SVD/PCA in practice)

        # Semantic embeddings
        semantic_features = self.semantic_encoder.encode(texts)

        # Concatenate (or learn fusion)
        combined = np.hstack([
            self.tfidf_weight * tfidf_features,
            (1 - self.tfidf_weight) * semantic_features
        ])

        # Normalize
        norms = np.linalg.norm(combined, axis=1, keepdims=True)
        return combined / np.maximum(norms, 1e-10)
```

For new projects, start with Sentence-BERT (SBERT). Models like 'all-MiniLM-L6-v2' offer excellent quality at reasonable speed. They capture semantic similarity that TF-IDF misses (synonyms, paraphrases) while being pretrained on massive data.
When to Use TF-IDF:
- Interpretability and easy debugging matter.
- The domain hinges on exact, rare, or technical vocabulary (product codes, niche genres, proper names).
- You need a fast, cheap baseline over long documents or have limited compute.
When to Use Neural Embeddings:
- Synonyms and paraphrases matter: short texts or items described with very different wording.
- You need semantic (or cross-lingual) matching rather than keyword overlap.
- A pretrained model fits your domain, or you have data to fine-tune one.
Hybrid Approaches:
Combine both to get the best of both worlds:
| Criterion | TF-IDF | Embeddings | Hybrid |
|---|---|---|---|
| Synonym handling | ❌ Poor | ✅ Excellent | ✅ Good |
| Rare terms | ✅ Preserved | ❌ OOV issues | ✅ Best |
| Interpretability | ✅ High | ❌ Black box | ⚠️ Partial |
| Speed | ✅ Fast | ⚠️ Medium | ⚠️ Medium |
| Cold-start | ✅ Works | ✅ Works | ✅ Works |
| Domain adaptation | ✅ Easy | ⚠️ May need fine-tune | ✅ Flexible |
Preprocessing Pipeline:
- Clean the raw text first: strip markup, normalize whitespace and casing, remove boilerplate.
- For TF-IDF: tokenize, remove (domain-specific) stopwords, optionally stem or lemmatize, and add n-grams.
- For transformer models: pass lightly cleaned raw text and let the model's own subword tokenizer handle the rest.
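A minimal sketch of such a pipeline (the helper names are illustrative):

```python
import re
from typing import List

def clean_text(text: str) -> str:
    """Light cleaning shared by both representations."""
    text = re.sub(r"<[^>]+>", " ", text)       # strip leftover HTML tags
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    return text

def tokenize_for_tfidf(text: str) -> List[str]:
    """Aggressive normalization only on the TF-IDF path."""
    return re.findall(r"\b[a-z]{2,}\b", clean_text(text).lower())

# Transformer models receive clean_text(text) directly and apply
# their own subword tokenization internally.
```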
Scaling Considerations:
- Store TF-IDF matrices in sparse format; dense storage explodes at 10K-100K dimensions.
- Precompute and cache item embeddings, and encode new items in batches (on GPU for transformer models).
- Use a nearest-neighbor index for retrieval over large catalogs instead of scoring every item per request.
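For the retrieval side, a minimal sketch with FAISS (an assumed dependency), using inner product over L2-normalized vectors as cosine similarity; the random matrix is a stand-in for real item embeddings:

```python
import numpy as np
import faiss

d = 384                                                       # e.g. all-MiniLM-L6-v2 dimension
item_embeddings = np.random.rand(10000, d).astype("float32")  # stand-in for real item vectors
faiss.normalize_L2(item_embeddings)                           # in-place L2 normalization

index = faiss.IndexFlatIP(d)   # inner product == cosine on normalized vectors
index.add(item_embeddings)

query = item_embeddings[:1]                 # stand-in for a user-profile vector
scores, item_ids = index.search(query, 10)  # top-10 most similar items
```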
If you update embedding models, old item/user embeddings become incompatible. Plan for full re-encoding or version management to handle model updates gracefully.
What's Next:
With text representations mastered, we'll explore hybrid approaches that combine content-based methods with collaborative filtering, creating systems that leverage both the content of items and the wisdom of the crowd.
You now command the full spectrum of text representation techniques for recommendation systems—from classical TF-IDF to cutting-edge neural embeddings—and understand when to apply each approach.