Text is the richest source of item information in many domains—product descriptions, movie plots, article content, user reviews. But algorithms cannot process words directly; they require numerical representations.
The evolution from TF-IDF (Term Frequency-Inverse Document Frequency) to modern neural embeddings represents one of the most significant advances in NLP and recommendation systems. Understanding both approaches—their strengths, limitations, and when to use each—is essential for building effective content-based recommenders.
By the end of this page, you will master TF-IDF computation and its theoretical foundations, understand word embeddings (Word2Vec, GloVe) and document embeddings (Doc2Vec, BERT), and know how to choose the right representation strategy for your recommendation task.
Bag of Words (BoW) is the simplest text representation: a document is represented as a vector of word counts, ignoring word order.
Formal Definition:
Given vocabulary $V = \{w_1, w_2, ..., w_{|V|}\}$, document $d$ is represented as:
$$\text{BoW}(d) = [c(w_1, d), c(w_2, d), ..., c(w_{|V|}, d)]$$
Where $c(w, d)$ is the count of word $w$ in document $d$.
Example:
Vocabulary: ["action", "comedy", "drama", "thriller", "romance"]
Movie A description: "An action-packed thriller with intense drama" → BoW vector [1, 0, 1, 1, 0]
Movie B description: "A romantic comedy with comedic drama" → BoW vector [0, 1, 1, 0, 0]
Note that with exact token matching, "romantic" and "comedic" contribute nothing to the "romance" and "comedy" counts; only verbatim tokens are counted.
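As a quick check, here is a minimal sketch using scikit-learn's `CountVectorizer` (an assumed dependency here, not part of the original example) that reproduces these vectors with the fixed vocabulary above:

```python
from sklearn.feature_extraction.text import CountVectorizer

vocabulary = ["action", "comedy", "drama", "thriller", "romance"]
docs = [
    "An action-packed thriller with intense drama",  # Movie A
    "A romantic comedy with comedic drama",          # Movie B
]

# Fixing the vocabulary restricts counting to these five terms, in this order
vectorizer = CountVectorizer(vocabulary=vocabulary)
bow = vectorizer.fit_transform(docs).toarray()
print(bow)
# [[1 0 1 1 0]
#  [0 1 1 0 0]]
```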
Limitations of BoW:
- Word order and context are discarded ("dog bites man" and "man bites dog" get identical vectors).
- Vectors are sparse and high-dimensional: one dimension per vocabulary term.
- All words are treated as equally important, so frequent but uninformative words dominate.
- No notion of synonymy: "film" and "movie" occupy unrelated dimensions.
Raw term counts are misleading—common words like "the" appear frequently but carry no discriminative information. TF-IDF addresses this by weighting terms by their importance.
Term Frequency (TF):
How often does the term appear in this document?
$$\text{TF}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}$$
Or, with logarithmic dampening to reduce the impact of very frequent terms: $$\text{TF}(t, d) = \begin{cases} 1 + \log(f_{t,d}) & \text{if } f_{t,d} > 0 \\ 0 & \text{otherwise} \end{cases}$$
Inverse Document Frequency (IDF):
How rare is this term across all documents?
$$\text{IDF}(t, D) = \log\frac{|D|}{|\{d \in D : t \in d\}|}$$
Where $|D|$ is total documents and the denominator counts documents containing $t$.
TF-IDF Score:
$$\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)$$
Intuition:
A term gets a high TF-IDF score when it is frequent within a document but rare across the corpus; such terms are the most discriminative for that document. Terms that appear in nearly every document (like "the") receive an IDF near zero and are effectively ignored.
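As a small worked example (using the natural log and the count-ratio TF from above, with hypothetical numbers): suppose the corpus has $|D| = 100$ item descriptions and "thriller" appears in 10 of them, so $\text{IDF} = \log(100/10) \approx 2.30$. If "thriller" occurs 3 times in a 100-word description, $\text{TF} = 3/100 = 0.03$ and $\text{TF-IDF} \approx 0.03 \times 2.30 \approx 0.069$. By contrast, "the" appears in all 100 documents, so $\text{IDF} = \log(100/100) = 0$ and its TF-IDF is zero no matter how often it occurs.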
```python
import numpy as np
from collections import Counter
from typing import List, Dict
import re


class TFIDFVectorizer:
    """
    Production-quality TF-IDF implementation with customization options.
    """

    def __init__(
        self,
        max_features: int = 10000,
        min_df: int = 2,
        max_df: float = 0.95,
        sublinear_tf: bool = True,
        use_idf: bool = True,
        norm: str = 'l2'
    ):
        self.max_features = max_features
        self.min_df = min_df
        self.max_df = max_df
        self.sublinear_tf = sublinear_tf
        self.use_idf = use_idf
        self.norm = norm
        self.vocabulary_: Dict[str, int] = {}
        self.idf_: np.ndarray = None
        self.n_docs_ = 0

    def _tokenize(self, text: str) -> List[str]:
        """Simple tokenization - lowercase and extract words."""
        text = text.lower()
        words = re.findall(r'\b[a-z]{2,}\b', text)
        return words

    def fit(self, documents: List[str]) -> 'TFIDFVectorizer':
        """Learn vocabulary and IDF weights from corpus."""
        self.n_docs_ = len(documents)

        # Count document frequency for each term
        doc_freq = Counter()
        term_counts = Counter()
        for doc in documents:
            tokens = set(self._tokenize(doc))
            for token in tokens:
                doc_freq[token] += 1
            term_counts.update(self._tokenize(doc))

        # Filter by document frequency
        max_doc_count = int(self.max_df * self.n_docs_)
        valid_terms = [
            term for term, freq in doc_freq.items()
            if self.min_df <= freq <= max_doc_count
        ]

        # Keep top max_features by total frequency
        valid_terms = sorted(
            valid_terms,
            key=lambda t: term_counts[t],
            reverse=True
        )[:self.max_features]

        self.vocabulary_ = {term: i for i, term in enumerate(valid_terms)}

        # Compute IDF
        if self.use_idf:
            self.idf_ = np.zeros(len(self.vocabulary_))
            for term, idx in self.vocabulary_.items():
                df = doc_freq[term]
                self.idf_[idx] = np.log(self.n_docs_ / df) + 1
        else:
            self.idf_ = np.ones(len(self.vocabulary_))

        return self

    def transform(self, documents: List[str]) -> np.ndarray:
        """Transform documents to TF-IDF matrix."""
        n_docs = len(documents)
        n_features = len(self.vocabulary_)
        matrix = np.zeros((n_docs, n_features))

        for i, doc in enumerate(documents):
            tokens = self._tokenize(doc)
            term_counts = Counter(tokens)

            for term, count in term_counts.items():
                if term in self.vocabulary_:
                    idx = self.vocabulary_[term]

                    # Term frequency
                    if self.sublinear_tf:
                        tf = 1 + np.log(count) if count > 0 else 0
                    else:
                        tf = count / len(tokens)

                    # TF-IDF
                    matrix[i, idx] = tf * self.idf_[idx]

            # Normalize
            if self.norm == 'l2':
                norm = np.linalg.norm(matrix[i])
                if norm > 0:
                    matrix[i] /= norm

        return matrix

    def fit_transform(self, documents: List[str]) -> np.ndarray:
        """Fit and transform in one call."""
        return self.fit(documents).transform(documents)


# Example usage
if __name__ == "__main__":
    docs = [
        "Sci-fi thriller about dreams within dreams",
        "Romantic comedy set in New York City",
        "Action-packed superhero adventure film",
        "Psychological thriller with twist ending",
        "Romantic drama about star-crossed lovers"
    ]

    vectorizer = TFIDFVectorizer(max_features=100, min_df=1)
    tfidf_matrix = vectorizer.fit_transform(docs)

    print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")
    print(f"Matrix shape: {tfidf_matrix.shape}")

    # Cosine similarity between documents
    similarities = tfidf_matrix @ tfidf_matrix.T
    print("\nDocument similarities:")
    print(similarities.round(3))
```

In content-based filtering, TF-IDF serves multiple purposes:
1. Item Representation: Items become TF-IDF vectors from their text content: $$\phi_{\text{tfidf}}(i) = [\text{TF-IDF}(t_1, d_i), ..., \text{TF-IDF}(t_{|V|}, d_i)]$$
2. User Profile Construction: Aggregate TF-IDF vectors of items user has engaged with: $$\psi_{\text{tfidf}}(u) = \text{normalize}\left(\sum_{i \in H_u} r_{ui} \cdot \phi_{\text{tfidf}}(i)\right)$$
3. Recommendation Scoring: Cosine similarity between user profile and item vectors: $$\text{score}(u, i) = \cos(\psi(u), \phi(i))$$
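Here is a minimal sketch that ties these three steps together, reusing the `TFIDFVectorizer` above (the helper names `build_user_profile` and `score_items` are illustrative, not a standard API):

```python
import numpy as np

def build_user_profile(item_vectors: np.ndarray, ratings: np.ndarray) -> np.ndarray:
    """Rating-weighted sum of the user's item TF-IDF vectors, L2-normalized."""
    profile = ratings @ item_vectors          # (n_hist,) @ (n_hist, n_terms) -> (n_terms,)
    norm = np.linalg.norm(profile)
    return profile / norm if norm > 0 else profile

def score_items(user_profile: np.ndarray, item_vectors: np.ndarray) -> np.ndarray:
    """Cosine scores; item rows are already L2-normalized by the vectorizer."""
    return item_vectors @ user_profile

# Sketched usage:
# tfidf_matrix = vectorizer.fit_transform(item_descriptions)   # rows = phi_tfidf(i)
# profile = build_user_profile(tfidf_matrix[history_idx], ratings)
# scores = score_items(profile, tfidf_matrix)                  # rank items by score
```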
Advantages for RecSys:
- No training required: a new item is representable as soon as it has text, so there is no item cold-start problem.
- Interpretable: you can inspect exactly which terms drive a recommendation.
- Cheap to compute and store (as sparse vectors), even for large catalogs.
- Exact keyword matches (names, niche genres, technical terms) are preserved.
Limitations:
- No semantics: synonyms and paraphrases ("film" vs. "movie") land on unrelated dimensions.
- Vocabulary mismatch between item descriptions and user language hurts matching.
- Vectors are high-dimensional and sparse, and word order and phrases are lost unless n-grams are added.
For recommendations: (1) Include item titles with a higher weight than descriptions, (2) Extract key phrases as additional terms, (3) Consider n-grams (bigrams, trigrams) to capture phrases like 'machine learning', (4) Filter domain-specific stopwords (e.g., 'movie' for film recommendations).
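A short sketch of points (3) and (4) using scikit-learn's `TfidfVectorizer` (the extra domain stopwords are illustrative for a film recommender):

```python
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

# Illustrative domain stopwords for a film recommender
domain_stopwords = list(ENGLISH_STOP_WORDS.union({"movie", "film"}))

vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),           # unigrams + bigrams, e.g. "machine learning"
    stop_words=domain_stopwords,  # English stopwords plus domain-specific ones
    sublinear_tf=True,            # 1 + log(tf) dampening
    min_df=2,
    max_df=0.95,
)
# item_matrix = vectorizer.fit_transform(item_descriptions)  # sparse (n_items, n_terms)
```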
Word embeddings learn dense, low-dimensional vectors where semantically similar words have similar representations.
The Key Insight:
"You shall know a word by the company it keeps" — J.R. Firth
Words appearing in similar contexts have similar meanings. Word embeddings learn to predict context, and similar predictions yield similar vectors.
Word2Vec Architectures:
Skip-gram: Predict context words given center word: $$P(w_{context} | w_{center}) = \frac{\exp(v_{context} \cdot v_{center})}{\sum_{w} \exp(v_w \cdot v_{center})}$$
CBOW (Continuous Bag of Words): Predict center word given context: $$P(w_{center} | w_{context_1}, ..., w_{context_k})$$
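In gensim, the two architectures are selected with the `sg` flag; a minimal sketch on a toy tokenized corpus (the sentences are illustrative):

```python
from gensim.models import Word2Vec

sentences = [
    ["action", "packed", "thriller", "with", "intense", "drama"],
    ["romantic", "comedy", "with", "comedic", "drama"],
    ["psychological", "thriller", "with", "twist", "ending"],
]

# sg=1 -> skip-gram, sg=0 (the default) -> CBOW
skipgram = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)
cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)

print(skipgram.wv["thriller"].shape)  # (100,)
```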
Properties of Word Embeddings:
- Dense and low-dimensional: typically 100-300 dimensions instead of a vocabulary-sized sparse vector.
- Semantically similar words are close in the vector space ("film" and "movie" have high cosine similarity).
- Vector arithmetic captures analogies: king - man + woman ≈ queen.
- Pretrained vectors (Word2Vec, GloVe trained on large corpora) transfer to new tasks without labeled data.
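A small sketch of these properties with gensim's `KeyedVectors` (the file path is a placeholder for any pre-trained vectors in word2vec text format, e.g. converted GloVe):

```python
from gensim.models import KeyedVectors

# Placeholder path: pre-trained vectors in word2vec text format
wv = KeyedVectors.load_word2vec_format("pretrained_vectors.txt", binary=False)

# Analogy via vector arithmetic: king - man + woman ~ queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Nearest neighbors cluster semantically similar words
print(wv.most_similar("thriller", topn=5))
```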
```python
import numpy as np
from typing import List, Dict, Optional
from gensim.models import Word2Vec, KeyedVectors


class WordEmbeddingDocumentEncoder:
    """
    Encode documents using pre-trained word embeddings.
    Aggregates word vectors into document vectors using various strategies.
    """

    def __init__(
        self,
        embedding_model: KeyedVectors,
        aggregation: str = 'mean',  # 'mean', 'tfidf_weighted', 'sif'
        tfidf_weights: Optional[Dict[str, float]] = None
    ):
        self.embedding_model = embedding_model
        self.aggregation = aggregation
        self.tfidf_weights = tfidf_weights or {}
        self.embedding_dim = embedding_model.vector_size

    def encode_document(self, tokens: List[str]) -> np.ndarray:
        """Encode a document (list of tokens) into a single vector."""
        # Get embeddings for words in vocabulary
        valid_embeddings = []
        weights = []
        for token in tokens:
            if token in self.embedding_model:
                valid_embeddings.append(self.embedding_model[token])
                if self.aggregation == 'tfidf_weighted':
                    weights.append(self.tfidf_weights.get(token, 1.0))
                else:
                    weights.append(1.0)

        if not valid_embeddings:
            return np.zeros(self.embedding_dim)

        embeddings = np.array(valid_embeddings)
        weights = np.array(weights)

        if self.aggregation == 'mean':
            doc_vector = embeddings.mean(axis=0)
        elif self.aggregation == 'tfidf_weighted':
            doc_vector = np.average(embeddings, axis=0, weights=weights)
        elif self.aggregation == 'sif':
            # Smooth Inverse Frequency weighting
            doc_vector = self._sif_embedding(embeddings)
        else:
            doc_vector = embeddings.mean(axis=0)

        # L2 normalize
        norm = np.linalg.norm(doc_vector)
        return doc_vector / norm if norm > 0 else doc_vector

    def _sif_embedding(
        self,
        embeddings: np.ndarray,
        a: float = 0.001
    ) -> np.ndarray:
        """
        Smooth Inverse Frequency (SIF) weighting, simplified.
        Uses uniform word frequencies; the full method weights by corpus
        frequencies and also removes the first principal component.
        """
        # One weight per in-vocabulary word vector: a / (a + p(w)), with p(w) fixed at 0.01
        weights = np.array([a / (a + 0.01) for _ in range(len(embeddings))])
        # Weighted average
        return np.average(embeddings, axis=0, weights=weights)

    def similarity(self, doc1_tokens: List[str], doc2_tokens: List[str]) -> float:
        """Compute cosine similarity between two documents."""
        vec1 = self.encode_document(doc1_tokens)
        vec2 = self.encode_document(doc2_tokens)
        return np.dot(vec1, vec2)


# Using pre-trained embeddings
def load_pretrained_embeddings():
    """Load pre-trained embeddings (GloVe or Word2Vec)."""
    # Example: Load GloVe embeddings
    # embeddings = KeyedVectors.load_word2vec_format(
    #     'glove.6B.300d.txt', binary=False
    # )

    # For demonstration, train on sample data
    sentences = [
        ["action", "thriller", "exciting", "adventure"],
        ["romantic", "comedy", "funny", "love"],
        ["drama", "emotional", "powerful", "moving"],
        ["horror", "scary", "thriller", "suspense"],
    ]
    model = Word2Vec(sentences, vector_size=50, window=3, min_count=1)
    return model.wv
```

Averaging word embeddings loses information about word order and document-level semantics. Document embeddings directly encode entire passages.
Doc2Vec (Paragraph Vectors):
Extends Word2Vec by adding a "document ID" as additional context:
- PV-DM (Distributed Memory): the document vector is combined with context-word vectors to predict the center word.
- PV-DBOW (Distributed Bag of Words): the document vector alone predicts words sampled from the document.
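A minimal gensim sketch, assuming item descriptions are already tokenized and tagged with their (illustrative) item IDs:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    TaggedDocument(words=["action", "packed", "thriller"], tags=["item_1"]),
    TaggedDocument(words=["romantic", "comedy", "in", "new", "york"], tags=["item_2"]),
]

# dm=1 -> PV-DM (doc vector + context words predict the center word)
# dm=0 -> PV-DBOW (doc vector alone predicts sampled words)
model = Doc2Vec(corpus, vector_size=100, window=5, min_count=1, epochs=40, dm=1)

item_vec = model.dv["item_1"]                                    # learned item vector
new_vec = model.infer_vector(["suspense", "thriller", "twist"])  # embed unseen text
```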
Sentence-BERT (SBERT):
Fine-tunes BERT to produce semantically meaningful sentence embeddings: $$\text{emb}(s) = \text{pool}(\text{BERT}(s))$$ where pooling uses either the $[CLS]$ token or the mean of the token embeddings (mean pooling is the usual default in Sentence-BERT models).
Siamese architecture trained on similarity/NLI datasets.
Universal Sentence Encoder:
Google's model specifically designed for semantic similarity tasks. Available in transformer and DAN (deep averaging network) variants.
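A short sketch of loading it from TensorFlow Hub (assumes `tensorflow` and `tensorflow_hub` are installed; the `/4` model is the DAN variant):

```python
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
embeddings = embed([
    "Sci-fi thriller about dreams within dreams",
    "Psychological thriller with twist ending",
])
print(embeddings.shape)  # (2, 512)
```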
Comparison:
| Method | Dimension | Speed | Quality | Training |
|---|---|---|---|---|
| TF-IDF | 10K-100K | Fast | Lexical | None |
| Word2Vec avg | 100-300 | Fast | Moderate | Pretrained |
| Doc2Vec | 50-400 | Medium | Good | On corpus |
| SBERT | 384-768 | Slow | Excellent | Pretrained |
```python
from sentence_transformers import SentenceTransformer
import numpy as np
from typing import List


class SemanticDocumentEncoder:
    """
    Encode documents using transformer-based models.
    Uses Sentence-BERT for high-quality semantic embeddings.
    """

    def __init__(self, model_name: str = 'all-MiniLM-L6-v2'):
        """
        Initialize with a sentence-transformers model.

        Popular options:
        - 'all-MiniLM-L6-v2': Fast, 384 dims, good quality
        - 'all-mpnet-base-v2': Slower, 768 dims, best quality
        - 'paraphrase-multilingual-MiniLM-L12-v2': Multilingual
        """
        self.model = SentenceTransformer(model_name)
        self.embedding_dim = self.model.get_sentence_embedding_dimension()

    def encode(self, texts: List[str], batch_size: int = 32) -> np.ndarray:
        """
        Encode texts into semantic embeddings.

        Args:
            texts: List of text strings
            batch_size: Batch size for encoding

        Returns:
            Matrix of shape (n_texts, embedding_dim)
        """
        embeddings = self.model.encode(
            texts,
            batch_size=batch_size,
            normalize_embeddings=True,
            show_progress_bar=False
        )
        return embeddings

    def similarity_matrix(self, texts: List[str]) -> np.ndarray:
        """Compute pairwise similarity matrix."""
        embeddings = self.encode(texts)
        return embeddings @ embeddings.T


class HybridDocumentEncoder:
    """
    Combines TF-IDF (lexical) with embeddings (semantic).
    """

    def __init__(
        self,
        tfidf_vectorizer,
        semantic_encoder: SemanticDocumentEncoder,
        tfidf_weight: float = 0.3
    ):
        self.tfidf_vectorizer = tfidf_vectorizer
        self.semantic_encoder = semantic_encoder
        self.tfidf_weight = tfidf_weight

    def encode(self, texts: List[str]) -> np.ndarray:
        """Get hybrid TF-IDF + semantic embeddings."""
        # TF-IDF features
        tfidf_features = self.tfidf_vectorizer.transform(texts)
        if hasattr(tfidf_features, 'toarray'):
            tfidf_features = tfidf_features.toarray()
        # Reduce TF-IDF dimensionality if needed
        # (Use SVD/PCA in practice)

        # Semantic embeddings
        semantic_features = self.semantic_encoder.encode(texts)

        # Concatenate (or learn fusion)
        combined = np.hstack([
            self.tfidf_weight * tfidf_features,
            (1 - self.tfidf_weight) * semantic_features
        ])

        # Normalize
        norms = np.linalg.norm(combined, axis=1, keepdims=True)
        return combined / np.maximum(norms, 1e-10)
```

For new projects, start with Sentence-BERT (SBERT). Models like 'all-MiniLM-L6-v2' offer excellent quality at reasonable speed. They capture semantic similarity that TF-IDF misses (synonyms, paraphrases) while being pretrained on massive data.
When to Use TF-IDF:
- Interpretability and easy debugging matter.
- The domain hinges on exact, rare, or technical vocabulary (product codes, niche genres, proper names).
- You need a fast, cheap baseline over long documents or have limited compute.
When to Use Neural Embeddings:
- Synonyms and paraphrases matter: short texts or items described with very different wording.
- You need semantic (or cross-lingual) matching rather than keyword overlap.
- A pretrained model fits your domain, or you have data to fine-tune one.
Hybrid Approaches:
Combine both to get the best of both worlds:
| Criterion | TF-IDF | Embeddings | Hybrid |
|---|---|---|---|
| Synonym handling | ❌ Poor | ✅ Excellent | ✅ Good |
| Rare terms | ✅ Preserved | ❌ OOV issues | ✅ Best |
| Interpretability | ✅ High | ❌ Black box | ⚠️ Partial |
| Speed | ✅ Fast | ⚠️ Medium | ⚠️ Medium |
| Cold-start | ✅ Works | ✅ Works | ✅ Works |
| Domain adaptation | ✅ Easy | ⚠️ May need fine-tune | ✅ Flexible |
Preprocessing Pipeline:
- Clean the raw text first: strip markup, normalize whitespace and casing, remove boilerplate.
- For TF-IDF: tokenize, remove (domain-specific) stopwords, optionally stem or lemmatize, and add n-grams.
- For transformer models: pass lightly cleaned raw text and let the model's own subword tokenizer handle the rest.
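A minimal sketch of such a pipeline (the helper names are illustrative):

```python
import re
from typing import List

def clean_text(text: str) -> str:
    """Light cleaning shared by both representations."""
    text = re.sub(r"<[^>]+>", " ", text)       # strip leftover HTML tags
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    return text

def tokenize_for_tfidf(text: str) -> List[str]:
    """Aggressive normalization only on the TF-IDF path."""
    return re.findall(r"\b[a-z]{2,}\b", clean_text(text).lower())

# Transformer models receive clean_text(text) directly and apply
# their own subword tokenization internally.
```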
Scaling Considerations:
- Store TF-IDF matrices in sparse format; dense storage explodes at 10K-100K dimensions.
- Precompute and cache item embeddings, and encode new items in batches (on GPU for transformer models).
- Use a nearest-neighbor index for retrieval over large catalogs instead of scoring every item per request.
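For the retrieval side, a minimal sketch with FAISS (an assumed dependency), using inner product over L2-normalized vectors as cosine similarity; the random matrix is a stand-in for real item embeddings:

```python
import numpy as np
import faiss

d = 384                                                       # e.g. all-MiniLM-L6-v2 dimension
item_embeddings = np.random.rand(10000, d).astype("float32")  # stand-in for real item vectors
faiss.normalize_L2(item_embeddings)                           # in-place L2 normalization

index = faiss.IndexFlatIP(d)   # inner product == cosine on normalized vectors
index.add(item_embeddings)

query = item_embeddings[:1]                 # stand-in for a user-profile vector
scores, item_ids = index.search(query, 10)  # top-10 most similar items
```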
If you update embedding models, old item/user embeddings become incompatible. Plan for full re-encoding or version management to handle model updates gracefully.
What's Next:
With text representations mastered, we'll explore hybrid approaches that combine content-based methods with collaborative filtering, creating systems that leverage both the content of items and the wisdom of the crowd.
You now command the full spectrum of text representation techniques for recommendation systems—from classical TF-IDF to cutting-edge neural embeddings—and understand when to apply each approach.