Consider a short document about machine learning: a few sentences in which stop words such as "the" appear repeatedly alongside a handful of technical terms.
In simple averaging, the word "the" (appearing 3-4 times) dominates the embedding, even though it tells us nothing about the document's topic. Meanwhile, the truly informative words—"neural," "network," "gradient," "optimizes"—each contribute only a single vote.
TF-IDF weighted Word2Vec addresses this imbalance by assigning each word an importance weight derived from its statistical properties in the corpus. Words that are frequent in a specific document but rare across the corpus receive high weights; words appearing everywhere receive low weights.
By the end of this page, you will know how to combine TF-IDF weights with Word2Vec embeddings, implement the weighted averaging scheme correctly, choose among variations (sublinear TF, different IDF formulations), and recognize when TF-IDF weighting helps most. This technique often provides a significant boost over simple averaging at minimal additional cost.
Before combining TF-IDF with Word2Vec, let's ensure we have a solid understanding of TF-IDF itself.
Term Frequency (TF):
Measures how often a term appears in a document. For term t in document d:
$$\text{TF}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}$$
where f_{t,d} is the raw count of term t in document d. This normalization ensures TF values sum to 1 within each document.
Inverse Document Frequency (IDF):
Measures how rare or common a term is across the entire corpus:
$$\text{IDF}(t) = \log \frac{N}{\text{df}(t)}$$
where N is the total number of documents and df(t) is the number of documents containing term t. Rare terms have high IDF; common terms have low (or even negative) IDF.
TF-IDF:
The product captures importance: terms that are frequent in a specific document but rare across the corpus receive the highest scores:
$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$$
| Scenario | TF | IDF | TF-IDF | Interpretation |
|---|---|---|---|---|
| 'machine' in ML paper | High (10 occurrences) | High (rare in general) | Very High | Highly discriminative for this doc |
| 'the' in any doc | High (frequent) | Very Low (in all docs) | Low | Not discriminative |
| 'zebra' in ML paper | Very Low (0-1) | Very High (rare) | Low | Rare but not characteristic of this doc |
| 'algorithm' in ML paper | Medium | Medium | Medium-High | Somewhat discriminative |
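To make the interplay concrete, here is a minimal sketch that computes TF, IDF, and TF-IDF directly from the formulas above; the toy corpus and term choices are illustrative, not part of this page's examples.

```python
import math
from collections import Counter

# Toy corpus: three tokenized documents (illustrative only)
corpus = [
    "the machine learning model trains the network".split(),
    "the cat sat on the mat".split(),
    "the network optimizes the gradient".split(),
]
N = len(corpus)

# Document frequency: number of documents containing each term
df = Counter()
for doc in corpus:
    df.update(set(doc))

def tf(term, doc):
    """Normalized term frequency: count of term / total tokens in doc."""
    return doc.count(term) / len(doc)

def idf(term):
    """Standard IDF: log(N / df(t))."""
    return math.log(N / df[term])

doc = corpus[0]
for term in ["the", "machine"]:
    score = tf(term, doc) * idf(term)
    print(f"{term!r}: TF={tf(term, doc):.3f}, IDF={idf(term):.3f}, TF-IDF={score:.3f}")

# 'the' appears in every document, so IDF = log(3/3) = 0 and its TF-IDF is 0;
# 'machine' appears in only one document, so it receives a positive weight.
```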
There are several IDF variants. Smooth IDF adds 1 to both the document count and the document frequency (as if one extra document contained every term), which avoids division by zero for unseen terms: log((N + 1) / (df(t) + 1)) + 1. Probabilistic IDF uses log((N - df(t)) / df(t)) for a more theoretically motivated formulation. Most libraries, including scikit-learn, use smooth IDF by default.
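The variants behave quite differently as a term becomes more common. The following minimal sketch compares the three formulations side by side; the corpus size and document-frequency values are illustrative.

```python
import math

N = 1000  # total number of documents (illustrative)

for df in [1, 10, 500, 900]:
    standard = math.log(N / df)
    smooth = math.log((N + 1) / (df + 1)) + 1        # sklearn-style smooth IDF
    probabilistic = math.log((N - df) / df)
    print(f"df={df:>4}: standard={standard:7.3f}  "
          f"smooth={smooth:7.3f}  probabilistic={probabilistic:7.3f}")

# Standard and smooth IDF stay non-negative here; probabilistic IDF turns
# negative once a term appears in more than half the documents (df > N/2).
```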
TF-IDF weighted Word2Vec replaces the uniform average with a weighted average:
$$\mathbf{v}_D = \frac{\sum_{w \in D} \text{TF-IDF}(w, D) \cdot \mathbf{v}_w}{\sum_{w \in D} \text{TF-IDF}(w, D)}$$
Equivalently, using normalized weights:
$$\mathbf{v}_D = \sum_{w \in D} \alpha_w \cdot \mathbf{v}_w$$
where: $$\alpha_w = \frac{\text{TF-IDF}(w, D)}{\sum_{w' \in D} \text{TF-IDF}(w', D)}$$
Key insight: the weights α_w now depend on two factors: how often a word appears in this particular document (the TF component) and how rare the word is across the corpus (the IDF component).
```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from typing import List, Dict


class TFIDFWeightedWord2Vec:
    """
    Document embeddings via TF-IDF weighted averaging of word vectors.
    """

    def __init__(self, word_vectors, embedding_dim: int = 200):
        """
        Args:
            word_vectors: Trained word embedding model (gensim or dict)
            embedding_dim: Dimension of embeddings
        """
        self.word_vectors = word_vectors
        self.embedding_dim = embedding_dim
        self.tfidf_vectorizer = None
        self.vocabulary_ = None
        self.idf_ = None

    def fit(self, documents: List[str]):
        """
        Fit TF-IDF on the corpus.

        Args:
            documents: List of raw text documents (not tokenized)
        """
        self.tfidf_vectorizer = TfidfVectorizer(
            lowercase=True,
            sublinear_tf=True,  # Use log(1 + tf) instead of raw tf
            norm=None  # Don't L2 normalize—we normalize embeddings instead
        )
        self.tfidf_vectorizer.fit(documents)

        # Store vocabulary and IDF values for efficient lookup
        self.vocabulary_ = self.tfidf_vectorizer.vocabulary_
        self.idf_ = dict(zip(
            self.tfidf_vectorizer.get_feature_names_out(),
            self.tfidf_vectorizer.idf_
        ))
        return self

    def get_document_vector(self, document: str) -> np.ndarray:
        """
        Compute TF-IDF weighted average embedding for a document.

        Args:
            document: Raw text string

        Returns:
            Weighted average embedding
        """
        if self.tfidf_vectorizer is None:
            raise ValueError("Must call fit() before transform()")

        # Get TF-IDF representation
        tfidf_vector = self.tfidf_vectorizer.transform([document])

        # Get feature names and their TF-IDF values
        feature_names = self.tfidf_vectorizer.get_feature_names_out()
        tfidf_array = tfidf_vector.toarray()[0]

        # Compute weighted average
        weighted_sum = np.zeros(self.embedding_dim)
        total_weight = 0.0

        for word, weight in zip(feature_names, tfidf_array):
            if weight > 0 and word in self.word_vectors:
                weighted_sum += weight * self.word_vectors[word]
                total_weight += weight

        if total_weight > 0:
            return weighted_sum / total_weight
        else:
            return np.zeros(self.embedding_dim)

    def transform(self, documents: List[str]) -> np.ndarray:
        """
        Transform documents to TF-IDF weighted embeddings.

        Args:
            documents: List of raw text documents

        Returns:
            Matrix of shape (n_documents, embedding_dim)
        """
        return np.array([self.get_document_vector(doc) for doc in documents])

    def fit_transform(self, documents: List[str]) -> np.ndarray:
        """Fit and transform in one step."""
        return self.fit(documents).transform(documents)


# Example usage
documents = [
    "The neural network processes data efficiently through multiple hidden layers.",
    "Deep learning algorithms optimize gradient descent for better convergence.",
    "The cat sat on the mat near the window overlooking the garden.",
]

# Create weighted embedder
# (word_vectors: a pre-trained embedding model, e.g. gensim KeyedVectors, assumed loaded)
weighted_embedder = TFIDFWeightedWord2Vec(word_vectors)
doc_embeddings = weighted_embedder.fit_transform(documents)

print(f"Document embeddings shape: {doc_embeddings.shape}")

# Compare with simple average
from sklearn.metrics.pairwise import cosine_similarity

print("Document similarity matrix (TF-IDF weighted):")
print(cosine_similarity(doc_embeddings).round(3))
```

The IDF values depend on the corpus used for fitting. If you fit on a general corpus and apply to domain-specific documents (or vice versa), the weights may be misleading. For best results, fit TF-IDF on a corpus representative of your target domain and use case.
There are several variations in how to compute and apply TF-IDF weights. Each has different properties:
1. Sublinear TF scaling:
Instead of raw term frequency, use logarithmic scaling:
$$\text{TF}_{\text{sublinear}}(t, d) = 1 + \log(f_{t,d}) \quad \text{if } f_{t,d} > 0$$
This reduces the impact of highly repeated words. A word appearing 100 times doesn't contribute 100x more than one appearing once; the relationship is logarithmic (1 + ln 100 ≈ 5.6).
2. Binary TF:
Simplest variant—just check presence/absence:
$$\text{TF}_{\text{binary}}(t, d) = \mathbf{1}[t \in d]$$
Useful when word presence matters more than frequency.
3. IDF-only weighting:
Some applications use only IDF weights (ignoring TF), which gives corpus-level importance:
$$\mathbf{v}_D = \frac{\sum_{w \in D} \text{IDF}(w) \cdot \mathbf{v}_w}{\sum_{w \in D} \text{IDF}(w)}$$
```python
import numpy as np
from collections import Counter
from typing import List, Dict


def compute_idf(documents: List[List[str]]) -> Dict[str, float]:
    """
    Compute IDF values from a corpus of tokenized documents.
    Uses smooth IDF: log((N + 1) / (df + 1)) + 1
    """
    N = len(documents)
    df = Counter()

    for doc in documents:
        # Count each word once per document (document frequency)
        unique_words = set(doc)
        df.update(unique_words)

    idf = {}
    for word, doc_freq in df.items():
        # Smooth IDF with additive smoothing
        idf[word] = np.log((N + 1) / (doc_freq + 1)) + 1

    return idf


def weighted_average_embedding(
    words: List[str],
    word_vectors,
    idf_values: Dict[str, float],
    tf_scheme: str = 'sublinear',  # 'raw', 'binary', 'sublinear'
    use_tf: bool = True,           # If False, use IDF-only
    normalize_output: bool = True
) -> np.ndarray:
    """
    Compute weighted average with configurable TF and IDF schemes.
    """
    embedding_dim = word_vectors.vector_size
    word_counts = Counter(words)

    weighted_sum = np.zeros(embedding_dim)
    total_weight = 0.0

    for word, count in word_counts.items():
        if word not in word_vectors:
            continue

        # Compute TF component
        if tf_scheme == 'raw':
            tf = count / len(words)
        elif tf_scheme == 'binary':
            tf = 1.0
        elif tf_scheme == 'sublinear':
            tf = 1 + np.log(count)
        else:
            raise ValueError(f"Unknown TF scheme: {tf_scheme}")

        # Get IDF component (default to 1.0 for unknown words)
        idf = idf_values.get(word, 1.0)

        # Compute weight
        if use_tf:
            weight = tf * idf
        else:
            weight = idf  # IDF-only

        weighted_sum += weight * word_vectors[word]
        total_weight += weight

    if total_weight > 0:
        result = weighted_sum / total_weight
    else:
        result = weighted_sum

    if normalize_output:
        norm = np.linalg.norm(result)
        if norm > 0:
            result = result / norm

    return result


# Comparison of different schemes
def compare_weighting_schemes(doc: List[str], word_vectors, idf_values):
    """Compare embeddings from different TF-IDF schemes."""
    schemes = [
        {'tf_scheme': 'raw', 'use_tf': True, 'label': 'Raw TF × IDF'},
        {'tf_scheme': 'sublinear', 'use_tf': True, 'label': 'Sublinear TF × IDF'},
        {'tf_scheme': 'binary', 'use_tf': True, 'label': 'Binary TF × IDF'},
        {'tf_scheme': 'sublinear', 'use_tf': False, 'label': 'IDF only'},
    ]

    embeddings = {}
    for scheme in schemes:
        label = scheme.pop('label')
        embeddings[label] = weighted_average_embedding(
            doc, word_vectors, idf_values, **scheme
        )
        scheme['label'] = label  # Restore for next iteration

    # Show pairwise similarities
    labels = list(embeddings.keys())
    vectors = np.array(list(embeddings.values()))

    from sklearn.metrics.pairwise import cosine_similarity
    sim_matrix = cosine_similarity(vectors)

    print("Similarity between different weighting schemes:")
    for i, label_i in enumerate(labels):
        for j, label_j in enumerate(labels):
            if i < j:
                print(f"  {label_i} vs {label_j}: {sim_matrix[i,j]:.4f}")

    return embeddings


# Build IDF from corpus
corpus = [
    "machine learning algorithms process data".split(),
    "neural networks learn patterns from data".split(),
    "deep learning uses gradient optimization".split(),
    "the cat sat on the mat".split(),
]

idf = compute_idf(corpus)
print("IDF values for key terms:")
for word in ['machine', 'learning', 'the', 'data', 'cat']:
    print(f"  '{word}': {idf.get(word, 0):.4f}")

# Compare schemes on a document
# (word_vectors: pre-trained embeddings, e.g. gensim KeyedVectors, assumed loaded)
test_doc = "machine learning algorithms process the data efficiently".split()
compare_weighting_schemes(test_doc, word_vectors, idf)
```

For most applications, sublinear TF with smooth IDF is the best default. Sublinear scaling prevents very frequent words from dominating, while smooth IDF handles the rare case of words not seen during training.
Smooth IDF matches scikit-learn's TfidfVectorizer default (smooth_idf=True); sublinear TF, however, is not the sklearn default and must be enabled explicitly with sublinear_tf=True, as in the first code example above.
A challenge arises when documents contain words not seen during IDF fitting. There are several strategies:
1. Default IDF value:
Assign a default IDF (often max IDF or average IDF) to unknown words:
$$\text{IDF}_{\text{default}} = \max_{w \in V} \text{IDF}(w)$$
Rationale: Unknown words are likely rare (hence high IDF).
2. Zero weight:
Simply ignore words without IDF values (equivalent to setting their weight to 0). This restricts the effective vocabulary to the intersection of the TF-IDF vocabulary and the Word2Vec vocabulary.
3. Uniform weight (IDF = 1):
Treat unknown words as having IDF = 1, so their weight equals their TF. This is a neutral assumption.
4. OOV-specific IDF:
Estimate IDF for unseen words using a global OOV rate or by treating all OOV words as a single pseudo-word (a sketch of the pseudo-word approach follows below).
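The class below covers the first three strategies; the pseudo-word idea is sketched separately here. In this minimal sketch (the `OOV_TOKEN` symbol and function name are hypothetical), every token outside the fitted vocabulary is mapped to a single pseudo-word whose document frequency is counted like any other term:

```python
import numpy as np
from collections import Counter
from typing import Dict, List

OOV_TOKEN = "<OOV>"  # hypothetical pseudo-word standing in for all OOV terms


def compute_idf_with_oov(documents: List[List[str]], vocabulary: set) -> Dict[str, float]:
    """Smooth IDF over the known vocabulary, plus one shared IDF for OOV tokens.

    Any word outside `vocabulary` is replaced by OOV_TOKEN before counting
    document frequencies, so the pseudo-word's IDF reflects how often *some*
    unknown word shows up in a document.
    """
    N = len(documents)
    df = Counter()
    for doc in documents:
        mapped = {w if w in vocabulary else OOV_TOKEN for w in doc}
        df.update(mapped)
    return {w: np.log((N + 1) / (c + 1)) + 1 for w, c in df.items()}


# Usage sketch: unknown words fall back to the pseudo-word's IDF
# idf = compute_idf_with_oov(tokenized_corpus, known_vocab)
# weight = idf.get(word, idf.get(OOV_TOKEN, 1.0))
```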
```python
import numpy as np
from collections import Counter
from typing import List, Dict, Optional


class RobustTFIDFWeightedW2V:
    """
    TF-IDF weighted Word2Vec with robust OOV handling.
    """

    def __init__(
        self,
        word_vectors,
        embedding_dim: int = 200,
        oov_idf_strategy: str = 'max',  # 'max', 'mean', 'fixed', 'zero'
        fixed_oov_idf: float = 1.0
    ):
        """
        Args:
            word_vectors: Word embedding model
            embedding_dim: Dimension of embeddings
            oov_idf_strategy: How to handle words not in IDF vocabulary
            fixed_oov_idf: IDF value for OOV words when using 'fixed' strategy
        """
        self.word_vectors = word_vectors
        self.embedding_dim = embedding_dim
        self.oov_idf_strategy = oov_idf_strategy
        self.fixed_oov_idf = fixed_oov_idf
        self.idf_ = None
        self.max_idf_ = None
        self.mean_idf_ = None

    def fit(self, documents: List[List[str]]):
        """Compute IDF values from a tokenized corpus."""
        N = len(documents)
        df = Counter()
        for doc in documents:
            df.update(set(doc))

        self.idf_ = {
            word: np.log((N + 1) / (count + 1)) + 1
            for word, count in df.items()
        }

        # Precompute OOV fallback values
        idf_values = list(self.idf_.values())
        self.max_idf_ = max(idf_values) if idf_values else 1.0
        self.mean_idf_ = np.mean(idf_values) if idf_values else 1.0

        return self

    def _get_idf(self, word: str) -> float:
        """Get IDF for a word, handling OOV cases."""
        if word in self.idf_:
            return self.idf_[word]

        # OOV handling
        if self.oov_idf_strategy == 'max':
            return self.max_idf_
        elif self.oov_idf_strategy == 'mean':
            return self.mean_idf_
        elif self.oov_idf_strategy == 'fixed':
            return self.fixed_oov_idf
        elif self.oov_idf_strategy == 'zero':
            return 0.0
        else:
            raise ValueError(f"Unknown OOV strategy: {self.oov_idf_strategy}")

    def get_document_vector(
        self,
        words: List[str],
        tf_scheme: str = 'sublinear'
    ) -> np.ndarray:
        """Compute weighted embedding for tokenized document."""
        if self.idf_ is None:
            raise ValueError("Must call fit() first")

        word_counts = Counter(words)
        weighted_sum = np.zeros(self.embedding_dim)
        total_weight = 0.0

        for word, count in word_counts.items():
            # Skip words not in word vector vocabulary
            if word not in self.word_vectors:
                continue

            # Compute TF
            if tf_scheme == 'sublinear':
                tf = 1 + np.log(count)
            elif tf_scheme == 'raw':
                tf = count / len(words)
            else:
                tf = 1.0

            # Get IDF (with OOV handling)
            idf = self._get_idf(word)

            weight = tf * idf
            weighted_sum += weight * self.word_vectors[word]
            total_weight += weight

        if total_weight > 0:
            return weighted_sum / total_weight
        return np.zeros(self.embedding_dim)

    def transform(self, documents: List[List[str]]) -> np.ndarray:
        """Transform list of tokenized documents."""
        return np.array([self.get_document_vector(doc) for doc in documents])


# Compare OOV strategies
def compare_oov_strategies(test_doc, train_corpus, word_vectors):
    """Compare different OOV IDF handling strategies."""
    strategies = ['max', 'mean', 'fixed', 'zero']
    results = {}

    for strategy in strategies:
        embedder = RobustTFIDFWeightedW2V(
            word_vectors,
            oov_idf_strategy=strategy,
            fixed_oov_idf=1.0
        )
        embedder.fit(train_corpus)

        # Find OOV words
        oov_words = [w for w in test_doc if w not in embedder.idf_]

        vec = embedder.get_document_vector(test_doc)
        results[strategy] = {
            'vector': vec,
            'norm': np.linalg.norm(vec),
            'oov_count': len(oov_words)
        }

    print(f"Test document has {results['max']['oov_count']} OOV words")
    print("Embedding norms by strategy:")
    for strategy, data in results.items():
        print(f"  {strategy}: {data['norm']:.4f}")

    return results
```

For production systems, using 'max' or 'mean' IDF for OOV words works well.
The 'max' strategy assumes unknown words are informative (reasonable for domain-specific terms), while 'mean' is a conservative middle ground. Avoid 'zero' unless you specifically want to ignore OOV words entirely.
When and how much does TF-IDF weighting help over simple averaging? The answer depends on document characteristics and task requirements.
| Scenario | Simple Avg | TF-IDF Weighted | Reason |
|---|---|---|---|
| Long documents with repeated keywords | Worse | Better | Keyword repetition gets logarithmic (not linear) boost |
| Documents with many stop words | Worse | Better | Stop words get low IDF → low weight |
| Comparing documents of different lengths | Worse | Better | Normalization handles length variation better |
| Short, focused documents | Similar | Similar | Few words → weighting has less impact |
| Highly technical domains | Worse | Better | Technical terms get high IDF → high weight |
| Informal text (tweets, comments) | Similar | Similar | Less vocabulary variation, shorter text |
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from collections import Counter


def empirical_comparison(
    documents: list,
    word_vectors,
    labels: list = None
):
    """
    Empirically compare simple average vs TF-IDF weighted embeddings.
    """
    # Compute IDF
    tokenized_docs = [doc.lower().split() for doc in documents]
    N = len(tokenized_docs)
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))
    idf = {w: np.log((N + 1) / (c + 1)) + 1 for w, c in df.items()}

    simple_embeddings = []
    weighted_embeddings = []

    for doc_words in tokenized_docs:
        # Simple average
        simple_vecs = [word_vectors[w] for w in doc_words if w in word_vectors]
        if simple_vecs:
            simple_emb = np.mean(simple_vecs, axis=0)
            simple_emb = simple_emb / (np.linalg.norm(simple_emb) + 1e-9)
        else:
            simple_emb = np.zeros(word_vectors.vector_size)

        # TF-IDF weighted
        word_counts = Counter(doc_words)
        weighted_sum = np.zeros(word_vectors.vector_size)
        total_weight = 0
        for word, count in word_counts.items():
            if word in word_vectors:
                tf = 1 + np.log(count)
                w_idf = idf.get(word, 1.0)
                weight = tf * w_idf
                weighted_sum += weight * word_vectors[word]
                total_weight += weight

        if total_weight > 0:
            weighted_emb = weighted_sum / total_weight
            weighted_emb = weighted_emb / (np.linalg.norm(weighted_emb) + 1e-9)
        else:
            weighted_emb = np.zeros(word_vectors.vector_size)

        simple_embeddings.append(simple_emb)
        weighted_embeddings.append(weighted_emb)

    simple_mat = np.array(simple_embeddings)
    weighted_mat = np.array(weighted_embeddings)

    # Compute similarity matrices
    simple_sim = cosine_similarity(simple_mat)
    weighted_sim = cosine_similarity(weighted_mat)

    # How different are the similarities?
    diff = np.abs(simple_sim - weighted_sim)
    upper_indices = np.triu_indices_from(diff, k=1)
    mean_diff = diff[upper_indices].mean()
    max_diff = diff[upper_indices].max()

    print(f"Mean difference in similarities: {mean_diff:.4f}")
    print(f"Max difference in similarities: {max_diff:.4f}")

    # Show specific comparisons if labels provided
    if labels:
        print("Pairwise similarities:")
        print(f"{'Pair':<30} {'Simple':<10} {'Weighted':<10} {'Diff'}")
        print("-" * 60)
        for i in range(len(documents)):
            for j in range(i + 1, len(documents)):
                label = f"{labels[i][:12]} vs {labels[j][:12]}"
                print(f"{label:<30} {simple_sim[i,j]:.4f} {weighted_sim[i,j]:.4f} {diff[i,j]:.4f}")

    return simple_mat, weighted_mat


# Example with contrasting documents
documents = [
    "The machine learning algorithm processes data through neural network layers efficiently",
    "Deep neural networks learn patterns from data using backpropagation optimization",
    "The cat sat on the mat while the dog slept on the rug in the house",
    "Python programming language is used for machine learning and data science",
]

labels = ["ML Algo", "Deep Learning", "Cat/Dog", "Python ML"]

simple, weighted = empirical_comparison(documents, word_vectors, labels)
```

In text classification and semantic similarity tasks, TF-IDF weighting typically improves over simple averaging by 2-5% in accuracy or correlation. The gain is most pronounced for longer documents with varied vocabulary. For short texts (tweets, queries), the improvement is minimal.
For production systems processing millions of documents, efficiency matters. Here are optimizations for the TF-IDF weighted approach:
```python
import numpy as np
from collections import Counter
from typing import List, Dict
from concurrent.futures import ThreadPoolExecutor


class EfficientTFIDFWord2Vec:
    """
    Optimized TF-IDF weighted Word2Vec for production scale.
    """

    def __init__(
        self,
        word_vectors,
        embedding_dim: int = 200,
        batch_size: int = 1000,
        n_workers: int = 4
    ):
        self.embedding_dim = embedding_dim
        self.batch_size = batch_size
        self.n_workers = n_workers

        # Convert word vectors to numpy array for faster access
        self.vocab = list(word_vectors.key_to_index.keys())
        self.word_to_idx = {w: i for i, w in enumerate(self.vocab)}
        self.embedding_matrix = np.array([
            word_vectors[w] for w in self.vocab
        ])

        self.idf_ = None
        self.idf_array_ = None  # Aligned with vocab for vectorized ops

    def fit(self, tokenized_documents: List[List[str]]):
        """Compute IDF from corpus."""
        N = len(tokenized_documents)
        df = Counter()
        for doc in tokenized_documents:
            df.update(set(doc))

        # Store IDF as dict
        self.idf_ = {
            word: np.log((N + 1) / (count + 1)) + 1
            for word, count in df.items()
        }

        # Create aligned IDF array for vectorized operations
        self.idf_array_ = np.array([
            self.idf_.get(w, 1.0) for w in self.vocab
        ])

        return self

    def _process_single(self, words: List[str]) -> np.ndarray:
        """Process a single document (optimized)."""
        word_counts = Counter(words)

        # Get indices of words in vocab
        valid_words = [w for w in word_counts if w in self.word_to_idx]
        if not valid_words:
            return np.zeros(self.embedding_dim)

        indices = np.array([self.word_to_idx[w] for w in valid_words])
        counts = np.array([word_counts[w] for w in valid_words])

        # Sublinear TF
        tf = 1 + np.log(counts)

        # Get IDF values (already aligned)
        idf_vals = self.idf_array_[indices]

        # Weights
        weights = tf * idf_vals

        # Weighted sum using matrix indexing
        embeddings = self.embedding_matrix[indices]  # (n_words, dim)
        weighted_sum = np.sum(embeddings * weights[:, np.newaxis], axis=0)

        return weighted_sum / weights.sum()

    def transform(self, tokenized_documents: List[List[str]]) -> np.ndarray:
        """Transform documents with parallel processing."""
        results = []

        # Process in batches for memory efficiency
        for batch_start in range(0, len(tokenized_documents), self.batch_size):
            batch = tokenized_documents[batch_start:batch_start + self.batch_size]

            # Parallel processing within batch
            with ThreadPoolExecutor(max_workers=self.n_workers) as executor:
                batch_results = list(executor.map(self._process_single, batch))

            results.extend(batch_results)

        return np.array(results)

    def transform_single(self, words: List[str]) -> np.ndarray:
        """Transform a single document (for online inference)."""
        return self._process_single(words)


# Memory-efficient streaming for very large corpora
class StreamingTFIDFWord2Vec:
    """
    Streaming version that doesn't load full corpus into memory.
    Computes IDF in a streaming fashion.
    """

    def __init__(self, word_vectors, embedding_dim: int = 200):
        self.word_vectors = word_vectors
        self.embedding_dim = embedding_dim
        self.df_ = Counter()
        self.n_docs_ = 0
        self.idf_ = None

    def partial_fit(self, documents: List[List[str]]):
        """Update IDF counts with new documents."""
        self.n_docs_ += len(documents)
        for doc in documents:
            self.df_.update(set(doc))
        return self

    def finalize(self):
        """Compute final IDF values after all partial fits."""
        self.idf_ = {
            word: np.log((self.n_docs_ + 1) / (count + 1)) + 1
            for word, count in self.df_.items()
        }
        return self

    # ... transform methods same as before


# Benchmark
def benchmark_implementations(documents, word_vectors, n_repeats=3):
    """Compare implementation speeds."""
    import time

    # Standard implementation
    standard = TFIDFWeightedWord2Vec(word_vectors)

    # Efficient implementation
    efficient = EfficientTFIDFWord2Vec(word_vectors)

    # Tokenize
    tokenized = [doc.lower().split() for doc in documents]

    # Fit both
    standard.fit(documents)
    efficient.fit(tokenized)

    # Benchmark transform
    for impl_name, impl, data in [
        ('Standard', standard, documents),
        ('Efficient', efficient, tokenized)
    ]:
        times = []
        for _ in range(n_repeats):
            start = time.time()
            impl.transform(data)
            times.append(time.time() - start)

        print(f"{impl_name}: {np.mean(times):.4f}s (±{np.std(times):.4f}s)")
```

The main optimizations are: (1) precompute an embedding matrix aligned with the vocabulary for vectorized lookups, (2) use numpy operations instead of Python loops, (3) parallelize batch transforms, and (4) compute IDF in a streaming fashion for memory efficiency with large corpora. These can speed up processing by 10-50x for large datasets.
TF-IDF weighted Word2Vec integrates smoothly with scikit-learn pipelines for end-to-end training and inference:
```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, GridSearchCV
from collections import Counter


class TFIDFWord2VecTransformer(BaseEstimator, TransformerMixin):
    """
    Sklearn-compatible TF-IDF weighted Word2Vec transformer.
    """

    def __init__(
        self,
        word_vectors,
        tf_scheme: str = 'sublinear',
        normalize: bool = True,
        min_df: int = 1
    ):
        """
        Args:
            word_vectors: Word embedding model
            tf_scheme: 'raw', 'sublinear', or 'binary'
            normalize: L2-normalize output embeddings
            min_df: Minimum document frequency for IDF
        """
        self.word_vectors = word_vectors
        self.tf_scheme = tf_scheme
        self.normalize = normalize
        self.min_df = min_df
        self.embedding_dim_ = word_vectors.vector_size
        self.idf_ = None

    def fit(self, X, y=None):
        """Compute IDF from training documents."""
        N = len(X)
        df = Counter()

        for doc in X:
            words = doc if isinstance(doc, list) else doc.split()
            df.update(set(words))

        # Filter by min_df and compute IDF
        self.idf_ = {
            word: np.log((N + 1) / (count + 1)) + 1
            for word, count in df.items()
            if count >= self.min_df
        }

        return self

    def transform(self, X):
        """Transform documents to embeddings."""
        embeddings = []

        for doc in X:
            words = doc if isinstance(doc, list) else doc.split()
            word_counts = Counter(words)

            weighted_sum = np.zeros(self.embedding_dim_)
            total_weight = 0

            for word, count in word_counts.items():
                if word not in self.word_vectors:
                    continue

                # TF
                if self.tf_scheme == 'sublinear':
                    tf = 1 + np.log(count)
                elif self.tf_scheme == 'binary':
                    tf = 1.0
                else:  # raw
                    tf = count / len(words)

                # IDF
                idf = self.idf_.get(word, 1.0)

                weight = tf * idf
                weighted_sum += weight * self.word_vectors[word]
                total_weight += weight

            if total_weight > 0:
                embedding = weighted_sum / total_weight
            else:
                embedding = np.zeros(self.embedding_dim_)

            if self.normalize:
                norm = np.linalg.norm(embedding)
                if norm > 0:
                    embedding = embedding / norm

            embeddings.append(embedding)

        return np.array(embeddings)


# Create complete classification pipeline
def create_classification_pipeline(word_vectors):
    """
    Create a complete text classification pipeline using
    TF-IDF weighted Word2Vec embeddings.
    """
    return Pipeline([
        ('tfidf_w2v', TFIDFWord2VecTransformer(
            word_vectors,
            tf_scheme='sublinear',
            normalize=True
        )),
        ('classifier', LogisticRegression(
            max_iter=1000,
            solver='lbfgs'
        ))
    ])


# Hyperparameter tuning
def tune_pipeline(X_train, y_train, word_vectors):
    """Grid search over embedding and classifier parameters."""
    pipeline = create_classification_pipeline(word_vectors)

    param_grid = {
        'tfidf_w2v__tf_scheme': ['sublinear', 'binary', 'raw'],
        'tfidf_w2v__min_df': [1, 2, 5],
        'classifier__C': [0.1, 1.0, 10.0],
        'classifier__penalty': ['l2'],
    }

    grid_search = GridSearchCV(
        pipeline, param_grid,
        cv=5, scoring='accuracy',
        n_jobs=-1, verbose=1
    )
    grid_search.fit(X_train, y_train)

    print(f"Best parameters: {grid_search.best_params_}")
    print(f"Best CV score: {grid_search.best_score_:.4f}")

    return grid_search.best_estimator_


# Usage example (assumes X_train, y_train, X_test are defined)
# X_train = ["document one", "document two", ...]
# y_train = [0, 1, ...]

pipeline = create_classification_pipeline(word_vectors)
scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(f"CV Accuracy: {scores.mean():.3f} (+/- {scores.std()*2:.3f})")

# Fit final model
pipeline.fit(X_train, y_train)

# Save for production
import joblib
joblib.dump(pipeline, 'tfidf_w2v_classifier.joblib')

# Load and predict
loaded_pipeline = joblib.load('tfidf_w2v_classifier.joblib')
predictions = loaded_pipeline.predict(X_test)
```

TF-IDF weighted Word2Vec sits between simple averaging and more complex methods. Here's guidance on when it's the right choice:
Start with TF-IDF weighting as your default for any document-level task (classification, clustering, retrieval). Only fall back to simple averaging if (1) TF-IDF provides no measurable improvement, (2) you're constrained on preprocessing complexity, or (3) your documents are extremely short. The computational overhead of TF-IDF is minimal.
We've explored the principled combination of TF-IDF weighting with Word2Vec embeddings. The key ideas to consolidate:

- TF-IDF assigns each word a weight that is high when the word is frequent in a document but rare in the corpus, and low for ubiquitous words such as stop words.
- Replacing the uniform average of word vectors with a TF-IDF weighted average lets informative words dominate the document embedding.
- Sublinear TF with smooth IDF is a strong default; binary TF and IDF-only weighting are useful variations.
- OOV words need an explicit IDF fallback; 'max' or 'mean' IDF works well in practice, while 'zero' silently discards them.
- The weighting helps most for longer documents with many stop words or technical vocabulary, and adds little for very short texts.
- Vectorized lookups, batching, and streaming IDF computation make the approach efficient at scale, and an sklearn-compatible transformer drops it straight into classification pipelines.
What's next:
Having mastered the combination of TF-IDF with Word2Vec for document representation, we now turn to fundamentally different approaches to word embeddings. The next pages cover GloVe (Global Vectors for Word Representation)—which learns embeddings through global matrix factorization rather than local context prediction—and FastText—which extends Word2Vec to handle subword information for better generalization.
You now have a comprehensive understanding of TF-IDF weighted Word2Vec—from the mathematical formulation through implementation variations to production-ready sklearn integration. This technique provides a meaningful improvement over simple averaging at minimal additional cost, making it an excellent default for document-level NLP tasks.