Consider a short document about machine learning: a few sentences in which stop words such as "the" appear repeatedly alongside a handful of technical terms.
In simple averaging, the word "the" (appearing 3-4 times) dominates the embedding, even though it tells us nothing about the document's topic. Meanwhile, the truly informative words—"neural," "network," "gradient," "optimizes"—each contribute only a single vote.
TF-IDF weighted Word2Vec addresses this imbalance by assigning each word an importance weight derived from its statistical properties in the corpus. Words that are frequent in a specific document but rare across the corpus receive high weights; words appearing everywhere receive low weights.
By the end of this page, you will know how to combine TF-IDF weights with Word2Vec embeddings, implement the weighted averaging scheme correctly, choose among variations (sublinear TF, different IDF formulations), and recognize when TF-IDF weighting helps most. This technique often provides a significant boost over simple averaging at minimal additional cost.
Before combining TF-IDF with Word2Vec, let's ensure we have a solid understanding of TF-IDF itself.
Term Frequency (TF):
Measures how often a term appears in a document. For term t in document d:
$$\text{TF}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}$$
where f_{t,d} is the raw count of term t in document d. This normalization ensures TF values sum to 1 within each document.
Inverse Document Frequency (IDF):
Measures how rare or common a term is across the entire corpus:
$$\text{IDF}(t) = \log \frac{N}{\text{df}(t)}$$
where N is the total number of documents and df(t) is the number of documents containing term t. Rare terms have high IDF; common terms have low (or even negative) IDF.
TF-IDF:
The product captures importance: terms that are frequent in a specific document but rare across the corpus receive the highest scores:
$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$$
| Scenario | TF | IDF | TF-IDF | Interpretation |
|---|---|---|---|---|
| 'machine' in ML paper | High (10 occurrences) | High (rare in general) | Very High | Highly discriminative for this doc |
| 'the' in any doc | High (frequent) | Very Low (in all docs) | Low | Not discriminative |
| 'zebra' in ML paper | Very Low (0-1) | Very High (rare) | Low | Rare but not characteristic of this doc |
| 'algorithm' in ML paper | Medium | Medium | Medium-High | Somewhat discriminative |
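To make the interplay concrete, here is a minimal sketch that computes TF, IDF, and TF-IDF directly from the formulas above; the toy corpus and term choices are illustrative, not part of this page's examples.

```python
import math
from collections import Counter

# Toy corpus: three tokenized documents (illustrative only)
corpus = [
    "the machine learning model trains the network".split(),
    "the cat sat on the mat".split(),
    "the network optimizes the gradient".split(),
]
N = len(corpus)

# Document frequency: number of documents containing each term
df = Counter()
for doc in corpus:
    df.update(set(doc))

def tf(term, doc):
    """Normalized term frequency: count of term / total tokens in doc."""
    return doc.count(term) / len(doc)

def idf(term):
    """Standard IDF: log(N / df(t))."""
    return math.log(N / df[term])

doc = corpus[0]
for term in ["the", "machine"]:
    score = tf(term, doc) * idf(term)
    print(f"{term!r}: TF={tf(term, doc):.3f}, IDF={idf(term):.3f}, TF-IDF={score:.3f}")

# 'the' appears in every document, so IDF = log(3/3) = 0 and its TF-IDF is 0;
# 'machine' appears in only one document, so it receives a positive weight.
```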
There are several IDF variants. Smooth IDF adds 1 to both the document count and the document frequency (as if one extra document contained every term), which avoids division by zero for unseen terms: log((N + 1) / (df(t) + 1)) + 1. Probabilistic IDF uses log((N - df(t)) / df(t)) for a more theoretically motivated formulation. Most libraries, including scikit-learn, use smooth IDF by default.
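The variants behave quite differently as a term becomes more common. The following minimal sketch compares the three formulations side by side; the corpus size and document-frequency values are illustrative.

```python
import math

N = 1000  # total number of documents (illustrative)

for df in [1, 10, 500, 900]:
    standard = math.log(N / df)
    smooth = math.log((N + 1) / (df + 1)) + 1        # sklearn-style smooth IDF
    probabilistic = math.log((N - df) / df)
    print(f"df={df:>4}: standard={standard:7.3f}  "
          f"smooth={smooth:7.3f}  probabilistic={probabilistic:7.3f}")

# Standard and smooth IDF stay non-negative here; probabilistic IDF turns
# negative once a term appears in more than half the documents (df > N/2).
```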
TF-IDF weighted Word2Vec replaces the uniform average with a weighted average:
$$\mathbf{v}_D = \frac{\sum_{w \in D} \text{TF-IDF}(w, D) \cdot \mathbf{v}_w}{\sum_{w \in D} \text{TF-IDF}(w, D)}$$
Equivalently, using normalized weights:
$$\mathbf{v}_D = \sum_{w \in D} \alpha_w \cdot \mathbf{v}_w$$
where: $$\alpha_w = \frac{\text{TF-IDF}(w, D)}{\sum_{w' \in D} \text{TF-IDF}(w', D)}$$
Key insight: the weights α_w now depend on two factors: how often a word appears in this particular document (the TF component) and how rare the word is across the corpus (the IDF component).
```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from typing import List, Dict


class TFIDFWeightedWord2Vec:
    """
    Document embeddings via TF-IDF weighted averaging of word vectors.
    """

    def __init__(self, word_vectors, embedding_dim: int = 200):
        """
        Args:
            word_vectors: Trained word embedding model (gensim or dict)
            embedding_dim: Dimension of embeddings
        """
        self.word_vectors = word_vectors
        self.embedding_dim = embedding_dim
        self.tfidf_vectorizer = None
        self.vocabulary_ = None
        self.idf_ = None

    def fit(self, documents: List[str]):
        """
        Fit TF-IDF on the corpus.

        Args:
            documents: List of raw text documents (not tokenized)
        """
        self.tfidf_vectorizer = TfidfVectorizer(
            lowercase=True,
            sublinear_tf=True,  # Use log(1 + tf) instead of raw tf
            norm=None  # Don't L2 normalize—we normalize embeddings instead
        )
        self.tfidf_vectorizer.fit(documents)

        # Store vocabulary and IDF values for efficient lookup
        self.vocabulary_ = self.tfidf_vectorizer.vocabulary_
        self.idf_ = dict(zip(
            self.tfidf_vectorizer.get_feature_names_out(),
            self.tfidf_vectorizer.idf_
        ))
        return self

    def get_document_vector(self, document: str) -> np.ndarray:
        """
        Compute TF-IDF weighted average embedding for a document.

        Args:
            document: Raw text string

        Returns:
            Weighted average embedding
        """
        if self.tfidf_vectorizer is None:
            raise ValueError("Must call fit() before transform()")

        # Get TF-IDF representation
        tfidf_vector = self.tfidf_vectorizer.transform([document])

        # Get feature names and their TF-IDF values
        feature_names = self.tfidf_vectorizer.get_feature_names_out()
        tfidf_array = tfidf_vector.toarray()[0]

        # Compute weighted average
        weighted_sum = np.zeros(self.embedding_dim)
        total_weight = 0.0

        for word, weight in zip(feature_names, tfidf_array):
            if weight > 0 and word in self.word_vectors:
                weighted_sum += weight * self.word_vectors[word]
                total_weight += weight

        if total_weight > 0:
            return weighted_sum / total_weight
        else:
            return np.zeros(self.embedding_dim)

    def transform(self, documents: List[str]) -> np.ndarray:
        """
        Transform documents to TF-IDF weighted embeddings.

        Args:
            documents: List of raw text documents

        Returns:
            Matrix of shape (n_documents, embedding_dim)
        """
        return np.array([self.get_document_vector(doc) for doc in documents])

    def fit_transform(self, documents: List[str]) -> np.ndarray:
        """Fit and transform in one step."""
        return self.fit(documents).transform(documents)


# Example usage
documents = [
    "The neural network processes data efficiently through multiple hidden layers.",
    "Deep learning algorithms optimize gradient descent for better convergence.",
    "The cat sat on the mat near the window overlooking the garden.",
]

# Create weighted embedder
# (word_vectors: a pre-trained embedding model, e.g. gensim KeyedVectors, assumed loaded)
weighted_embedder = TFIDFWeightedWord2Vec(word_vectors)
doc_embeddings = weighted_embedder.fit_transform(documents)

print(f"Document embeddings shape: {doc_embeddings.shape}")

# Compare with simple average
from sklearn.metrics.pairwise import cosine_similarity

print("Document similarity matrix (TF-IDF weighted):")
print(cosine_similarity(doc_embeddings).round(3))
```

The IDF values depend on the corpus used for fitting. If you fit on a general corpus and apply to domain-specific documents (or vice versa), the weights may be misleading. For best results, fit TF-IDF on a corpus representative of your target domain and use case.
There are several variations in how to compute and apply TF-IDF weights. Each has different properties:
1. Sublinear TF scaling:
Instead of raw term frequency, use logarithmic scaling:
$$\text{TF}_{\text{sublinear}}(t, d) = 1 + \log(f_{t,d}) \quad \text{if } f_{t,d} > 0$$
This reduces the impact of highly repeated words. A word appearing 100 times doesn't contribute 100x more than one appearing once; the relationship is logarithmic (1 + ln 100 ≈ 5.6).
2. Binary TF:
Simplest variant—just check presence/absence:
$$\text{TF}_{\text{binary}}(t, d) = \mathbf{1}[t \in d]$$
Useful when word presence matters more than frequency.
3. IDF-only weighting:
Some applications use only IDF weights (ignoring TF), which gives corpus-level importance:
$$\mathbf{v}_D = \frac{\sum_{w \in D} \text{IDF}(w) \cdot \mathbf{v}_w}{\sum_{w \in D} \text{IDF}(w)}$$
```python
import numpy as np
from collections import Counter
from typing import List, Dict


def compute_idf(documents: List[List[str]]) -> Dict[str, float]:
    """
    Compute IDF values from a corpus of tokenized documents.
    Uses smooth IDF: log((N + 1) / (df + 1)) + 1
    """
    N = len(documents)
    df = Counter()

    for doc in documents:
        # Count each word once per document (document frequency)
        unique_words = set(doc)
        df.update(unique_words)

    idf = {}
    for word, doc_freq in df.items():
        # Smooth IDF with additive smoothing
        idf[word] = np.log((N + 1) / (doc_freq + 1)) + 1

    return idf


def weighted_average_embedding(
    words: List[str],
    word_vectors,
    idf_values: Dict[str, float],
    tf_scheme: str = 'sublinear',  # 'raw', 'binary', 'sublinear'
    use_tf: bool = True,           # If False, use IDF-only
    normalize_output: bool = True
) -> np.ndarray:
    """
    Compute weighted average with configurable TF and IDF schemes.
    """
    embedding_dim = word_vectors.vector_size
    word_counts = Counter(words)

    weighted_sum = np.zeros(embedding_dim)
    total_weight = 0.0

    for word, count in word_counts.items():
        if word not in word_vectors:
            continue

        # Compute TF component
        if tf_scheme == 'raw':
            tf = count / len(words)
        elif tf_scheme == 'binary':
            tf = 1.0
        elif tf_scheme == 'sublinear':
            tf = 1 + np.log(count)
        else:
            raise ValueError(f"Unknown TF scheme: {tf_scheme}")

        # Get IDF component (default to 1.0 for unknown words)
        idf = idf_values.get(word, 1.0)

        # Compute weight
        if use_tf:
            weight = tf * idf
        else:
            weight = idf  # IDF-only

        weighted_sum += weight * word_vectors[word]
        total_weight += weight

    if total_weight > 0:
        result = weighted_sum / total_weight
    else:
        result = weighted_sum

    if normalize_output:
        norm = np.linalg.norm(result)
        if norm > 0:
            result = result / norm

    return result


# Comparison of different schemes
def compare_weighting_schemes(doc: List[str], word_vectors, idf_values):
    """Compare embeddings from different TF-IDF schemes."""
    schemes = [
        {'tf_scheme': 'raw', 'use_tf': True, 'label': 'Raw TF × IDF'},
        {'tf_scheme': 'sublinear', 'use_tf': True, 'label': 'Sublinear TF × IDF'},
        {'tf_scheme': 'binary', 'use_tf': True, 'label': 'Binary TF × IDF'},
        {'tf_scheme': 'sublinear', 'use_tf': False, 'label': 'IDF only'},
    ]

    embeddings = {}
    for scheme in schemes:
        label = scheme.pop('label')
        embeddings[label] = weighted_average_embedding(
            doc, word_vectors, idf_values, **scheme
        )
        scheme['label'] = label  # Restore for next iteration

    # Show pairwise similarities
    labels = list(embeddings.keys())
    vectors = np.array(list(embeddings.values()))

    from sklearn.metrics.pairwise import cosine_similarity
    sim_matrix = cosine_similarity(vectors)

    print("Similarity between different weighting schemes:")
    for i, label_i in enumerate(labels):
        for j, label_j in enumerate(labels):
            if i < j:
                print(f"  {label_i} vs {label_j}: {sim_matrix[i,j]:.4f}")

    return embeddings


# Build IDF from corpus
corpus = [
    "machine learning algorithms process data".split(),
    "neural networks learn patterns from data".split(),
    "deep learning uses gradient optimization".split(),
    "the cat sat on the mat".split(),
]

idf = compute_idf(corpus)
print("IDF values for key terms:")
for word in ['machine', 'learning', 'the', 'data', 'cat']:
    print(f"  '{word}': {idf.get(word, 0):.4f}")

# Compare schemes on a document
# (word_vectors: pre-trained embeddings, e.g. gensim KeyedVectors, assumed loaded)
test_doc = "machine learning algorithms process the data efficiently".split()
compare_weighting_schemes(test_doc, word_vectors, idf)
```

For most applications, sublinear TF with smooth IDF is the best default. Sublinear scaling prevents very frequent words from dominating, while smooth IDF handles the rare case of words not seen during training.
Smooth IDF matches scikit-learn's TfidfVectorizer default (smooth_idf=True); sublinear TF, however, is not the sklearn default and must be enabled explicitly with sublinear_tf=True, as in the first code example above.
A challenge arises when documents contain words not seen during IDF fitting. There are several strategies:
1. Default IDF value:
Assign a default IDF (often max IDF or average IDF) to unknown words:
$$\text{IDF}_{\text{default}} = \max_{w \in V} \text{IDF}(w)$$
Rationale: Unknown words are likely rare (hence high IDF).
2. Zero weight:
Simply ignore words without IDF values (equivalent to setting their weight to 0). This restricts the effective vocabulary to the intersection of the TF-IDF vocabulary and the Word2Vec vocabulary.
3. Uniform weight (IDF = 1):
Treat unknown words as having IDF = 1, so their weight equals their TF. This is a neutral assumption.
4. OOV-specific IDF:
Estimate IDF for unseen words using a global OOV rate or by treating all OOV words as a single pseudo-word (a sketch of the pseudo-word approach follows below).
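The class below covers the first three strategies; the pseudo-word idea is sketched separately here. In this minimal sketch (the `OOV_TOKEN` symbol and function name are hypothetical), every token outside the fitted vocabulary is mapped to a single pseudo-word whose document frequency is counted like any other term:

```python
import numpy as np
from collections import Counter
from typing import Dict, List

OOV_TOKEN = "<OOV>"  # hypothetical pseudo-word standing in for all OOV terms


def compute_idf_with_oov(documents: List[List[str]], vocabulary: set) -> Dict[str, float]:
    """Smooth IDF over the known vocabulary, plus one shared IDF for OOV tokens.

    Any word outside `vocabulary` is replaced by OOV_TOKEN before counting
    document frequencies, so the pseudo-word's IDF reflects how often *some*
    unknown word shows up in a document.
    """
    N = len(documents)
    df = Counter()
    for doc in documents:
        mapped = {w if w in vocabulary else OOV_TOKEN for w in doc}
        df.update(mapped)
    return {w: np.log((N + 1) / (c + 1)) + 1 for w, c in df.items()}


# Usage sketch: unknown words fall back to the pseudo-word's IDF
# idf = compute_idf_with_oov(tokenized_corpus, known_vocab)
# weight = idf.get(word, idf.get(OOV_TOKEN, 1.0))
```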
```python
import numpy as np
from collections import Counter
from typing import List, Dict, Optional


class RobustTFIDFWeightedW2V:
    """
    TF-IDF weighted Word2Vec with robust OOV handling.
    """

    def __init__(
        self,
        word_vectors,
        embedding_dim: int = 200,
        oov_idf_strategy: str = 'max',  # 'max', 'mean', 'fixed', 'zero'
        fixed_oov_idf: float = 1.0
    ):
        """
        Args:
            word_vectors: Word embedding model
            embedding_dim: Dimension of embeddings
            oov_idf_strategy: How to handle words not in IDF vocabulary
            fixed_oov_idf: IDF value for OOV words when using 'fixed' strategy
        """
        self.word_vectors = word_vectors
        self.embedding_dim = embedding_dim
        self.oov_idf_strategy = oov_idf_strategy
        self.fixed_oov_idf = fixed_oov_idf
        self.idf_ = None
        self.max_idf_ = None
        self.mean_idf_ = None

    def fit(self, documents: List[List[str]]):
        """Compute IDF values from a tokenized corpus."""
        N = len(documents)
        df = Counter()
        for doc in documents:
            df.update(set(doc))

        self.idf_ = {
            word: np.log((N + 1) / (count + 1)) + 1
            for word, count in df.items()
        }

        # Precompute OOV fallback values
        idf_values = list(self.idf_.values())
        self.max_idf_ = max(idf_values) if idf_values else 1.0
        self.mean_idf_ = np.mean(idf_values) if idf_values else 1.0

        return self

    def _get_idf(self, word: str) -> float:
        """Get IDF for a word, handling OOV cases."""
        if word in self.idf_:
            return self.idf_[word]

        # OOV handling
        if self.oov_idf_strategy == 'max':
            return self.max_idf_
        elif self.oov_idf_strategy == 'mean':
            return self.mean_idf_
        elif self.oov_idf_strategy == 'fixed':
            return self.fixed_oov_idf
        elif self.oov_idf_strategy == 'zero':
            return 0.0
        else:
            raise ValueError(f"Unknown OOV strategy: {self.oov_idf_strategy}")

    def get_document_vector(
        self,
        words: List[str],
        tf_scheme: str = 'sublinear'
    ) -> np.ndarray:
        """Compute weighted embedding for tokenized document."""
        if self.idf_ is None:
            raise ValueError("Must call fit() first")

        word_counts = Counter(words)
        weighted_sum = np.zeros(self.embedding_dim)
        total_weight = 0.0

        for word, count in word_counts.items():
            # Skip words not in word vector vocabulary
            if word not in self.word_vectors:
                continue

            # Compute TF
            if tf_scheme == 'sublinear':
                tf = 1 + np.log(count)
            elif tf_scheme == 'raw':
                tf = count / len(words)
            else:
                tf = 1.0

            # Get IDF (with OOV handling)
            idf = self._get_idf(word)

            weight = tf * idf
            weighted_sum += weight * self.word_vectors[word]
            total_weight += weight

        if total_weight > 0:
            return weighted_sum / total_weight
        return np.zeros(self.embedding_dim)

    def transform(self, documents: List[List[str]]) -> np.ndarray:
        """Transform list of tokenized documents."""
        return np.array([self.get_document_vector(doc) for doc in documents])


# Compare OOV strategies
def compare_oov_strategies(test_doc, train_corpus, word_vectors):
    """Compare different OOV IDF handling strategies."""
    strategies = ['max', 'mean', 'fixed', 'zero']
    results = {}

    for strategy in strategies:
        embedder = RobustTFIDFWeightedW2V(
            word_vectors,
            oov_idf_strategy=strategy,
            fixed_oov_idf=1.0
        )
        embedder.fit(train_corpus)

        # Find OOV words
        oov_words = [w for w in test_doc if w not in embedder.idf_]

        vec = embedder.get_document_vector(test_doc)
        results[strategy] = {
            'vector': vec,
            'norm': np.linalg.norm(vec),
            'oov_count': len(oov_words)
        }

    print(f"Test document has {results['max']['oov_count']} OOV words")
    print("Embedding norms by strategy:")
    for strategy, data in results.items():
        print(f"  {strategy}: {data['norm']:.4f}")

    return results
```

For production systems, using 'max' or 'mean' IDF for OOV words works well.
The 'max' strategy assumes unknown words are informative (reasonable for domain-specific terms), while 'mean' is a conservative middle ground. Avoid 'zero' unless you specifically want to ignore OOV words entirely.
When and how much does TF-IDF weighting help over simple averaging? The answer depends on document characteristics and task requirements.
| Scenario | Simple Avg | TF-IDF Weighted | Reason |
|---|---|---|---|
| Long documents with repeated keywords | Worse | Better | Keyword repetition gets logarithmic (not linear) boost |
| Documents with many stop words | Worse | Better | Stop words get low IDF → low weight |
| Comparing documents of different lengths | Worse | Better | Normalization handles length variation better |
| Short, focused documents | Similar | Similar | Few words → weighting has less impact |
| Highly technical domains | Worse | Better | Technical terms get high IDF → high weight |
| Informal text (tweets, comments) | Similar | Similar | Less vocabulary variation, shorter text |
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from collections import Counter


def empirical_comparison(
    documents: list,
    word_vectors,
    labels: list = None
):
    """
    Empirically compare simple average vs TF-IDF weighted embeddings.
    """
    # Compute IDF
    tokenized_docs = [doc.lower().split() for doc in documents]
    N = len(tokenized_docs)
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))
    idf = {w: np.log((N + 1) / (c + 1)) + 1 for w, c in df.items()}

    simple_embeddings = []
    weighted_embeddings = []

    for doc_words in tokenized_docs:
        # Simple average
        simple_vecs = [word_vectors[w] for w in doc_words if w in word_vectors]
        if simple_vecs:
            simple_emb = np.mean(simple_vecs, axis=0)
            simple_emb = simple_emb / (np.linalg.norm(simple_emb) + 1e-9)
        else:
            simple_emb = np.zeros(word_vectors.vector_size)

        # TF-IDF weighted
        word_counts = Counter(doc_words)
        weighted_sum = np.zeros(word_vectors.vector_size)
        total_weight = 0
        for word, count in word_counts.items():
            if word in word_vectors:
                tf = 1 + np.log(count)
                w_idf = idf.get(word, 1.0)
                weight = tf * w_idf
                weighted_sum += weight * word_vectors[word]
                total_weight += weight

        if total_weight > 0:
            weighted_emb = weighted_sum / total_weight
            weighted_emb = weighted_emb / (np.linalg.norm(weighted_emb) + 1e-9)
        else:
            weighted_emb = np.zeros(word_vectors.vector_size)

        simple_embeddings.append(simple_emb)
        weighted_embeddings.append(weighted_emb)

    simple_mat = np.array(simple_embeddings)
    weighted_mat = np.array(weighted_embeddings)

    # Compute similarity matrices
    simple_sim = cosine_similarity(simple_mat)
    weighted_sim = cosine_similarity(weighted_mat)

    # How different are the similarities?
    diff = np.abs(simple_sim - weighted_sim)
    upper_indices = np.triu_indices_from(diff, k=1)
    mean_diff = diff[upper_indices].mean()
    max_diff = diff[upper_indices].max()

    print(f"Mean difference in similarities: {mean_diff:.4f}")
    print(f"Max difference in similarities: {max_diff:.4f}")

    # Show specific comparisons if labels provided
    if labels:
        print("Pairwise similarities:")
        print(f"{'Pair':<30} {'Simple':<10} {'Weighted':<10} {'Diff'}")
        print("-" * 60)
        for i in range(len(documents)):
            for j in range(i + 1, len(documents)):
                label = f"{labels[i][:12]} vs {labels[j][:12]}"
                print(f"{label:<30} {simple_sim[i,j]:.4f} {weighted_sim[i,j]:.4f} {diff[i,j]:.4f}")

    return simple_mat, weighted_mat


# Example with contrasting documents
documents = [
    "The machine learning algorithm processes data through neural network layers efficiently",
    "Deep neural networks learn patterns from data using backpropagation optimization",
    "The cat sat on the mat while the dog slept on the rug in the house",
    "Python programming language is used for machine learning and data science",
]

labels = ["ML Algo", "Deep Learning", "Cat/Dog", "Python ML"]

simple, weighted = empirical_comparison(documents, word_vectors, labels)
```

In text classification and semantic similarity tasks, TF-IDF weighting typically improves over simple averaging by 2-5% in accuracy or correlation. The gain is most pronounced for longer documents with varied vocabulary. For short texts (tweets, queries), the improvement is minimal.
For production systems processing millions of documents, efficiency matters. Here are optimizations for the TF-IDF weighted approach:
```python
import numpy as np
from collections import Counter
from typing import List, Dict
from concurrent.futures import ThreadPoolExecutor


class EfficientTFIDFWord2Vec:
    """
    Optimized TF-IDF weighted Word2Vec for production scale.
    """

    def __init__(
        self,
        word_vectors,
        embedding_dim: int = 200,
        batch_size: int = 1000,
        n_workers: int = 4
    ):
        self.embedding_dim = embedding_dim
        self.batch_size = batch_size
        self.n_workers = n_workers

        # Convert word vectors to numpy array for faster access
        self.vocab = list(word_vectors.key_to_index.keys())
        self.word_to_idx = {w: i for i, w in enumerate(self.vocab)}
        self.embedding_matrix = np.array([
            word_vectors[w] for w in self.vocab
        ])

        self.idf_ = None
        self.idf_array_ = None  # Aligned with vocab for vectorized ops

    def fit(self, tokenized_documents: List[List[str]]):
        """Compute IDF from corpus."""
        N = len(tokenized_documents)
        df = Counter()
        for doc in tokenized_documents:
            df.update(set(doc))

        # Store IDF as dict
        self.idf_ = {
            word: np.log((N + 1) / (count + 1)) + 1
            for word, count in df.items()
        }

        # Create aligned IDF array for vectorized operations
        self.idf_array_ = np.array([
            self.idf_.get(w, 1.0) for w in self.vocab
        ])

        return self

    def _process_single(self, words: List[str]) -> np.ndarray:
        """Process a single document (optimized)."""
        word_counts = Counter(words)

        # Get indices of words in vocab
        valid_words = [w for w in word_counts if w in self.word_to_idx]
        if not valid_words:
            return np.zeros(self.embedding_dim)

        indices = np.array([self.word_to_idx[w] for w in valid_words])
        counts = np.array([word_counts[w] for w in valid_words])

        # Sublinear TF
        tf = 1 + np.log(counts)

        # Get IDF values (already aligned)
        idf_vals = self.idf_array_[indices]

        # Weights
        weights = tf * idf_vals

        # Weighted sum using matrix indexing
        embeddings = self.embedding_matrix[indices]  # (n_words, dim)
        weighted_sum = np.sum(embeddings * weights[:, np.newaxis], axis=0)

        return weighted_sum / weights.sum()

    def transform(self, tokenized_documents: List[List[str]]) -> np.ndarray:
        """Transform documents with parallel processing."""
        results = []

        # Process in batches for memory efficiency
        for batch_start in range(0, len(tokenized_documents), self.batch_size):
            batch = tokenized_documents[batch_start:batch_start + self.batch_size]

            # Parallel processing within batch
            with ThreadPoolExecutor(max_workers=self.n_workers) as executor:
                batch_results = list(executor.map(self._process_single, batch))

            results.extend(batch_results)

        return np.array(results)

    def transform_single(self, words: List[str]) -> np.ndarray:
        """Transform a single document (for online inference)."""
        return self._process_single(words)


# Memory-efficient streaming for very large corpora
class StreamingTFIDFWord2Vec:
    """
    Streaming version that doesn't load full corpus into memory.
    Computes IDF in a streaming fashion.
    """

    def __init__(self, word_vectors, embedding_dim: int = 200):
        self.word_vectors = word_vectors
        self.embedding_dim = embedding_dim
        self.df_ = Counter()
        self.n_docs_ = 0
        self.idf_ = None

    def partial_fit(self, documents: List[List[str]]):
        """Update IDF counts with new documents."""
        self.n_docs_ += len(documents)
        for doc in documents:
            self.df_.update(set(doc))
        return self

    def finalize(self):
        """Compute final IDF values after all partial fits."""
        self.idf_ = {
            word: np.log((self.n_docs_ + 1) / (count + 1)) + 1
            for word, count in self.df_.items()
        }
        return self

    # ... transform methods same as before


# Benchmark
def benchmark_implementations(documents, word_vectors, n_repeats=3):
    """Compare implementation speeds."""
    import time

    # Standard implementation
    standard = TFIDFWeightedWord2Vec(word_vectors)

    # Efficient implementation
    efficient = EfficientTFIDFWord2Vec(word_vectors)

    # Tokenize
    tokenized = [doc.lower().split() for doc in documents]

    # Fit both
    standard.fit(documents)
    efficient.fit(tokenized)

    # Benchmark transform
    for impl_name, impl, data in [
        ('Standard', standard, documents),
        ('Efficient', efficient, tokenized)
    ]:
        times = []
        for _ in range(n_repeats):
            start = time.time()
            impl.transform(data)
            times.append(time.time() - start)

        print(f"{impl_name}: {np.mean(times):.4f}s (±{np.std(times):.4f}s)")
```

The main optimizations are: (1) precompute an embedding matrix aligned with the vocabulary for vectorized lookups, (2) use numpy operations instead of Python loops, (3) parallelize batch transforms, and (4) compute IDF in a streaming fashion for memory efficiency with large corpora. These can speed up processing by 10-50x for large datasets.
TF-IDF weighted Word2Vec integrates smoothly with scikit-learn pipelines for end-to-end training and inference:
```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, GridSearchCV
from collections import Counter


class TFIDFWord2VecTransformer(BaseEstimator, TransformerMixin):
    """
    Sklearn-compatible TF-IDF weighted Word2Vec transformer.
    """

    def __init__(
        self,
        word_vectors,
        tf_scheme: str = 'sublinear',
        normalize: bool = True,
        min_df: int = 1
    ):
        """
        Args:
            word_vectors: Word embedding model
            tf_scheme: 'raw', 'sublinear', or 'binary'
            normalize: L2-normalize output embeddings
            min_df: Minimum document frequency for IDF
        """
        self.word_vectors = word_vectors
        self.tf_scheme = tf_scheme
        self.normalize = normalize
        self.min_df = min_df
        self.embedding_dim_ = word_vectors.vector_size
        self.idf_ = None

    def fit(self, X, y=None):
        """Compute IDF from training documents."""
        N = len(X)
        df = Counter()

        for doc in X:
            words = doc if isinstance(doc, list) else doc.split()
            df.update(set(words))

        # Filter by min_df and compute IDF
        self.idf_ = {
            word: np.log((N + 1) / (count + 1)) + 1
            for word, count in df.items()
            if count >= self.min_df
        }

        return self

    def transform(self, X):
        """Transform documents to embeddings."""
        embeddings = []

        for doc in X:
            words = doc if isinstance(doc, list) else doc.split()
            word_counts = Counter(words)

            weighted_sum = np.zeros(self.embedding_dim_)
            total_weight = 0

            for word, count in word_counts.items():
                if word not in self.word_vectors:
                    continue

                # TF
                if self.tf_scheme == 'sublinear':
                    tf = 1 + np.log(count)
                elif self.tf_scheme == 'binary':
                    tf = 1.0
                else:  # raw
                    tf = count / len(words)

                # IDF
                idf = self.idf_.get(word, 1.0)

                weight = tf * idf
                weighted_sum += weight * self.word_vectors[word]
                total_weight += weight

            if total_weight > 0:
                embedding = weighted_sum / total_weight
            else:
                embedding = np.zeros(self.embedding_dim_)

            if self.normalize:
                norm = np.linalg.norm(embedding)
                if norm > 0:
                    embedding = embedding / norm

            embeddings.append(embedding)

        return np.array(embeddings)


# Create complete classification pipeline
def create_classification_pipeline(word_vectors):
    """
    Create a complete text classification pipeline using
    TF-IDF weighted Word2Vec embeddings.
    """
    return Pipeline([
        ('tfidf_w2v', TFIDFWord2VecTransformer(
            word_vectors,
            tf_scheme='sublinear',
            normalize=True
        )),
        ('classifier', LogisticRegression(
            max_iter=1000,
            solver='lbfgs'
        ))
    ])


# Hyperparameter tuning
def tune_pipeline(X_train, y_train, word_vectors):
    """Grid search over embedding and classifier parameters."""
    pipeline = create_classification_pipeline(word_vectors)

    param_grid = {
        'tfidf_w2v__tf_scheme': ['sublinear', 'binary', 'raw'],
        'tfidf_w2v__min_df': [1, 2, 5],
        'classifier__C': [0.1, 1.0, 10.0],
        'classifier__penalty': ['l2'],
    }

    grid_search = GridSearchCV(
        pipeline, param_grid,
        cv=5, scoring='accuracy',
        n_jobs=-1, verbose=1
    )
    grid_search.fit(X_train, y_train)

    print(f"Best parameters: {grid_search.best_params_}")
    print(f"Best CV score: {grid_search.best_score_:.4f}")

    return grid_search.best_estimator_


# Usage example (assumes X_train, y_train, X_test are defined)
# X_train = ["document one", "document two", ...]
# y_train = [0, 1, ...]

pipeline = create_classification_pipeline(word_vectors)
scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(f"CV Accuracy: {scores.mean():.3f} (+/- {scores.std()*2:.3f})")

# Fit final model
pipeline.fit(X_train, y_train)

# Save for production
import joblib
joblib.dump(pipeline, 'tfidf_w2v_classifier.joblib')

# Load and predict
loaded_pipeline = joblib.load('tfidf_w2v_classifier.joblib')
predictions = loaded_pipeline.predict(X_test)
```

TF-IDF weighted Word2Vec sits between simple averaging and more complex methods. Here's guidance on when it's the right choice:
Start with TF-IDF weighting as your default for any document-level task (classification, clustering, retrieval). Only fall back to simple averaging if (1) TF-IDF provides no measurable improvement, (2) you're constrained on preprocessing complexity, or (3) your documents are extremely short. The computational overhead of TF-IDF is minimal.
We've explored the principled combination of TF-IDF weighting with Word2Vec embeddings. The key ideas to consolidate:

- TF-IDF assigns each word a weight that is high when the word is frequent in a document but rare in the corpus, and low for ubiquitous words such as stop words.
- Replacing the uniform average of word vectors with a TF-IDF weighted average lets informative words dominate the document embedding.
- Sublinear TF with smooth IDF is a strong default; binary TF and IDF-only weighting are useful variations.
- OOV words need an explicit IDF fallback; 'max' or 'mean' IDF works well in practice, while 'zero' silently discards them.
- The weighting helps most for longer documents with many stop words or technical vocabulary, and adds little for very short texts.
- Vectorized lookups, batching, and streaming IDF computation make the approach efficient at scale, and an sklearn-compatible transformer drops it straight into classification pipelines.
What's next:
Having mastered the combination of TF-IDF with Word2Vec for document representation, we now turn to fundamentally different approaches to word embeddings. The next pages cover GloVe (Global Vectors for Word Representation)—which learns embeddings through global matrix factorization rather than local context prediction—and FastText—which extends Word2Vec to handle subword information for better generalization.
You now have a comprehensive understanding of TF-IDF weighted Word2Vec—from the mathematical formulation through implementation variations to production-ready sklearn integration. This technique provides a meaningful improvement over simple averaging at minimal additional cost, making it an excellent default for document-level NLP tasks.