Training high-quality word embeddings requires massive amounts of text and significant computational resources. The original Word2Vec vectors were trained on a Google News corpus of roughly 100 billion words; GloVe's best models were trained on 840 billion tokens from Common Crawl. For most practitioners, replicating this is neither feasible nor necessary.
Pre-trained embeddings are word vectors trained by research teams on enormous corpora and released publicly. Using them, you inherit the semantic knowledge from billions of words of training without the cost. This is often called transfer learning for NLP—leveraging representations learned on one task (language modeling) for another (your specific application).
This page covers the practical art of selecting, loading, evaluating, and adapting pre-trained embeddings for your specific needs.
By the end of this page, you will understand:

- the landscape of available pre-trained embeddings,
- selection criteria for choosing the right embeddings,
- practical loading and integration techniques,
- domain adaptation strategies when pre-trained embeddings don't quite fit, and
- principled decision-making on when to train custom embeddings.
Several organizations have released high-quality pre-trained word embeddings. Understanding what's available helps you make informed choices.
| Name | Type | Training Data | Vocab Size | Dimensions | Notable Features |
|---|---|---|---|---|---|
| Google Word2Vec | Word2Vec | Google News (100B words) | 3M | 300 | Classic; widely used baseline |
| GloVe 6B | GloVe | Wikipedia + Gigaword | 400K | 50, 100, 200, 300 | Clean, balanced corpus |
| GloVe 42B | GloVe | Common Crawl (42B tokens) | 1.9M | 300 | Larger vocabulary |
| GloVe 840B | GloVe | Common Crawl (840B tokens) | 2.2M | 300 | Largest GloVe; best quality |
| GloVe Twitter | GloVe | Twitter (27B tokens) | 1.2M | 25, 50, 100, 200 | Informal language, hashtags |
| FastText Wiki | FastText | Wikipedia (per language) | ~1M | 300 | 157 languages; subword support |
| FastText CC | FastText | Common Crawl + Wiki | ~2M | 300 | 157 languages; best coverage |
| ConceptNet Numberbatch | Ensemble | Multiple sources | 516K | 300 | Knowledge-enhanced; debiased |
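Most of the GloVe, Word2Vec, and FastText variants in this table can be fetched through gensim's downloader. As a quick, illustrative sketch (assuming gensim is installed and you have network access), you can list exactly which packaged models your gensim version offers:

```python
import gensim.downloader as api

# Print the names of the pre-trained models gensim can download
for name in sorted(api.info()['models'].keys()):
    print(name)
```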
This module covers static word embeddings (one vector per word). Modern NLP increasingly uses contextualized embeddings like ELMo, BERT, and GPT, where the same word gets different vectors depending on context. These are covered in the Transformers chapter. Static embeddings remain valuable for efficiency-critical applications and as input features to larger models.
Selecting appropriate pre-trained embeddings requires matching the embedding properties to your task requirements:
```python
import numpy as np
from typing import List, Dict, Tuple
from collections import Counter

def evaluate_vocabulary_coverage(
    word_vectors,
    domain_corpus: List[str],
    min_frequency: int = 5
) -> Dict[str, float]:
    """
    Evaluate how well pre-trained embeddings cover your domain vocabulary.

    Args:
        word_vectors: Pre-trained word vectors (gensim or dict)
        domain_corpus: List of documents from your domain
        min_frequency: Only consider words appearing at least this often

    Returns:
        Coverage statistics
    """
    # Tokenize and count
    all_words = []
    for doc in domain_corpus:
        all_words.extend(doc.lower().split())

    word_counts = Counter(all_words)
    total_tokens = sum(word_counts.values())

    # Vocabulary (words meeting min frequency)
    domain_vocab = {w for w, c in word_counts.items() if c >= min_frequency}

    # Check coverage
    covered = sum(1 for w in domain_vocab if w in word_vectors)
    covered_tokens = sum(word_counts[w] for w in domain_vocab if w in word_vectors)

    # Find OOV words with high frequency (most impactful gaps)
    oov_words = [(w, word_counts[w]) for w in domain_vocab if w not in word_vectors]
    oov_words.sort(key=lambda x: -x[1])

    return {
        'total_tokens': total_tokens,
        'unique_words': len(domain_vocab),
        'covered_words': covered,
        'coverage_by_type': covered / len(domain_vocab) if domain_vocab else 0,
        'coverage_by_token': (covered_tokens / sum(word_counts[w] for w in domain_vocab)
                              if domain_vocab else 0),
        'top_oov': oov_words[:20],  # Most frequent OOV words
    }

def print_coverage_report(stats: Dict):
    """Pretty-print coverage statistics."""
    print("=" * 50)
    print("VOCABULARY COVERAGE REPORT")
    print("=" * 50)
    print(f"Total tokens in corpus: {stats['total_tokens']:,}")
    print(f"Unique words (min freq met): {stats['unique_words']:,}")
    print(f"Words in embeddings: {stats['covered_words']:,}")
    print(f"Coverage by type (unique words): {stats['coverage_by_type']:.1%}")
    print(f"Coverage by token (occurrences): {stats['coverage_by_token']:.1%}")
    if stats['top_oov']:
        print("Top OOV words (high frequency but missing):")
        for word, count in stats['top_oov'][:10]:
            print(f"  '{word}': {count} occurrences")

# Example usage
# domain_corpus = [...]               # load your domain documents
# word_vectors = load_glove_vectors() # or any embedding
stats = evaluate_vocabulary_coverage(word_vectors, domain_corpus, min_frequency=5)
print_coverage_report(stats)

# Decision guidance
if stats['coverage_by_token'] < 0.80:
    print("⚠️ Low coverage - consider:")
    print("  1. Using FastText (handles OOV)")
    print("  2. Training domain-specific embeddings")
    print("  3. Using larger pre-trained vocabulary")
elif stats['coverage_by_token'] < 0.95:
    print("✓ Good coverage - pre-trained embeddings should work well")
    print("  Consider handling critical OOV terms with fallbacks")
else:
    print("✅ Excellent coverage - use pre-trained embeddings confidently")
```

There are multiple ways to load pre-trained embeddings, each with trade-offs between convenience and control:
```python
import numpy as np
from typing import Dict, Tuple, Optional
import gensim.downloader as gensim_api
from gensim.models import KeyedVectors

# ============================================
# Method 1: Gensim Downloader (Easiest)
# ============================================

def load_via_gensim_api(name: str = 'glove-wiki-gigaword-100'):
    """
    Load embeddings via gensim's downloader API.
    Downloads automatically if not cached.

    Available models include:
        'word2vec-google-news-300'         # 1.6GB, Google News
        'glove-wiki-gigaword-50'           # 66MB
        'glove-wiki-gigaword-100'          # 128MB
        'glove-wiki-gigaword-200'          # 252MB
        'glove-wiki-gigaword-300'          # 376MB
        'glove-twitter-25'                 # 105MB
        'glove-twitter-50'                 # 200MB
        'glove-twitter-100'                # 387MB
        'glove-twitter-200'                # 758MB
        'fasttext-wiki-news-subwords-300'  # 958MB
    """
    print(f"Loading {name}...")
    model = gensim_api.load(name)
    print(f"Loaded {len(model)} vectors of dimension {model.vector_size}")
    return model

# ============================================
# Method 2: Load from File (More Control)
# ============================================

def load_glove_from_file(
    filepath: str,
    vocab_limit: Optional[int] = None
) -> Tuple[Dict[str, np.ndarray], int]:
    """
    Load GloVe vectors from a text file.

    Args:
        filepath: Path to .txt file (word vec1 vec2 ...)
        vocab_limit: Only load first N words (for memory)

    Returns:
        (word_to_vector dict, dimension)
    """
    embeddings = {}
    dim = None

    with open(filepath, 'r', encoding='utf-8') as f:
        for i, line in enumerate(f):
            if vocab_limit and i >= vocab_limit:
                break
            values = line.rstrip().split(' ')
            word = values[0]
            try:
                vector = np.array([float(x) for x in values[1:]], dtype=np.float32)
                if dim is None:
                    dim = len(vector)
                embeddings[word] = vector
            except ValueError:
                continue  # Skip malformed lines

            if (i + 1) % 100000 == 0:
                print(f"  Loaded {i+1:,} vectors...")

    print(f"Loaded {len(embeddings):,} vectors of dimension {dim}")
    return embeddings, dim

# ============================================
# Method 3: Load Word2Vec Binary Format
# ============================================

def load_word2vec_binary(filepath: str, limit: int = None):
    """
    Load Word2Vec binary format (e.g., Google News vectors).
    """
    model = KeyedVectors.load_word2vec_format(
        filepath,
        binary=True,
        limit=limit  # Load only first N words for memory
    )
    return model

# ============================================
# Method 4: Memory-Mapped Loading (Large Models)
# ============================================

def load_memory_mapped(filepath: str):
    """
    Load large models with memory mapping.
    Reduces RAM usage by loading from disk on-demand.
    """
    model = KeyedVectors.load(filepath, mmap='r')
    return model

# ============================================
# Converting Between Formats
# ============================================

def convert_glove_to_word2vec_format(glove_path: str, output_path: str):
    """
    Convert GloVe text format to Word2Vec format.
    Adds the header line required by gensim.
    """
    from gensim.scripts.glove2word2vec import glove2word2vec
    glove2word2vec(glove_path, output_path)
    print(f"Converted to {output_path}")

def save_for_fast_loading(model, output_path: str):
    """
    Save in gensim's native format for fast subsequent loading.
    """
    model.save(output_path)
    print(f"Saved to {output_path}")
    # Next time, load with:
    # model = KeyedVectors.load(output_path)

# ============================================
# Practical Example: Load, Filter, Save
# ============================================

def prepare_embeddings_for_task(
    pretrained_path: str,
    task_vocabulary: set,
    output_path: str
):
    """
    Load pre-trained embeddings and save only the words needed for the task.
    Reduces model size significantly.
    """
    # Load full model
    embeddings, dim = load_glove_from_file(pretrained_path)

    # Filter to task vocabulary
    filtered = {w: embeddings[w] for w in task_vocabulary if w in embeddings}
    coverage = len(filtered) / len(task_vocabulary)
    print(f"Filtered to {len(filtered)} vectors ({coverage:.1%} coverage)")

    # Save filtered embeddings
    np.savez_compressed(
        output_path,
        words=list(filtered.keys()),
        vectors=np.array(list(filtered.values()))
    )
    print(f"Saved to {output_path}.npz")

    return filtered

# Usage example
glove = load_via_gensim_api('glove-wiki-gigaword-100')

# Basic operations
print("Similarity examples:")
print(f"  king ↔ queen: {glove.similarity('king', 'queen'):.4f}")
print(f"  python ↔ java: {glove.similarity('python', 'java'):.4f}")

# Analogy
result = glove.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
print(f"  king - man + woman = {result[0][0]}")
```

First load from text/binary format, then save in gensim's native format using `.save()`. Subsequent loads will be 10-50x faster. For production, preload embeddings at service startup and keep them in memory.
Deep learning models typically need embeddings as a weight matrix where row i contains the embedding for vocabulary word i. Here's how to construct this properly:
```python
import numpy as np
from typing import Dict, List, Tuple
import torch
import tensorflow as tf

def create_embedding_matrix(
    word_to_idx: Dict[str, int],
    pretrained_vectors,
    embedding_dim: int,
    oov_strategy: str = 'random',  # 'random', 'zero', 'mean'
    scale: float = 0.25
) -> Tuple[np.ndarray, float]:
    """
    Create embedding matrix for deep learning models.

    Args:
        word_to_idx: Vocabulary mapping {word: index}
        pretrained_vectors: Pre-trained word vectors
        embedding_dim: Dimension of embeddings
        oov_strategy: How to handle words not in pre-trained
        scale: Scale for random initialization

    Returns:
        (embedding_matrix, coverage_ratio)
    """
    vocab_size = len(word_to_idx)

    # Initialize matrix
    if oov_strategy == 'zero':
        embedding_matrix = np.zeros((vocab_size, embedding_dim), dtype=np.float32)
    else:
        # Random initialization for OOV words
        embedding_matrix = np.random.uniform(
            -scale, scale, (vocab_size, embedding_dim)
        ).astype(np.float32)

    # Optionally compute mean for 'mean' strategy
    if oov_strategy == 'mean':
        sample_size = min(10000, len(pretrained_vectors))
        sample_words = list(pretrained_vectors.key_to_index.keys())[:sample_size]
        mean_vec = np.mean([pretrained_vectors[w] for w in sample_words], axis=0)

    # Fill in pre-trained vectors
    found = 0
    for word, idx in word_to_idx.items():
        if word in pretrained_vectors:
            embedding_matrix[idx] = pretrained_vectors[word]
            found += 1
        elif oov_strategy == 'mean':
            embedding_matrix[idx] = mean_vec

    coverage = found / vocab_size
    print(f"Created embedding matrix: {vocab_size} x {embedding_dim}")
    print(f"Pre-trained coverage: {found}/{vocab_size} ({coverage:.1%})")

    return embedding_matrix, coverage

# ============================================
# PyTorch Integration
# ============================================

def create_pytorch_embedding_layer(
    word_to_idx: Dict[str, int],
    pretrained_vectors,
    embedding_dim: int,
    trainable: bool = True,
    padding_idx: int = 0
) -> torch.nn.Embedding:
    """
    Create a PyTorch Embedding layer initialized with pre-trained vectors.
    """
    # Create numpy matrix
    embedding_matrix, _ = create_embedding_matrix(
        word_to_idx, pretrained_vectors, embedding_dim
    )

    # Convert to PyTorch tensor
    embedding_tensor = torch.FloatTensor(embedding_matrix)

    # Create embedding layer
    vocab_size = len(word_to_idx)
    embedding_layer = torch.nn.Embedding(
        num_embeddings=vocab_size,
        embedding_dim=embedding_dim,
        padding_idx=padding_idx
    )

    # Initialize with pre-trained weights
    embedding_layer.weight = torch.nn.Parameter(embedding_tensor)

    # Optionally freeze weights
    embedding_layer.weight.requires_grad = trainable

    return embedding_layer

# Example PyTorch model with pre-trained embeddings
class TextClassifier(torch.nn.Module):
    def __init__(self, word_to_idx, pretrained_vectors, num_classes):
        super().__init__()
        embedding_dim = pretrained_vectors.vector_size

        # Pre-trained embedding layer
        self.embedding = create_pytorch_embedding_layer(
            word_to_idx, pretrained_vectors, embedding_dim,
            trainable=True  # Fine-tune embeddings
        )

        self.lstm = torch.nn.LSTM(
            embedding_dim, 128,
            batch_first=True, bidirectional=True
        )
        self.fc = torch.nn.Linear(256, num_classes)

    def forward(self, x):
        # x: (batch, seq_len) of word indices
        embedded = self.embedding(x)  # (batch, seq_len, embedding_dim)
        _, (hidden, _) = self.lstm(embedded)
        hidden = torch.cat([hidden[-2], hidden[-1]], dim=1)  # (batch, 256)
        return self.fc(hidden)

# ============================================
# TensorFlow/Keras Integration
# ============================================

def create_keras_embedding_layer(
    word_to_idx: Dict[str, int],
    pretrained_vectors,
    embedding_dim: int,
    trainable: bool = True,
    mask_zero: bool = True
):
    """
    Create a Keras Embedding layer initialized with pre-trained vectors.
    """
    # Create numpy matrix
    embedding_matrix, _ = create_embedding_matrix(
        word_to_idx, pretrained_vectors, embedding_dim
    )

    # Create Keras layer
    embedding_layer = tf.keras.layers.Embedding(
        input_dim=len(word_to_idx),
        output_dim=embedding_dim,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=trainable,
        mask_zero=mask_zero  # Mask padding tokens
    )

    return embedding_layer

# Example Keras model
def build_keras_model(word_to_idx, pretrained_vectors, num_classes):
    embedding_dim = pretrained_vectors.vector_size

    model = tf.keras.Sequential([
        create_keras_embedding_layer(
            word_to_idx, pretrained_vectors, embedding_dim, trainable=True
        ),
        tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(128)
        ),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(num_classes, activation='softmax')
    ])

    return model
```

Freeze embeddings (`trainable=False`) when you have very little training data (prevents overfitting), the pre-trained domain matches well, or you want faster training. Fine-tune embeddings (`trainable=True`) when you have moderate or more training data, your domain differs from the pre-trained corpus, or maximum task performance is critical. A middle ground: freeze initially, then unfreeze later in training.
Pre-trained embeddings may not perfectly match your domain. Several strategies can adapt them:
1. Fine-tuning on domain corpus:
Continue training the embedding model on your domain-specific text. This updates vectors to better reflect domain usage while retaining general knowledge.
```python
from gensim.models import Word2Vec, FastText, KeyedVectors
from gensim.models.callbacks import CallbackAny2Vec
import numpy as np

# ============================================
# Strategy 1: Continue Training (Fine-tuning)
# ============================================

def finetune_word2vec(
    pretrained_path: str,
    domain_sentences,
    epochs: int = 10,
    learning_rate: float = 0.01
):
    """
    Fine-tune pre-trained Word2Vec on a domain corpus.

    Note: Requires the full model (.model), not just vectors (.bin/.vec)
    """
    # Load pre-trained model
    model = Word2Vec.load(pretrained_path)

    # Update learning rate for fine-tuning
    model.min_alpha = learning_rate * 0.0001
    model.alpha = learning_rate

    # Build vocabulary increment (new words)
    model.build_vocab(domain_sentences, update=True)

    # Train on domain data
    model.train(
        domain_sentences,
        total_examples=model.corpus_count,
        epochs=epochs
    )

    return model

def finetune_fasttext(
    pretrained_path: str,
    domain_sentences,
    epochs: int = 10
):
    """
    Fine-tune pre-trained FastText on a domain corpus.
    FastText handles this better due to subword information.
    """
    from gensim.models.fasttext import load_facebook_model

    # Load pre-trained
    model = load_facebook_model(pretrained_path)

    # Build vocabulary increment
    model.build_vocab(domain_sentences, update=True)

    # Fine-tune
    model.train(
        domain_sentences,
        total_examples=model.corpus_count,
        epochs=epochs
    )

    return model

# ============================================
# Strategy 2: Train Domain Model + Combine
# ============================================

def combine_embeddings(
    general_vectors,
    domain_vectors,
    alpha: float = 0.5
) -> dict:
    """
    Combine general and domain embeddings via weighted average.
    For words in both: blend.
    For words in only one: use that one.

    Args:
        general_vectors: Pre-trained general embeddings
        domain_vectors: Domain-trained embeddings
        alpha: Weight for general (1 - alpha for domain)
    """
    combined = {}

    all_words = set(general_vectors.key_to_index.keys()) | set(domain_vectors.key_to_index.keys())

    for word in all_words:
        in_general = word in general_vectors
        in_domain = word in domain_vectors

        if in_general and in_domain:
            # Blend
            combined[word] = (
                alpha * general_vectors[word]
                + (1 - alpha) * domain_vectors[word]
            )
        elif in_general:
            combined[word] = general_vectors[word]
        else:
            combined[word] = domain_vectors[word]

    print(f"Combined vocabulary: {len(combined)}")
    return combined

# ============================================
# Strategy 3: Retrofitting to Domain Lexicon
# ============================================

def retrofit_embeddings(
    word_vectors: dict,
    lexicon: dict,  # word -> list of related words
    num_iters: int = 10,
    alpha: float = 0.5
) -> dict:
    """
    Retrofit embeddings using domain-specific relationships.
    Based on Faruqui et al. (2015) retrofitting.

    Args:
        word_vectors: Original embeddings {word: vector}
        lexicon: Domain relationships {word: [related_words]}
        num_iters: Number of retrofitting iterations
        alpha: Balance between original and neighbors
    """
    new_vectors = {w: v.copy() for w, v in word_vectors.items()}

    for iteration in range(num_iters):
        for word in lexicon:
            if word not in word_vectors:
                continue

            neighbors = [n for n in lexicon[word] if n in word_vectors]
            if not neighbors:
                continue

            # Weighted average of original + neighbors
            neighbor_avg = np.mean([new_vectors[n] for n in neighbors], axis=0)
            new_vectors[word] = (
                alpha * word_vectors[word]
                + (1 - alpha) * neighbor_avg
            )

    return new_vectors

# ============================================
# Strategy 4: Linear Projection
# ============================================

def learn_domain_projection(
    general_vectors,
    domain_vectors,
    domain_pairs: list,  # [(general_word, domain_word), ...]
) -> np.ndarray:
    """
    Learn a linear transformation from general to domain space.
    Uses word pairs that appear in both spaces.
    """
    from sklearn.linear_model import LinearRegression

    X = []  # General embeddings
    y = []  # Domain embeddings

    for gen_word, dom_word in domain_pairs:
        if gen_word in general_vectors and dom_word in domain_vectors:
            X.append(general_vectors[gen_word])
            y.append(domain_vectors[dom_word])

    X = np.array(X)
    y = np.array(y)

    # Fit linear transformation
    regression = LinearRegression()
    regression.fit(X, y)

    return regression.coef_.T  # Transformation matrix

# Example: Create domain-adapted embeddings
domain_corpus = [
    "patient presents with acute myocardial infarction".split(),
    "ecg shows st elevation in leads v1 through v4".split(),
    # ... more medical text
]

# Option 1: Fine-tune
# adapted = finetune_fasttext('cc.en.300.bin', domain_corpus)

# Option 2: Combine with domain-trained
# domain_model = Word2Vec(domain_corpus, vector_size=100, epochs=50)
# combined = combine_embeddings(general_vectors, domain_model.wv, alpha=0.3)

# Option 3: Retrofit with domain lexicon (e.g., UMLS for medical)
# medical_lexicon = load_umls_synonyms()
# retrofitted = retrofit_embeddings(general_vectors, medical_lexicon)
```

Start simple: use pre-trained embeddings as-is and measure task performance. Only invest in domain adaptation if (1) vocabulary coverage is poor, or (2) domain semantics differ significantly from general text. Fine-tuning during end-task training (inside your neural network) often provides sufficient adaptation without explicit embedding modification.
The decision between training custom embeddings versus using pre-trained ones is fundamental. Here's a framework for making this choice:
| Domain Corpus Size | Vocabulary Coverage | Recommendation |
|---|---|---|
| < 1M tokens | Any | Use pre-trained; not enough data for quality custom embeddings |
| 1M - 10M tokens | > 90% | Use pre-trained; coverage is sufficient |
| 1M - 10M tokens | < 90% | Pre-trained + fine-tuning or FastText for OOV |
| 10M - 100M tokens | > 80% | Pre-trained with optional fine-tuning |
| 10M - 100M tokens | < 80% | Consider custom training or extensive adaptation |
| > 100M tokens | Any | Custom training becomes viable; experiment with both |
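This table translates directly into a simple decision helper. The sketch below is purely illustrative (the function name and thresholds just restate the table, not a library API), using the token-level coverage computed earlier:

```python
def recommend_embedding_strategy(corpus_tokens: int, coverage_by_token: float) -> str:
    """Map (domain corpus size, vocabulary coverage) to a recommendation,
    following the decision table above."""
    if corpus_tokens < 1_000_000:
        return "Use pre-trained; not enough data for quality custom embeddings"
    if corpus_tokens < 10_000_000:
        if coverage_by_token > 0.90:
            return "Use pre-trained; coverage is sufficient"
        return "Pre-trained + fine-tuning, or FastText for OOV"
    if corpus_tokens < 100_000_000:
        if coverage_by_token > 0.80:
            return "Pre-trained with optional fine-tuning"
        return "Consider custom training or extensive adaptation"
    return "Custom training becomes viable; experiment with both"

# Example with hypothetical numbers:
# print(recommend_embedding_strategy(25_000_000, 0.72))
```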
You don't have to choose one or the other. Common hybrids: (1) Initialize with pre-trained, fine-tune on domain data. (2) Use pre-trained for common words, train custom for domain terms. (3) Concatenate pre-trained and domain embeddings as features.
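As an example of hybrid (3), here is a minimal sketch (assuming both sets of vectors support `in` and `[]` lookups, e.g. gensim `KeyedVectors` or plain dicts; the function name and dimensions are illustrative) that concatenates a general and a domain vector per word, falling back to zeros when one side is missing:

```python
import numpy as np

def concat_embeddings(word, general_vectors, domain_vectors,
                      general_dim=300, domain_dim=100):
    """Return a (general_dim + domain_dim) feature vector for `word`."""
    gen = (general_vectors[word] if word in general_vectors
           else np.zeros(general_dim, dtype=np.float32))
    dom = (domain_vectors[word] if word in domain_vectors
           else np.zeros(domain_dim, dtype=np.float32))
    return np.concatenate([gen, dom])

# Usage with hypothetical vector sets:
# features = concat_embeddings("stent", glove_vectors, medical_vectors)
```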
Before committing to an embedding choice, evaluate quality through intrinsic and extrinsic measures:
```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics.pairwise import cosine_similarity

# ============================================
# Intrinsic Evaluation: Word Similarity
# ============================================

def load_similarity_dataset(filepath: str):
    """
    Load a word similarity benchmark dataset.

    Standard datasets:
    - SimLex-999: https://fh295.github.io/simlex.html
    - WordSim-353: http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/
    - MEN: https://staff.fnwi.uva.nl/e.bruni/MEN
    """
    pairs = []
    with open(filepath, 'r') as f:
        for line in f:
            parts = line.strip().split()
            if len(parts) >= 3:
                word1, word2, score = parts[0], parts[1], float(parts[2])
                pairs.append((word1, word2, score))
    return pairs

def evaluate_word_similarity(word_vectors, similarity_pairs):
    """
    Evaluate embeddings on a word similarity benchmark.
    Returns Spearman correlation between human and model similarities.
    """
    human_scores = []
    model_scores = []
    skipped = 0

    for word1, word2, human_score in similarity_pairs:
        if word1 not in word_vectors or word2 not in word_vectors:
            skipped += 1
            continue

        vec1 = word_vectors[word1]
        vec2 = word_vectors[word2]
        model_sim = cosine_similarity([vec1], [vec2])[0, 0]

        human_scores.append(human_score)
        model_scores.append(model_sim)

    if len(human_scores) < 10:
        print(f"Too few pairs ({len(human_scores)}) for reliable evaluation")
        return None

    correlation, p_value = spearmanr(human_scores, model_scores)

    print("Word Similarity Evaluation:")
    print(f"  Pairs evaluated: {len(human_scores)} / {len(similarity_pairs)}")
    print(f"  Skipped (OOV): {skipped}")
    print(f"  Spearman correlation: {correlation:.4f} (p={p_value:.4e})")

    return correlation

# ============================================
# Intrinsic Evaluation: Analogies
# ============================================

def load_analogy_dataset(filepath: str):
    """
    Load an analogy dataset (Google's or similar).
    Format: each line has four words "a b c d" meaning a:b :: c:d,
    e.g. "king man queen woman".
    """
    analogies = []
    current_category = None

    with open(filepath, 'r') as f:
        for line in f:
            line = line.strip()
            if line.startswith(':'):
                current_category = line[2:]
            elif line:
                parts = line.lower().split()
                if len(parts) == 4:
                    analogies.append((parts[0], parts[1], parts[2], parts[3], current_category))

    return analogies

def evaluate_analogies(word_vectors, analogies, top_k: int = 4):
    """
    Evaluate embeddings on the analogy task.
    a:b :: c:d  →  Is d in the top-k results for vec(b) - vec(a) + vec(c)?
    """
    correct = 0
    total = 0

    for a, b, c, d, category in analogies:
        if any(w not in word_vectors for w in [a, b, c, d]):
            continue

        total += 1

        # Nearest neighbors of b - a + c (excluding a, b, c themselves)
        result = word_vectors.most_similar(
            positive=[b, c], negative=[a], topn=top_k
        )
        predictions = [word for word, _ in result]

        if d in predictions:
            correct += 1

    accuracy = correct / total if total > 0 else 0

    print("Analogy Evaluation:")
    print(f"  Analogies evaluated: {total} / {len(analogies)}")
    print(f"  Accuracy (top-{top_k}): {accuracy:.3%}")

    return accuracy

# ============================================
# Extrinsic Evaluation: Downstream Task
# ============================================

def evaluate_on_classification(
    word_vectors,
    train_docs, train_labels,
    test_docs, test_labels
):
    """
    Evaluate embeddings on a text classification task.
    Uses simple averaged embeddings + logistic regression.
    """
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, f1_score

    def get_doc_embedding(doc, wv):
        words = doc.lower().split()
        vectors = [wv[w] for w in words if w in wv]
        if vectors:
            return np.mean(vectors, axis=0)
        return np.zeros(wv.vector_size)

    X_train = np.array([get_doc_embedding(d, word_vectors) for d in train_docs])
    X_test = np.array([get_doc_embedding(d, word_vectors) for d in test_docs])

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, train_labels)

    predictions = clf.predict(X_test)
    acc = accuracy_score(test_labels, predictions)
    f1 = f1_score(test_labels, predictions, average='weighted')

    print("Classification Evaluation:")
    print(f"  Accuracy: {acc:.3%}")
    print(f"  F1 Score: {f1:.3%}")

    return acc, f1

# ============================================
# Comprehensive Evaluation
# ============================================

def full_embedding_evaluation(word_vectors, datasets: dict):
    """
    Run a comprehensive evaluation suite.

    Args:
        word_vectors: Embeddings to evaluate
        datasets: Dict with paths to similarity, analogy, classification data
    """
    results = {}

    # Word similarity
    if 'similarity' in datasets:
        pairs = load_similarity_dataset(datasets['similarity'])
        results['similarity'] = evaluate_word_similarity(word_vectors, pairs)

    # Analogies
    if 'analogies' in datasets:
        analogies = load_analogy_dataset(datasets['analogies'])
        results['analogy'] = evaluate_analogies(word_vectors, analogies)

    # Classification
    if 'classification' in datasets:
        data = datasets['classification']
        acc, f1 = evaluate_on_classification(
            word_vectors,
            data['train_docs'], data['train_labels'],
            data['test_docs'], data['test_labels']
        )
        results['classification_acc'] = acc
        results['classification_f1'] = f1

    return results
```

While intrinsic metrics (similarity, analogies) are informative, extrinsic evaluation on your actual task matters most. Embeddings that score poorly on general benchmarks may excel on your specific domain task. Always evaluate on a held-out portion of your actual data before finalizing your embedding choice.
Let's consolidate the key practices for working with pre-trained embeddings:
| Scenario | Recommended Approach |
|---|---|
| General English NLP, quality first | GloVe 840B (300d) or GloVe 6B (300d) |
| General English NLP, size constrained | GloVe 6B (100d) or (50d) |
| Social media / informal text | GloVe Twitter (200d) or FastText CC |
| Non-English language | FastText Wiki/CC for that language |
| OOV handling critical | FastText (any variant with subwords) |
| Medical / legal / scientific | Domain-specific embeddings if available; else fine-tune FastText |
| Fairness / bias concerns | ConceptNet Numberbatch (debiased) |
We've covered the practical side of using pre-trained word embeddings: what's available, how to choose and load them, how to integrate them into deep learning models, how to adapt them to a domain, and how to evaluate the result.
Module Complete:
Congratulations! You've completed the module on Word Embeddings. You now understand how word embeddings are learned, what pre-trained options exist, and how to select, load, adapt, and evaluate them for your own tasks.
These techniques form the foundation for representing text in machine learning. While contextualized embeddings (BERT, etc.) are increasingly popular, static word embeddings remain valuable for their efficiency, interpretability, and effectiveness on many tasks.
From foundational methods through practical usage, these techniques power everything from search engines to chatbots, and fluency with embeddings is essential for any NLP practitioner. Apply them to represent text effectively in your own machine learning projects.