Training high-quality word embeddings requires massive amounts of text and significant computational resources. The original Word2Vec vectors were trained on a Google News corpus of roughly 100 billion words; GloVe's best models were trained on 840 billion tokens from Common Crawl. For most practitioners, replicating this is neither feasible nor necessary.
Pre-trained embeddings are word vectors trained by research teams on enormous corpora and released publicly. Using them, you inherit the semantic knowledge from billions of words of training without the cost. This is often called transfer learning for NLP—leveraging representations learned on one task (language modeling) for another (your specific application).
This page covers the practical art of selecting, loading, evaluating, and adapting pre-trained embeddings for your specific needs.
By the end of this page, you will understand:

- the landscape of available pre-trained embeddings,
- selection criteria for choosing the right embeddings,
- practical loading and integration techniques,
- domain adaptation strategies when pre-trained embeddings don't quite fit, and
- principled decision-making on when to train custom embeddings.
Several organizations have released high-quality pre-trained word embeddings. Understanding what's available helps you make informed choices.
| Name | Type | Training Data | Vocab Size | Dimensions | Notable Features |
|---|---|---|---|---|---|
| Google Word2Vec | Word2Vec | Google News (100B words) | 3M | 300 | Classic; widely used baseline |
| GloVe 6B | GloVe | Wikipedia + Gigaword | 400K | 50, 100, 200, 300 | Clean, balanced corpus |
| GloVe 42B | GloVe | Common Crawl (42B tokens) | 1.9M | 300 | Larger vocabulary |
| GloVe 840B | GloVe | Common Crawl (840B tokens) | 2.2M | 300 | Largest GloVe; best quality |
| GloVe Twitter | GloVe | Twitter (27B tokens) | 1.2M | 25, 50, 100, 200 | Informal language, hashtags |
| FastText Wiki | FastText | Wikipedia (per language) | ~1M | 300 | 157 languages; subword support |
| FastText CC | FastText | Common Crawl + Wiki | ~2M | 300 | 157 languages; best coverage |
| ConceptNet Numberbatch | Ensemble | Multiple sources | 516K | 300 | Knowledge-enhanced; debiased |
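Most of the GloVe, Word2Vec, and FastText variants in this table can be fetched through gensim's downloader. As a quick, illustrative sketch (assuming gensim is installed and you have network access), you can list exactly which packaged models your gensim version offers:

```python
import gensim.downloader as api

# Print the names of the pre-trained models gensim can download
for name in sorted(api.info()['models'].keys()):
    print(name)
```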
This module covers static word embeddings (one vector per word). Modern NLP increasingly uses contextualized embeddings like ELMo, BERT, and GPT, where the same word gets different vectors depending on context. These are covered in the Transformers chapter. Static embeddings remain valuable for efficiency-critical applications and as input features to larger models.
Selecting appropriate pre-trained embeddings requires matching the embedding properties to your task requirements:
```python
import numpy as np
from typing import List, Dict, Tuple
from collections import Counter

def evaluate_vocabulary_coverage(
    word_vectors,
    domain_corpus: List[str],
    min_frequency: int = 5
) -> Dict[str, float]:
    """
    Evaluate how well pre-trained embeddings cover your domain vocabulary.

    Args:
        word_vectors: Pre-trained word vectors (gensim or dict)
        domain_corpus: List of documents from your domain
        min_frequency: Only consider words appearing at least this often

    Returns:
        Coverage statistics
    """
    # Tokenize and count
    all_words = []
    for doc in domain_corpus:
        all_words.extend(doc.lower().split())

    word_counts = Counter(all_words)
    total_tokens = sum(word_counts.values())

    # Vocabulary (words meeting min frequency)
    domain_vocab = {w for w, c in word_counts.items() if c >= min_frequency}

    # Check coverage
    covered = sum(1 for w in domain_vocab if w in word_vectors)
    covered_tokens = sum(word_counts[w] for w in domain_vocab if w in word_vectors)

    # Find OOV words with high frequency (most impactful gaps)
    oov_words = [(w, word_counts[w]) for w in domain_vocab if w not in word_vectors]
    oov_words.sort(key=lambda x: -x[1])

    return {
        'total_tokens': total_tokens,
        'unique_words': len(domain_vocab),
        'covered_words': covered,
        'coverage_by_type': covered / len(domain_vocab) if domain_vocab else 0,
        'coverage_by_token': (covered_tokens / sum(word_counts[w] for w in domain_vocab)
                              if domain_vocab else 0),
        'top_oov': oov_words[:20],  # Most frequent OOV words
    }

def print_coverage_report(stats: Dict):
    """Pretty-print coverage statistics."""
    print("=" * 50)
    print("VOCABULARY COVERAGE REPORT")
    print("=" * 50)
    print(f"Total tokens in corpus: {stats['total_tokens']:,}")
    print(f"Unique words (min freq met): {stats['unique_words']:,}")
    print(f"Words in embeddings: {stats['covered_words']:,}")
    print(f"Coverage by type (unique words): {stats['coverage_by_type']:.1%}")
    print(f"Coverage by token (occurrences): {stats['coverage_by_token']:.1%}")
    if stats['top_oov']:
        print("Top OOV words (high frequency but missing):")
        for word, count in stats['top_oov'][:10]:
            print(f"  '{word}': {count} occurrences")

# Example usage
# domain_corpus = [...]               # load your domain documents
# word_vectors = load_glove_vectors() # or any embedding
stats = evaluate_vocabulary_coverage(word_vectors, domain_corpus, min_frequency=5)
print_coverage_report(stats)

# Decision guidance
if stats['coverage_by_token'] < 0.80:
    print("⚠️ Low coverage - consider:")
    print("  1. Using FastText (handles OOV)")
    print("  2. Training domain-specific embeddings")
    print("  3. Using larger pre-trained vocabulary")
elif stats['coverage_by_token'] < 0.95:
    print("✓ Good coverage - pre-trained embeddings should work well")
    print("  Consider handling critical OOV terms with fallbacks")
else:
    print("✅ Excellent coverage - use pre-trained embeddings confidently")
```

There are multiple ways to load pre-trained embeddings, each with trade-offs between convenience and control:
```python
import numpy as np
from typing import Dict, Tuple, Optional
import gensim.downloader as gensim_api
from gensim.models import KeyedVectors

# ============================================
# Method 1: Gensim Downloader (Easiest)
# ============================================

def load_via_gensim_api(name: str = 'glove-wiki-gigaword-100'):
    """
    Load embeddings via gensim's downloader API.
    Downloads automatically if not cached.

    Available models include:
        'word2vec-google-news-300'         # 1.6GB, Google News
        'glove-wiki-gigaword-50'           # 66MB
        'glove-wiki-gigaword-100'          # 128MB
        'glove-wiki-gigaword-200'          # 252MB
        'glove-wiki-gigaword-300'          # 376MB
        'glove-twitter-25'                 # 105MB
        'glove-twitter-50'                 # 200MB
        'glove-twitter-100'                # 387MB
        'glove-twitter-200'                # 758MB
        'fasttext-wiki-news-subwords-300'  # 958MB
    """
    print(f"Loading {name}...")
    model = gensim_api.load(name)
    print(f"Loaded {len(model)} vectors of dimension {model.vector_size}")
    return model

# ============================================
# Method 2: Load from File (More Control)
# ============================================

def load_glove_from_file(
    filepath: str,
    vocab_limit: Optional[int] = None
) -> Tuple[Dict[str, np.ndarray], int]:
    """
    Load GloVe vectors from a text file.

    Args:
        filepath: Path to .txt file (word vec1 vec2 ...)
        vocab_limit: Only load first N words (for memory)

    Returns:
        (word_to_vector dict, dimension)
    """
    embeddings = {}
    dim = None

    with open(filepath, 'r', encoding='utf-8') as f:
        for i, line in enumerate(f):
            if vocab_limit and i >= vocab_limit:
                break
            values = line.rstrip().split(' ')
            word = values[0]
            try:
                vector = np.array([float(x) for x in values[1:]], dtype=np.float32)
                if dim is None:
                    dim = len(vector)
                embeddings[word] = vector
            except ValueError:
                continue  # Skip malformed lines

            if (i + 1) % 100000 == 0:
                print(f"  Loaded {i+1:,} vectors...")

    print(f"Loaded {len(embeddings):,} vectors of dimension {dim}")
    return embeddings, dim

# ============================================
# Method 3: Load Word2Vec Binary Format
# ============================================

def load_word2vec_binary(filepath: str, limit: int = None):
    """
    Load Word2Vec binary format (e.g., Google News vectors).
    """
    model = KeyedVectors.load_word2vec_format(
        filepath,
        binary=True,
        limit=limit  # Load only first N words for memory
    )
    return model

# ============================================
# Method 4: Memory-Mapped Loading (Large Models)
# ============================================

def load_memory_mapped(filepath: str):
    """
    Load large models with memory mapping.
    Reduces RAM usage by loading from disk on-demand.
    """
    model = KeyedVectors.load(filepath, mmap='r')
    return model

# ============================================
# Converting Between Formats
# ============================================

def convert_glove_to_word2vec_format(glove_path: str, output_path: str):
    """
    Convert GloVe text format to Word2Vec format.
    Adds the header line required by gensim.
    """
    from gensim.scripts.glove2word2vec import glove2word2vec
    glove2word2vec(glove_path, output_path)
    print(f"Converted to {output_path}")

def save_for_fast_loading(model, output_path: str):
    """
    Save in gensim's native format for fast subsequent loading.
    """
    model.save(output_path)
    print(f"Saved to {output_path}")
    # Next time, load with:
    # model = KeyedVectors.load(output_path)

# ============================================
# Practical Example: Load, Filter, Save
# ============================================

def prepare_embeddings_for_task(
    pretrained_path: str,
    task_vocabulary: set,
    output_path: str
):
    """
    Load pre-trained embeddings and save only the words needed for the task.
    Reduces model size significantly.
    """
    # Load full model
    embeddings, dim = load_glove_from_file(pretrained_path)

    # Filter to task vocabulary
    filtered = {w: embeddings[w] for w in task_vocabulary if w in embeddings}
    coverage = len(filtered) / len(task_vocabulary)
    print(f"Filtered to {len(filtered)} vectors ({coverage:.1%} coverage)")

    # Save filtered embeddings
    np.savez_compressed(
        output_path,
        words=list(filtered.keys()),
        vectors=np.array(list(filtered.values()))
    )
    print(f"Saved to {output_path}.npz")

    return filtered

# Usage example
glove = load_via_gensim_api('glove-wiki-gigaword-100')

# Basic operations
print("Similarity examples:")
print(f"  king ↔ queen: {glove.similarity('king', 'queen'):.4f}")
print(f"  python ↔ java: {glove.similarity('python', 'java'):.4f}")

# Analogy
result = glove.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
print(f"  king - man + woman = {result[0][0]}")
```

First load from text/binary format, then save in gensim's native format using `.save()`. Subsequent loads will be 10-50x faster. For production, preload embeddings at service startup and keep them in memory.
Deep learning models typically need embeddings as a weight matrix where row i contains the embedding for vocabulary word i. Here's how to construct this properly:
```python
import numpy as np
from typing import Dict, List, Tuple
import torch
import tensorflow as tf

def create_embedding_matrix(
    word_to_idx: Dict[str, int],
    pretrained_vectors,
    embedding_dim: int,
    oov_strategy: str = 'random',  # 'random', 'zero', 'mean'
    scale: float = 0.25
) -> Tuple[np.ndarray, float]:
    """
    Create embedding matrix for deep learning models.

    Args:
        word_to_idx: Vocabulary mapping {word: index}
        pretrained_vectors: Pre-trained word vectors
        embedding_dim: Dimension of embeddings
        oov_strategy: How to handle words not in pre-trained
        scale: Scale for random initialization

    Returns:
        (embedding_matrix, coverage_ratio)
    """
    vocab_size = len(word_to_idx)

    # Initialize matrix
    if oov_strategy == 'zero':
        embedding_matrix = np.zeros((vocab_size, embedding_dim), dtype=np.float32)
    else:
        # Random initialization for OOV words
        embedding_matrix = np.random.uniform(
            -scale, scale, (vocab_size, embedding_dim)
        ).astype(np.float32)

    # Optionally compute mean for 'mean' strategy
    if oov_strategy == 'mean':
        sample_size = min(10000, len(pretrained_vectors))
        sample_words = list(pretrained_vectors.key_to_index.keys())[:sample_size]
        mean_vec = np.mean([pretrained_vectors[w] for w in sample_words], axis=0)

    # Fill in pre-trained vectors
    found = 0
    for word, idx in word_to_idx.items():
        if word in pretrained_vectors:
            embedding_matrix[idx] = pretrained_vectors[word]
            found += 1
        elif oov_strategy == 'mean':
            embedding_matrix[idx] = mean_vec

    coverage = found / vocab_size
    print(f"Created embedding matrix: {vocab_size} x {embedding_dim}")
    print(f"Pre-trained coverage: {found}/{vocab_size} ({coverage:.1%})")

    return embedding_matrix, coverage

# ============================================
# PyTorch Integration
# ============================================

def create_pytorch_embedding_layer(
    word_to_idx: Dict[str, int],
    pretrained_vectors,
    embedding_dim: int,
    trainable: bool = True,
    padding_idx: int = 0
) -> torch.nn.Embedding:
    """
    Create a PyTorch Embedding layer initialized with pre-trained vectors.
    """
    # Create numpy matrix
    embedding_matrix, _ = create_embedding_matrix(
        word_to_idx, pretrained_vectors, embedding_dim
    )

    # Convert to PyTorch tensor
    embedding_tensor = torch.FloatTensor(embedding_matrix)

    # Create embedding layer
    vocab_size = len(word_to_idx)
    embedding_layer = torch.nn.Embedding(
        num_embeddings=vocab_size,
        embedding_dim=embedding_dim,
        padding_idx=padding_idx
    )

    # Initialize with pre-trained weights
    embedding_layer.weight = torch.nn.Parameter(embedding_tensor)

    # Optionally freeze weights
    embedding_layer.weight.requires_grad = trainable

    return embedding_layer

# Example PyTorch model with pre-trained embeddings
class TextClassifier(torch.nn.Module):
    def __init__(self, word_to_idx, pretrained_vectors, num_classes):
        super().__init__()
        embedding_dim = pretrained_vectors.vector_size

        # Pre-trained embedding layer
        self.embedding = create_pytorch_embedding_layer(
            word_to_idx, pretrained_vectors, embedding_dim,
            trainable=True  # Fine-tune embeddings
        )

        self.lstm = torch.nn.LSTM(
            embedding_dim, 128,
            batch_first=True, bidirectional=True
        )
        self.fc = torch.nn.Linear(256, num_classes)

    def forward(self, x):
        # x: (batch, seq_len) of word indices
        embedded = self.embedding(x)  # (batch, seq_len, embedding_dim)
        _, (hidden, _) = self.lstm(embedded)
        hidden = torch.cat([hidden[-2], hidden[-1]], dim=1)  # (batch, 256)
        return self.fc(hidden)

# ============================================
# TensorFlow/Keras Integration
# ============================================

def create_keras_embedding_layer(
    word_to_idx: Dict[str, int],
    pretrained_vectors,
    embedding_dim: int,
    trainable: bool = True,
    mask_zero: bool = True
):
    """
    Create a Keras Embedding layer initialized with pre-trained vectors.
    """
    # Create numpy matrix
    embedding_matrix, _ = create_embedding_matrix(
        word_to_idx, pretrained_vectors, embedding_dim
    )

    # Create Keras layer
    embedding_layer = tf.keras.layers.Embedding(
        input_dim=len(word_to_idx),
        output_dim=embedding_dim,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=trainable,
        mask_zero=mask_zero  # Mask padding tokens
    )

    return embedding_layer

# Example Keras model
def build_keras_model(word_to_idx, pretrained_vectors, num_classes):
    embedding_dim = pretrained_vectors.vector_size

    model = tf.keras.Sequential([
        create_keras_embedding_layer(
            word_to_idx, pretrained_vectors, embedding_dim, trainable=True
        ),
        tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(128)
        ),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(num_classes, activation='softmax')
    ])

    return model
```

Freeze embeddings (`trainable=False`) when you have very little training data (prevents overfitting), the pre-trained domain matches well, or you want faster training. Fine-tune embeddings (`trainable=True`) when you have moderate or more training data, your domain differs from the pre-trained corpus, or maximum task performance is critical. A middle ground: freeze initially, then unfreeze later in training.
Pre-trained embeddings may not perfectly match your domain. Several strategies can adapt them:
1. Fine-tuning on domain corpus:
Continue training the embedding model on your domain-specific text. This updates vectors to better reflect domain usage while retaining general knowledge.
```python
from gensim.models import Word2Vec, FastText, KeyedVectors
from gensim.models.callbacks import CallbackAny2Vec
import numpy as np

# ============================================
# Strategy 1: Continue Training (Fine-tuning)
# ============================================

def finetune_word2vec(
    pretrained_path: str,
    domain_sentences,
    epochs: int = 10,
    learning_rate: float = 0.01
):
    """
    Fine-tune pre-trained Word2Vec on a domain corpus.

    Note: Requires the full model (.model), not just vectors (.bin/.vec)
    """
    # Load pre-trained model
    model = Word2Vec.load(pretrained_path)

    # Update learning rate for fine-tuning
    model.min_alpha = learning_rate * 0.0001
    model.alpha = learning_rate

    # Build vocabulary increment (new words)
    model.build_vocab(domain_sentences, update=True)

    # Train on domain data
    model.train(
        domain_sentences,
        total_examples=model.corpus_count,
        epochs=epochs
    )

    return model

def finetune_fasttext(
    pretrained_path: str,
    domain_sentences,
    epochs: int = 10
):
    """
    Fine-tune pre-trained FastText on a domain corpus.
    FastText handles this better due to subword information.
    """
    from gensim.models.fasttext import load_facebook_model

    # Load pre-trained
    model = load_facebook_model(pretrained_path)

    # Build vocabulary increment
    model.build_vocab(domain_sentences, update=True)

    # Fine-tune
    model.train(
        domain_sentences,
        total_examples=model.corpus_count,
        epochs=epochs
    )

    return model

# ============================================
# Strategy 2: Train Domain Model + Combine
# ============================================

def combine_embeddings(
    general_vectors,
    domain_vectors,
    alpha: float = 0.5
) -> dict:
    """
    Combine general and domain embeddings via weighted average.
    For words in both: blend.
    For words in only one: use that one.

    Args:
        general_vectors: Pre-trained general embeddings
        domain_vectors: Domain-trained embeddings
        alpha: Weight for general (1 - alpha for domain)
    """
    combined = {}

    all_words = set(general_vectors.key_to_index.keys()) | set(domain_vectors.key_to_index.keys())

    for word in all_words:
        in_general = word in general_vectors
        in_domain = word in domain_vectors

        if in_general and in_domain:
            # Blend
            combined[word] = (
                alpha * general_vectors[word]
                + (1 - alpha) * domain_vectors[word]
            )
        elif in_general:
            combined[word] = general_vectors[word]
        else:
            combined[word] = domain_vectors[word]

    print(f"Combined vocabulary: {len(combined)}")
    return combined

# ============================================
# Strategy 3: Retrofitting to Domain Lexicon
# ============================================

def retrofit_embeddings(
    word_vectors: dict,
    lexicon: dict,  # word -> list of related words
    num_iters: int = 10,
    alpha: float = 0.5
) -> dict:
    """
    Retrofit embeddings using domain-specific relationships.
    Based on Faruqui et al. (2015) retrofitting.

    Args:
        word_vectors: Original embeddings {word: vector}
        lexicon: Domain relationships {word: [related_words]}
        num_iters: Number of retrofitting iterations
        alpha: Balance between original and neighbors
    """
    new_vectors = {w: v.copy() for w, v in word_vectors.items()}

    for iteration in range(num_iters):
        for word in lexicon:
            if word not in word_vectors:
                continue

            neighbors = [n for n in lexicon[word] if n in word_vectors]
            if not neighbors:
                continue

            # Weighted average of original + neighbors
            neighbor_avg = np.mean([new_vectors[n] for n in neighbors], axis=0)
            new_vectors[word] = (
                alpha * word_vectors[word]
                + (1 - alpha) * neighbor_avg
            )

    return new_vectors

# ============================================
# Strategy 4: Linear Projection
# ============================================

def learn_domain_projection(
    general_vectors,
    domain_vectors,
    domain_pairs: list,  # [(general_word, domain_word), ...]
) -> np.ndarray:
    """
    Learn a linear transformation from general to domain space.
    Uses word pairs that appear in both spaces.
    """
    from sklearn.linear_model import LinearRegression

    X = []  # General embeddings
    y = []  # Domain embeddings

    for gen_word, dom_word in domain_pairs:
        if gen_word in general_vectors and dom_word in domain_vectors:
            X.append(general_vectors[gen_word])
            y.append(domain_vectors[dom_word])

    X = np.array(X)
    y = np.array(y)

    # Fit linear transformation
    regression = LinearRegression()
    regression.fit(X, y)

    return regression.coef_.T  # Transformation matrix

# Example: Create domain-adapted embeddings
domain_corpus = [
    "patient presents with acute myocardial infarction".split(),
    "ecg shows st elevation in leads v1 through v4".split(),
    # ... more medical text
]

# Option 1: Fine-tune
# adapted = finetune_fasttext('cc.en.300.bin', domain_corpus)

# Option 2: Combine with domain-trained
# domain_model = Word2Vec(domain_corpus, vector_size=100, epochs=50)
# combined = combine_embeddings(general_vectors, domain_model.wv, alpha=0.3)

# Option 3: Retrofit with domain lexicon (e.g., UMLS for medical)
# medical_lexicon = load_umls_synonyms()
# retrofitted = retrofit_embeddings(general_vectors, medical_lexicon)
```

Start simple: use pre-trained embeddings as-is and measure task performance. Only invest in domain adaptation if (1) vocabulary coverage is poor, or (2) domain semantics differ significantly from general text. Fine-tuning during end-task training (inside your neural network) often provides sufficient adaptation without explicit embedding modification.
The decision between training custom embeddings versus using pre-trained ones is fundamental. Here's a framework for making this choice:
| Domain Corpus Size | Vocabulary Coverage | Recommendation |
|---|---|---|
| < 1M tokens | Any | Use pre-trained; not enough data for quality custom embeddings |
| 1M - 10M tokens | > 90% | Use pre-trained; coverage is sufficient |
| 1M - 10M tokens | < 90% | Pre-trained + fine-tuning or FastText for OOV |
| 10M - 100M tokens | > 80% | Pre-trained with optional fine-tuning |
| 10M - 100M tokens | < 80% | Consider custom training or extensive adaptation |
| > 100M tokens | Any | Custom training becomes viable; experiment with both |
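This table translates directly into a simple decision helper. The sketch below is purely illustrative (the function name and thresholds just restate the table, not a library API), using the token-level coverage computed earlier:

```python
def recommend_embedding_strategy(corpus_tokens: int, coverage_by_token: float) -> str:
    """Map (domain corpus size, vocabulary coverage) to a recommendation,
    following the decision table above."""
    if corpus_tokens < 1_000_000:
        return "Use pre-trained; not enough data for quality custom embeddings"
    if corpus_tokens < 10_000_000:
        if coverage_by_token > 0.90:
            return "Use pre-trained; coverage is sufficient"
        return "Pre-trained + fine-tuning, or FastText for OOV"
    if corpus_tokens < 100_000_000:
        if coverage_by_token > 0.80:
            return "Pre-trained with optional fine-tuning"
        return "Consider custom training or extensive adaptation"
    return "Custom training becomes viable; experiment with both"

# Example with hypothetical numbers:
# print(recommend_embedding_strategy(25_000_000, 0.72))
```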
You don't have to choose one or the other. Common hybrids: (1) Initialize with pre-trained, fine-tune on domain data. (2) Use pre-trained for common words, train custom for domain terms. (3) Concatenate pre-trained and domain embeddings as features.
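As an example of hybrid (3), here is a minimal sketch (assuming both sets of vectors support `in` and `[]` lookups, e.g. gensim `KeyedVectors` or plain dicts; the function name and dimensions are illustrative) that concatenates a general and a domain vector per word, falling back to zeros when one side is missing:

```python
import numpy as np

def concat_embeddings(word, general_vectors, domain_vectors,
                      general_dim=300, domain_dim=100):
    """Return a (general_dim + domain_dim) feature vector for `word`."""
    gen = (general_vectors[word] if word in general_vectors
           else np.zeros(general_dim, dtype=np.float32))
    dom = (domain_vectors[word] if word in domain_vectors
           else np.zeros(domain_dim, dtype=np.float32))
    return np.concatenate([gen, dom])

# Usage with hypothetical vector sets:
# features = concat_embeddings("stent", glove_vectors, medical_vectors)
```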
Before committing to an embedding choice, evaluate quality through intrinsic and extrinsic measures:
```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics.pairwise import cosine_similarity

# ============================================
# Intrinsic Evaluation: Word Similarity
# ============================================

def load_similarity_dataset(filepath: str):
    """
    Load a word similarity benchmark dataset.

    Standard datasets:
    - SimLex-999: https://fh295.github.io/simlex.html
    - WordSim-353: http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/
    - MEN: https://staff.fnwi.uva.nl/e.bruni/MEN
    """
    pairs = []
    with open(filepath, 'r') as f:
        for line in f:
            parts = line.strip().split()
            if len(parts) >= 3:
                word1, word2, score = parts[0], parts[1], float(parts[2])
                pairs.append((word1, word2, score))
    return pairs

def evaluate_word_similarity(word_vectors, similarity_pairs):
    """
    Evaluate embeddings on a word similarity benchmark.
    Returns Spearman correlation between human and model similarities.
    """
    human_scores = []
    model_scores = []
    skipped = 0

    for word1, word2, human_score in similarity_pairs:
        if word1 not in word_vectors or word2 not in word_vectors:
            skipped += 1
            continue

        vec1 = word_vectors[word1]
        vec2 = word_vectors[word2]
        model_sim = cosine_similarity([vec1], [vec2])[0, 0]

        human_scores.append(human_score)
        model_scores.append(model_sim)

    if len(human_scores) < 10:
        print(f"Too few pairs ({len(human_scores)}) for reliable evaluation")
        return None

    correlation, p_value = spearmanr(human_scores, model_scores)

    print("Word Similarity Evaluation:")
    print(f"  Pairs evaluated: {len(human_scores)} / {len(similarity_pairs)}")
    print(f"  Skipped (OOV): {skipped}")
    print(f"  Spearman correlation: {correlation:.4f} (p={p_value:.4e})")

    return correlation

# ============================================
# Intrinsic Evaluation: Analogies
# ============================================

def load_analogy_dataset(filepath: str):
    """
    Load an analogy dataset (Google's or similar).
    Format: each line has four words "a b c d" meaning a:b :: c:d,
    e.g. "king man queen woman".
    """
    analogies = []
    current_category = None

    with open(filepath, 'r') as f:
        for line in f:
            line = line.strip()
            if line.startswith(':'):
                current_category = line[2:]
            elif line:
                parts = line.lower().split()
                if len(parts) == 4:
                    analogies.append((parts[0], parts[1], parts[2], parts[3], current_category))

    return analogies

def evaluate_analogies(word_vectors, analogies, top_k: int = 4):
    """
    Evaluate embeddings on the analogy task.
    a:b :: c:d  →  Is d in the top-k results for vec(b) - vec(a) + vec(c)?
    """
    correct = 0
    total = 0

    for a, b, c, d, category in analogies:
        if any(w not in word_vectors for w in [a, b, c, d]):
            continue

        total += 1

        # Nearest neighbors of b - a + c (excluding a, b, c themselves)
        result = word_vectors.most_similar(
            positive=[b, c], negative=[a], topn=top_k
        )
        predictions = [word for word, _ in result]

        if d in predictions:
            correct += 1

    accuracy = correct / total if total > 0 else 0

    print("Analogy Evaluation:")
    print(f"  Analogies evaluated: {total} / {len(analogies)}")
    print(f"  Accuracy (top-{top_k}): {accuracy:.3%}")

    return accuracy

# ============================================
# Extrinsic Evaluation: Downstream Task
# ============================================

def evaluate_on_classification(
    word_vectors,
    train_docs, train_labels,
    test_docs, test_labels
):
    """
    Evaluate embeddings on a text classification task.
    Uses simple averaged embeddings + logistic regression.
    """
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, f1_score

    def get_doc_embedding(doc, wv):
        words = doc.lower().split()
        vectors = [wv[w] for w in words if w in wv]
        if vectors:
            return np.mean(vectors, axis=0)
        return np.zeros(wv.vector_size)

    X_train = np.array([get_doc_embedding(d, word_vectors) for d in train_docs])
    X_test = np.array([get_doc_embedding(d, word_vectors) for d in test_docs])

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, train_labels)

    predictions = clf.predict(X_test)
    acc = accuracy_score(test_labels, predictions)
    f1 = f1_score(test_labels, predictions, average='weighted')

    print("Classification Evaluation:")
    print(f"  Accuracy: {acc:.3%}")
    print(f"  F1 Score: {f1:.3%}")

    return acc, f1

# ============================================
# Comprehensive Evaluation
# ============================================

def full_embedding_evaluation(word_vectors, datasets: dict):
    """
    Run a comprehensive evaluation suite.

    Args:
        word_vectors: Embeddings to evaluate
        datasets: Dict with paths to similarity, analogy, classification data
    """
    results = {}

    # Word similarity
    if 'similarity' in datasets:
        pairs = load_similarity_dataset(datasets['similarity'])
        results['similarity'] = evaluate_word_similarity(word_vectors, pairs)

    # Analogies
    if 'analogies' in datasets:
        analogies = load_analogy_dataset(datasets['analogies'])
        results['analogy'] = evaluate_analogies(word_vectors, analogies)

    # Classification
    if 'classification' in datasets:
        data = datasets['classification']
        acc, f1 = evaluate_on_classification(
            word_vectors,
            data['train_docs'], data['train_labels'],
            data['test_docs'], data['test_labels']
        )
        results['classification_acc'] = acc
        results['classification_f1'] = f1

    return results
```

While intrinsic metrics (similarity, analogies) are informative, extrinsic evaluation on your actual task matters most. Embeddings that score poorly on general benchmarks may excel on your specific domain task. Always evaluate on a held-out portion of your actual data before finalizing your embedding choice.
Let's consolidate the key practices for working with pre-trained embeddings:
| Scenario | Recommended Approach |
|---|---|
| General English NLP, quality first | GloVe 840B (300d) or GloVe 6B (300d) |
| General English NLP, size constrained | GloVe 6B (100d) or (50d) |
| Social media / informal text | GloVe Twitter (200d) or FastText CC |
| Non-English language | FastText Wiki/CC for that language |
| OOV handling critical | FastText (any variant with subwords) |
| Medical / legal / scientific | Domain-specific embeddings if available; else fine-tune FastText |
| Fairness / bias concerns | ConceptNet Numberbatch (debiased) |
We've covered the practical side of using pre-trained word embeddings: what's available, how to choose and load them, how to integrate them into deep learning models, how to adapt them to a domain, and how to evaluate the result.
Module Complete:
Congratulations! You've completed the module on Word Embeddings. You now understand how word embeddings are learned, what pre-trained options exist, and how to select, load, adapt, and evaluate them for your own tasks.
These techniques form the foundation for representing text in machine learning. While contextualized embeddings (BERT, etc.) are increasingly popular, static word embeddings remain valuable for their efficiency, interpretability, and effectiveness on many tasks.
From foundational methods through practical usage, these techniques power everything from search engines to chatbots, and fluency with embeddings is essential for any NLP practitioner. Apply them to represent text effectively in your own machine learning projects.