In 2013, a team at Google led by Tomas Mikolov published a paper that fundamentally transformed natural language processing. The paper introduced Word2Vec—a family of neural network models that could learn to represent words as dense, low-dimensional vectors in a continuous vector space. What made this remarkable wasn't just technical elegance; it was the stunning discovery that the resulting vectors captured semantic relationships through vector arithmetic.
king - man + woman ≈ queen
This simple equation demonstrated something profound: the neural network had learned that the relationship between 'king' and 'man' parallels the relationship between 'queen' and 'woman.' The model hadn't been explicitly taught about gender or royalty—it discovered these concepts by analyzing billions of words in context.
By the end of this page, you will understand the complete Word2Vec architecture—both Skip-gram and CBOW variants—from intuition through mathematical formulation to training optimization. You'll grasp why these models work, how to train them efficiently, and when to choose each architecture. This knowledge forms the foundation for all modern word embedding techniques.
Before Word2Vec, the dominant paradigm for representing words in computational systems was one-hot encoding and its extension, the bag-of-words model. While these approaches enabled significant progress in information retrieval and text classification, they suffer from fundamental limitations that Word2Vec elegantly addresses.
| Property | One-Hot / Bag-of-Words | Word2Vec Embeddings |
|---|---|---|
| Dimensionality | V (vocabulary size: 10K-1M+) | d (typically 100-300) |
| Vector Density | Sparse (one 1, rest 0s) | Dense (all dimensions meaningful) |
| Similarity Computation | All word pairs equally distant | Similar words → similar vectors |
| Semantic Relationships | Not captured | Captured through linear relationships |
| Memory Requirements | O(V) per word | O(d) per word, d << V |
| Generalization | No generalization between words | Transfers knowledge between similar words |
The curse of dimensionality in word spaces:
Consider a vocabulary of 100,000 words represented as one-hot vectors. Each word occupies a unique axis in a 100,000-dimensional space. In this representation, every pair of distinct words is orthogonal: the dot product between any two different word vectors is zero, their cosine similarity is zero, and all pairwise Euclidean distances equal √2. 'Happy' is exactly as far from 'joyful' as it is from 'spreadsheet.'
This orthogonality destroys any hope of generalization. A model trained to classify 'happy' as positive sentiment must relearn this from scratch for 'joyful,' 'elated,' 'content,' and every other synonym.
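A quick numeric check makes the problem concrete. This is a minimal sketch; the ten-word vocabulary and the `cosine` helper are made up purely for illustration:

```python
import numpy as np

# Toy vocabulary: one-hot encoding gives every word its own axis
vocab = ["happy", "joyful", "elated", "sad", "cat", "dog",
         "algorithm", "theory", "king", "queen"]
V = len(vocab)
one_hot = np.eye(V)  # Row i is the one-hot vector for vocab[i]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

happy, joyful, cat = one_hot[0], one_hot[1], one_hot[4]
print(cosine(happy, joyful))           # 0.0 -- synonyms look unrelated
print(cosine(happy, cat))              # 0.0 -- same as every other pair
print(np.linalg.norm(happy - joyful))  # 1.414... identical for all pairs
```

The numbers are the whole story: nothing in the representation distinguishes a synonym pair from two unrelated words.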
Word2Vec succeeds because it operationalizes the distributional hypothesis from linguistics: 'You shall know a word by the company it keeps' (J.R. Firth, 1957). Words appearing in similar contexts tend to have similar meanings. 'Cat' and 'dog' both appear near 'pet,' 'feed,' 'veterinarian,' while 'algorithm' and 'theory' appear near 'computer,' 'mathematical,' 'prove.'
The key insight of distributed representations:
Instead of dedicating one dimension per word (one-hot), Word2Vec distributes each word's meaning across many dimensions, and each dimension participates in representing many words. This sharing creates dense, compact representations in which similar words occupy nearby regions of the space, so what the model learns about one word transfers to its neighbors.
The Skip-gram model is generally the stronger of Word2Vec's two architectures, particularly for rare words and semantic tasks. Its fundamental task is deceptively simple: given a center word, predict the surrounding context words. This seemingly trivial objective forces the model to learn rich semantic representations.
The training objective:
Given a sequence of training words w₁, w₂, ..., w_T, the Skip-gram model maximizes the average log probability:
$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \leq j \leq c, j \neq 0} \log P(w_{t+j} | w_t)$$
where:

- T is the number of words in the training corpus
- c is the context window size
- P(w_{t+j} | w_t) is the probability of observing context word w_{t+j} given center word w_t
```python
# Intuition: Skip-gram training pairs
# For the sentence: "The quick brown fox jumps over the lazy dog"
# With center word "fox" and window size c=2

sentence = "The quick brown fox jumps over the lazy dog".split()
center_word = "fox"
center_idx = sentence.index(center_word)  # 3
window_size = 2

# Skip-gram generates training pairs: (center_word, context_word)
training_pairs = []
for j in range(-window_size, window_size + 1):
    if j != 0:  # Skip the center word itself
        context_idx = center_idx + j
        if 0 <= context_idx < len(sentence):
            context_word = sentence[context_idx]
            training_pairs.append((center_word, context_word))

print("Skip-gram training pairs for center word 'fox':")
for center, context in training_pairs:
    print(f"  ({center}, {context})")

# Output:
# Skip-gram training pairs for center word 'fox':
#   (fox, quick)
#   (fox, brown)
#   (fox, jumps)
#   (fox, over)

# The model learns: given "fox", predict likely neighbors
# This forces "fox" to encode information about what appears nearby
```

The neural network structure:
Skip-gram uses a simple two-layer neural network: a one-hot input of size V is projected (with no nonlinearity) onto a hidden layer of dimension d, which is then projected onto an output layer of size V and passed through a softmax.

Crucially, there are two weight matrices: the input (embedding) matrix W, whose row v_w for each word serves as its center-word vector, and the output (context) matrix W', which holds a second vector v'_w per word used to score it as a candidate context word.
After training, we typically use only W as our word embeddings, though averaging W and W' sometimes yields better results.
The softmax probability:
The basic Skip-gram formulation uses a softmax to define the probability of observing a context word given a center word:
$$P(w_O | w_I) = \frac{\exp({v'_{w_O}}^\top v_{w_I})}{\sum_{w=1}^{V} \exp({v'_w}^\top v_{w_I})}$$
where:

- w_I is the input (center) word and w_O the output (context) word
- v_{w_I} is the center word's vector from the input matrix W
- v'_{w_O} and v'_w are context vectors from the output matrix W'
- V is the vocabulary size
The computational problem:
This softmax requires summing over the entire vocabulary (often 100K+ words) for every training example. With billions of training examples, this is computationally prohibitive. This motivates the efficiency techniques we'll cover shortly.
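To make the cost concrete, here is a minimal NumPy sketch of this naive forward pass; the vocabulary size, dimension, and random initialization are illustrative assumptions, not the original implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 50_000, 100                              # illustrative vocabulary size and dimension

W = rng.normal(scale=0.01, size=(V, d))         # input matrix: row w is v_w (center vectors)
W_prime = rng.normal(scale=0.01, size=(V, d))   # output matrix: row w is v'_w (context vectors)

def skipgram_softmax(center_idx):
    """P(every word | center word) under the naive full softmax."""
    v_center = W[center_idx]                    # d-dimensional center vector
    scores = W_prime @ v_center                 # V dot products: the O(V) bottleneck
    scores -= scores.max()                      # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()        # normalize over the entire vocabulary

probs = skipgram_softmax(center_idx=42)
print(probs.shape)   # (50000,) -- one probability per vocabulary word
print(probs.sum())   # ~1.0
```

Every training pair pays for all V dot products plus the normalizing sum, which is exactly the work that negative sampling and hierarchical softmax avoid.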
Having separate input (W) and output (W') matrices prevents a degenerate solution where the model simply memorizes word co-occurrences. With a single matrix, the optimization could collapse to a trivial solution. The separation forces the model to learn generalizable representations that predict context words through a compressed bottleneck.
CBOW (Continuous Bag-of-Words) inverts the Skip-gram objective: instead of predicting context words from a center word, CBOW predicts the center word from its surrounding context. This architectural flip has significant implications for training efficiency and the nature of learned representations.
The training objective:
Given a sequence of training words, CBOW maximizes:
$$\frac{1}{T} \sum_{t=1}^{T} \log P(w_t | w_{t-c}, ..., w_{t-1}, w_{t+1}, ..., w_{t+c})$$
The model takes all context words within window size c and predicts the center word.
```python
# Intuition: CBOW training examples
# For the sentence: "The quick brown fox jumps over the lazy dog"
# With target word "fox" and window size c=2

sentence = "The quick brown fox jumps over the lazy dog".split()
target_word = "fox"
target_idx = sentence.index(target_word)  # 3
window_size = 2

# CBOW generates: (context_words, target_word)
context_words = []
for j in range(-window_size, window_size + 1):
    if j != 0:  # Skip the target word itself
        context_idx = target_idx + j
        if 0 <= context_idx < len(sentence):
            context_words.append(sentence[context_idx])

print("CBOW training example for target word 'fox':")
print(f"  Context: {context_words}")
print(f"  Target: {target_word}")
print(f"  Training pair: ({context_words}, '{target_word}')")

# Output:
# CBOW training example for target word 'fox':
#   Context: ['quick', 'brown', 'jumps', 'over']
#   Target: fox
#   Training pair: (['quick', 'brown', 'jumps', 'over'], 'fox')

# The model learns: given surrounding words, predict the center
# Context words are averaged (bag-of-words) before prediction
```

The neural network structure:
CBOW also uses a two-layer network, but with a key difference in how context is processed: every context word is looked up in the input matrix W, the resulting vectors are averaged into a single hidden representation, and that average is scored against the output matrix W' to predict the center word.
The averaging step:
Given context words {w_{t-c}, ..., w_{t+c}} (excluding wₜ), CBOW computes:
$$\hat{v} = \frac{1}{2c} \sum_{-c \leq j \leq c, j \neq 0} v_{w_{t+j}}$$
This averaged vector then predicts the center word via softmax:
$$P(w_t | \text{context}) = \frac{\exp({v'_{w_t}}^\top \hat{v})}{\sum_{w=1}^{V} \exp({v'_w}^\top \hat{v})}$$
The averaging operation means CBOW treats context as an unordered bag—the words 'dog bites man' and 'man bites dog' produce identical context representations. This is a deliberate simplification that trades positional information for training efficiency. The model learns from word co-occurrence patterns, not sequential structure.
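A minimal sketch of the CBOW forward pass (shapes and random initialization are illustrative assumptions) makes the averaging step, and its order-invariance, explicit:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10_000, 100
W = rng.normal(scale=0.01, size=(V, d))         # input matrix: context-word vectors
W_prime = rng.normal(scale=0.01, size=(V, d))   # output matrix: center-word scoring vectors

def cbow_probs(context_ids):
    """P(center word | context) from the averaged context representation."""
    v_hat = W[context_ids].mean(axis=0)         # average the 2c context vectors -> (d,)
    scores = W_prime @ v_hat                    # score every candidate center word
    scores -= scores.max()                      # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

context = [17, 256, 903, 4501]                  # arbitrary context word indices
p_forward = cbow_probs(context)
p_reversed = cbow_probs(context[::-1])
print(np.allclose(p_forward, p_reversed))       # True: word order never enters the model
```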
The choice between Skip-gram and CBOW depends on your data characteristics, computational constraints, and downstream task requirements. Each architecture makes different tradeoffs.
| Property | Skip-gram | CBOW |
|---|---|---|
| Training time (relative) | Slower (10x more updates) | Faster (1 update per window) |
| Rare word quality | Excellent | Poor to moderate |
| Frequent word quality | Good | Excellent |
| Semantic accuracy | Higher (analogies, similarities) | Moderate |
| Syntactic accuracy | Moderate | Higher (part-of-speech, morphology) |
| Recommended corpus size | Small to medium (< 1B words) | Large (> 1B words) |
| Context utilization | Each context word separately | All context words averaged together |
The gradient update perspective:
The key difference lies in how training updates flow:
Skip-gram: For each center word occurrence, generates 2c separate predictions. Each context word gets a direct gradient signal. A rare word appearing in context still receives meaningful updates.
CBOW: For each center word occurrence, generates 1 prediction from averaged context. Rare context words get diluted updates (their contribution is averaged with frequent words).
This explains why Skip-gram better represents rare words—they receive undiluted gradient updates when they appear in either center or context position.
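A toy count over the example sentence makes the difference in gradient traffic concrete; the window size and sentence are illustrative:

```python
# Skip-gram makes up to 2c predictions per position; CBOW makes one from the averaged context
sentence = "The quick brown fox jumps over the lazy dog".split()
c = 2

skipgram_updates = 0
cbow_updates = 0
for t in range(len(sentence)):
    context = [t + j for j in range(-c, c + 1)
               if j != 0 and 0 <= t + j < len(sentence)]
    skipgram_updates += len(context)   # one (center, context) prediction per neighbor
    cbow_updates += 1                  # one averaged prediction per position

print(skipgram_updates)  # 30 -- each context word receives its own gradient signal
print(cbow_updates)      # 9  -- rare context words are diluted inside the average
```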
For most NLP applications, Skip-gram with negative sampling (covered next) is the default choice. It produces high-quality embeddings for both frequent and rare words, and modern implementations are fast enough for large corpora. Use CBOW only when training speed is critical and your vocabulary has few rare words of importance.
The naive softmax formulation requires computing a sum over the entire vocabulary for every training example—clearly impractical for real-world vocabularies. Negative Sampling is the key optimization that made Word2Vec practical at scale.
The core idea:
Instead of predicting the probability of a context word among all vocabulary words, negative sampling reframes the task as binary classification: distinguish real (center, context) pairs from fake ones.
For each real pair (w_I, w_O), we generate k 'negative' examples by sampling random words that didn't appear in the context. The model learns to assign high probability to real pairs and low probability to fake pairs.
The negative sampling objective:
For a (center word, context word) pair, maximize:
$$\log \sigma({v'_{w_O}}^\top v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[\log \sigma(-{v'_{w_i}}^\top v_{w_I})\right]$$
where:

- σ(x) = 1 / (1 + e^{−x}) is the sigmoid function
- v_{w_I} is the input vector of the center word, and v'_{w_O} the output vector of the observed context word
- k is the number of negative samples (typically 5-20)
- P_n(w) is the noise distribution from which the negative words w_i are drawn
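The sketch below evaluates this objective (negated, as a loss) for a single training pair with k sampled negatives; the dimension and random vectors are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(v_center, v_context_out, v_negatives_out):
    """-log sigma(v'_O . v_I) - sum_i log sigma(-v'_i . v_I) for one (center, context) pair."""
    positive_term = np.log(sigmoid(v_context_out @ v_center))
    negative_term = np.sum(np.log(sigmoid(-(v_negatives_out @ v_center))))
    return -(positive_term + negative_term)

rng = np.random.default_rng(0)
d, k = 100, 5
v_center = rng.normal(scale=0.1, size=d)               # v_{w_I}, a row of W
v_context_out = rng.normal(scale=0.1, size=d)          # v'_{w_O}, a row of W'
v_negatives_out = rng.normal(scale=0.1, size=(k, d))   # k rows of W' drawn from P_n(w)

print(negative_sampling_loss(v_center, v_context_out, v_negatives_out))
# Only k + 1 dot products per training pair, instead of V under the full softmax
```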
The noise distribution:
The original Word2Vec paper uses a modified unigram distribution:
$$P_n(w) = \frac{f(w)^{0.75}}{\sum_{w'} f(w')^{0.75}}$$
The 3/4 power (0.75) flattens the distribution, giving rare words more chance of being sampled as negatives. This prevents the model from simply learning to predict against very frequent words.
```python
import numpy as np
from collections import Counter

def build_negative_sampling_distribution(word_counts, power=0.75):
    """
    Build the noise distribution for negative sampling.
    Uses the 3/4 power to smooth frequencies.
    """
    words = list(word_counts.keys())
    frequencies = np.array([word_counts[w] for w in words])

    # Apply power to flatten distribution
    adjusted_freqs = frequencies ** power

    # Normalize to probability distribution
    sampling_probs = adjusted_freqs / adjusted_freqs.sum()

    return words, sampling_probs

def sample_negatives(positive_word, words, sampling_probs, k=5):
    """
    Sample k negative words, excluding the positive context word.
    """
    negatives = []
    while len(negatives) < k:
        idx = np.random.choice(len(words), p=sampling_probs)
        if words[idx] != positive_word:
            negatives.append(words[idx])
    return negatives

# Example usage
word_counts = Counter({
    'the': 1000000, 'a': 800000, 'is': 600000,
    'cat': 5000, 'dog': 4500, 'elephant': 100,
    'algorithm': 2000, 'neural': 1500, 'network': 3000
})

words, probs = build_negative_sampling_distribution(word_counts)

print("Sampling distribution comparison:")
print(f"{'Word':<12} {'Count':<10} {'Raw P(w)':<12} {'Adjusted P(w)':<12}")
print("-" * 46)
for w in ['the', 'cat', 'elephant']:
    raw_prob = word_counts[w] / sum(word_counts.values())
    adj_prob = probs[words.index(w)]
    print(f"{w:<12} {word_counts[w]:<10} {raw_prob:<12.6f} {adj_prob:<12.6f}")

# Output shows how rare words get boosted probability
```

| Word | Frequency | Unigram P(w) | Adjusted P(w)^0.75 | Relative Change |
|---|---|---|---|---|
| 'the' | 1,000,000 | 41.7% | 28.2% | ↓ 32% |
| 'cat' | 5,000 | 0.21% | 0.45% | ↑ 114% |
| 'elephant' | 100 | 0.004% | 0.028% | ↑ 600% |
Negative sampling can be understood as an inexpensive substitute for the full softmax objective. It is mathematically motivated by Noise Contrastive Estimation (NCE), which avoids computing the expensive normalization term by training a classifier to distinguish observed data from noise samples; Word2Vec's formulation is a simplified variant of NCE optimized for embedding quality rather than density estimation.
Hierarchical Softmax is an alternative approximation to the full softmax that reduces the computational complexity from O(V) to O(log V). It organizes the vocabulary into a binary tree structure, where each leaf node represents a word.
The core idea:
Instead of computing probabilities for all V words, hierarchical softmax represents each word as a path from the root to a leaf in a binary tree. The probability of a word is the product of probabilities at each branch point along its path:
$$P(w | w_I) = \prod_{j=1}^{L(w)} \sigma\left([\![ n(w,j+1) = \text{left-child}(n(w,j)) ]\!] \cdot {v'_{n(w,j)}}^\top v_{w_I}\right)$$
where:

- L(w) is the number of branch points (internal nodes) on the path from the root to w's leaf
- n(w, j) is the j-th node on that path, with n(w, 1) the root and n(w, L(w)+1) the leaf for w
- [[x]] evaluates to 1 if x is true and −1 otherwise, flipping the sigmoid for right branches (since σ(−x) = 1 − σ(x))
- v'_{n(w,j)} is the vector attached to internal node n(w, j), and v_{w_I} is the input vector of the center word
Building the Huffman tree:
Word2Vec uses a Huffman tree, in which frequent words have shorter paths. This is optimal because Huffman coding minimizes the expected code length under the word-frequency distribution: the words that appear most often in the corpus require the fewest branch decisions, so the average per-token cost of evaluating these products is as small as possible.
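Here is a minimal sketch of how one word's probability would be assembled from its path; the path vectors and branch directions are made-up placeholders, whereas a real implementation derives them from the Huffman tree:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hierarchical_softmax_prob(v_center, path_node_vectors, path_directions):
    """
    Probability of one word as a product over its branch decisions.
    path_node_vectors: (L, d) vectors of the internal nodes on the word's path.
    path_directions:   (L,) entries, +1 for a left branch and -1 for a right branch.
    """
    scores = path_node_vectors @ v_center              # one dot product per branch point
    return float(np.prod(sigmoid(path_directions * scores)))

rng = np.random.default_rng(0)
d = 100
v_center = rng.normal(scale=0.1, size=d)
path_vectors = rng.normal(scale=0.1, size=(3, d))      # a frequent word: only 3 branch points
path_directions = np.array([+1, -1, +1])

print(hierarchical_softmax_prob(v_center, path_vectors, path_directions))
# Because sigma(x) + sigma(-x) = 1 at every branch, these probabilities sum
# to exactly 1 over all leaves, giving a properly normalized distribution.
```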
Comparison with negative sampling:
| Property | Negative Sampling | Hierarchical Softmax |
|---|---|---|
| Complexity per example | O(k·d), k ≈ 5-20 | O(log V · d) |
| Best for | Skip-gram, large corpora | CBOW, smaller vocabularies |
| Representation | Direct word vectors | Internal-node vectors + per-word tree paths |
| Quality on rare words | Better (direct updates) | Slightly worse (path-based) |
| Implementation | Simple | More complex |
Negative sampling is almost always preferred in practice. It's simpler to implement, typically produces slightly better embeddings, and the constant factor (k ≈ 5-20) is often smaller than log V for reasonable vocabularies. Use hierarchical softmax only when you need a proper probability distribution over words (negative sampling doesn't normalize).
Extremely common words like 'the,' 'a,' and 'is' present two problems for Word2Vec training: they carry little information about the words around them (nearly every word co-occurs with 'the'), and they dominate the stream of training pairs, consuming most of the computation while contributing little to embedding quality.
The subsampling technique:
Word2Vec probabilistically discards frequent words during training. Each word w is kept with probability:
$$P(\text{keep } w) = \sqrt{\frac{t}{f(w)}} + \frac{t}{f(w)}$$
where:

- t is the subsampling threshold (typically 10⁻⁵)
- f(w) is the frequency of w as a fraction of total corpus tokens
Words with frequency < t are always kept. Words with frequency >> t are heavily downsampled.
```python
import numpy as np

def compute_keep_probability(word_freq, threshold=1e-5):
    """
    Compute probability of keeping a word during training.
    Frequent words are heavily downsampled.
    """
    if word_freq < threshold:
        return 1.0  # Always keep rare words

    # Original Word2Vec formula
    prob = np.sqrt(threshold / word_freq) + threshold / word_freq
    return min(prob, 1.0)

# Example: Effect on common English words
word_frequencies = {
    'the': 0.07,            # ~7% of words
    'is': 0.01,             # ~1% of words
    'algorithm': 0.0001,    # 0.01%
    'eigenvalue': 0.000001  # 0.0001%
}

print("Subsampling effect (threshold = 1e-5):")
print(f"{'Word':<15} {'Frequency':<12} {'Keep Prob':<12} {'Effect'}")
print("-" * 55)

for word, freq in word_frequencies.items():
    prob = compute_keep_probability(freq)
    if prob < 0.05:
        effect = "Heavily downsampled"
    elif prob < 0.5:
        effect = "Moderately downsampled"
    elif prob < 1.0:
        effect = "Slightly downsampled"
    else:
        effect = "Always kept"
    print(f"{word:<15} {freq:<12.6f} {prob:<12.4f} {effect}")

# Output:
# Word            Frequency    Keep Prob    Effect
# -------------------------------------------------------
# the             0.070000     0.0121       Heavily downsampled
# is              0.010000     0.0326       Heavily downsampled
# algorithm       0.000100     0.4162       Moderately downsampled
# eigenvalue      0.000001     1.0000       Always kept
```

Subsampling has a dual benefit: (1) it speeds up training by processing fewer tokens, and (2) it widens effective context windows. When 'the' is removed from 'the quick brown fox,' the effective window now connects 'quick' more directly to words that were previously too far away. This strengthens meaningful word associations.
Training high-quality Word2Vec embeddings requires careful attention to hyperparameters. Each parameter involves tradeoffs between quality, speed, and memory.
| Parameter | Typical Range | Effect | Recommendation |
|---|---|---|---|
| Embedding dimension (d) | 50-300 | Higher = more expressive but slower; diminishing returns past ~300 | 100-200 for most tasks; 300 for demanding semantic tasks |
| Window size (c) | 2-10 | Larger = more syntactic; smaller = more semantic | 5 for balanced; 2-3 for semantic; 8-10 for syntactic |
| Negative samples (k) | 5-20 | More = stabler gradients but slower; diminishing returns past ~15 | 5-10 for large data; 15-20 for small data |
| Subsampling threshold (t) | 10^-5 to 10^-3 | Smaller = more aggressive downsampling of frequent words | 10^-5 for general use |
| Min word count | 5-100 | Words below threshold are excluded from vocabulary | 5-10 for large corpora; higher for small |
| Learning rate (α) | 0.01-0.05 | Higher = faster but less stable; typically decays during training | 0.025 initial, decaying to 0.0001 |
| Training epochs | 1-20 | More = better but diminishing returns; large corpora need fewer | 5 for large data; 10-20 for small |
```python
from gensim.models import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec

class LossCallback(CallbackAny2Vec):
    """Track training progress."""
    def __init__(self):
        self.epoch = 0

    def on_epoch_end(self, model):
        self.epoch += 1
        print(f"Epoch {self.epoch} completed")

# Assume 'sentences' is an iterable of tokenized sentences
# sentences = [["the", "quick", "brown", "fox"], ["jumped", "over", "lazy", "dog"], ...]

# Skip-gram with negative sampling (recommended defaults)
sg_model = Word2Vec(
    sentences=sentences,
    vector_size=200,    # Embedding dimension
    window=5,           # Context window size
    min_count=5,        # Ignore words with < 5 occurrences
    sg=1,               # 1 for Skip-gram, 0 for CBOW
    negative=10,        # Number of negative samples
    sample=1e-5,        # Subsampling threshold
    epochs=5,           # Training epochs
    workers=4,          # Parallel threads
    callbacks=[LossCallback()]
)

# CBOW with hierarchical softmax (alternative)
cbow_model = Word2Vec(
    sentences=sentences,
    vector_size=200,
    window=5,
    min_count=5,
    sg=0,               # CBOW
    hs=1,               # Hierarchical softmax
    negative=0,         # Disable negative sampling
    epochs=5,
    workers=4
)

# Access trained embeddings
king_vector = sg_model.wv['king']  # 200-dimensional numpy array

# Semantic similarity
similarity = sg_model.wv.similarity('king', 'queen')
print(f"Similarity(king, queen) = {similarity:.4f}")

# Analogy: king - man + woman ≈ ?
result = sg_model.wv.most_similar(
    positive=['king', 'woman'],
    negative=['man'],
    topn=5
)
print("king - man + woman ≈", result[0][0])  # Should be 'queen'

# Save and load models
sg_model.save("word2vec_skipgram.model")
loaded_model = Word2Vec.load("word2vec_skipgram.model")
```

Vocabulary issues: Words not in the training vocabulary get no embedding. For out-of-vocabulary words, consider FastText (covered later) which handles subwords.
Overfitting on small data: With fewer than ~1 million tokens, consider using pre-trained embeddings and fine-tuning rather than training from scratch.
Non-determinism: Word2Vec uses random initialization and random negative sampling. Set seeds for reproducibility if needed.
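Two of these pitfalls can be handled directly with gensim. The snippet below is a sketch: the pre-trained model name comes from the gensim-data catalog, and fully deterministic runs additionally require a fixed PYTHONHASHSEED environment variable:

```python
import gensim.downloader as api
from gensim.models import Word2Vec

# Small corpus: start from pre-trained vectors rather than training from scratch.
# "glove-wiki-gigaword-100" is one of the vector sets distributed via gensim-data.
pretrained = api.load("glove-wiki-gigaword-100")   # returns a KeyedVectors object
print(pretrained.most_similar("king", topn=3))

# Reproducible training: fix the seed and use a single worker thread,
# since multi-threaded scheduling makes update order nondeterministic.
toy_sentences = [["the", "quick", "brown", "fox"], ["the", "lazy", "dog"]]
model = Word2Vec(toy_sentences, vector_size=50, min_count=1, sg=1,
                 seed=42, workers=1, epochs=5)
```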
What exactly does Word2Vec learn? The embedding space exhibits remarkable structure that emerges purely from co-occurrence statistics:
Linear substructures:
Word2Vec embeddings capture semantic and syntactic relationships as approximately linear transformations. The classic examples:
```python
import numpy as np

def analogy(model, a, b, c, topn=5):
    """
    Solve analogy: a is to b as c is to ?
    Returns the words whose vectors are closest to (b - a + c).
    """
    try:
        # Compute the target vector
        # We want: a:b :: c:?
        # So: ? = b - a + c
        result = model.wv.most_similar(
            positive=[b, c],
            negative=[a],
            topn=topn
        )
        return result
    except KeyError as e:
        return f"Word not in vocabulary: {e}"

# Semantic analogies
print("Semantic Analogies:")
print(f"king:queen :: man:? → {analogy(model, 'king', 'queen', 'man')[0][0]}")
print(f"France:Paris :: Italy:? → {analogy(model, 'France', 'Paris', 'Italy')[0][0]}")
print(f"big:bigger :: small:? → {analogy(model, 'big', 'bigger', 'small')[0][0]}")

# Syntactic analogies
print("Syntactic Analogies:")
print(f"walk:walked :: run:? → {analogy(model, 'walk', 'walked', 'run')[0][0]}")
print(f"slow:slowly :: quick:? → {analogy(model, 'slow', 'slowly', 'quick')[0][0]}")

# Explore the neighborhood of a word
def explore_neighborhood(model, word, topn=10):
    """Show the nearest neighbors of a word in embedding space."""
    neighbors = model.wv.most_similar(word, topn=topn)
    print(f"Nearest neighbors of '{word}':")
    for neighbor, similarity in neighbors:
        print(f"  {neighbor}: {similarity:.4f}")

explore_neighborhood(model, 'python')
# Expected output might include: java, ruby, perl, programming, ...
```

Why linear relationships emerge:
The linear structure isn't explicitly encoded; it emerges from the training objective. When Skip-gram maximizes log P(context | center) using dot products, it implicitly learns that words with similar context distributions must receive similar vectors, and that a consistent difference between two words' contexts shows up as a consistent offset between their vectors.
If 'king' and 'queen' differ primarily in the gender of associated contexts (he/she, his/her), while 'man' and 'woman' share this same pattern, then the vectors will naturally organize such that (king - queen) ≈ (man - woman).
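You can check this offset structure directly on a trained model; the sketch below assumes `model` is a trained Word2Vec instance (such as sg_model from the earlier gensim example) and that all four words made it into the vocabulary:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# If gender is encoded as a consistent direction, the two difference
# vectors should point roughly the same way in embedding space.
offset_royal = model.wv['king'] - model.wv['queen']
offset_plain = model.wv['man'] - model.wv['woman']
print(f"cosine(king - queen, man - woman) = {cosine(offset_royal, offset_plain):.3f}")
# A value well above zero reflects the parallel-offset structure described above
```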
While compelling, the analogy evaluations have caveats: (1) They work for some relationships better than others, (2) They're sensitive to how the evaluation set is constructed, (3) They measure a narrow aspect of semantic knowledge. Don't over-interpret analogy accuracy as a complete measure of embedding quality.
Word2Vec fundamentally changed how we represent words in machine learning. Let's consolidate the key concepts:

- Skip-gram predicts context words from a center word; it trains more slowly but represents rare words well and excels at semantic tasks.
- CBOW predicts a center word from its averaged context; it trains faster and favors frequent words and syntactic regularities.
- Negative sampling replaces the full softmax with binary classification against k sampled noise words, drawn from a unigram distribution raised to the 3/4 power.
- Hierarchical softmax reduces the per-example cost to O(log V) using a Huffman tree and yields a properly normalized distribution.
- Subsampling of frequent words speeds up training and widens effective context windows.
- The learned space exhibits approximately linear semantic and syntactic structure, enabling analogies through vector arithmetic.
What's next:
Now that we understand how Word2Vec produces individual word embeddings, the next pages explore how to use these embeddings for representing documents and sentences. We'll cover Average Word2Vec (simple averaging) and TF-IDF weighted Word2Vec (weighted averaging that respects word importance).
You now have a comprehensive understanding of Word2Vec's architecture, training procedures, and the remarkable properties of learned embeddings. These concepts form the foundation for all modern word embedding techniques and remain central to practical NLP applications.