In 2013, a team at Google led by Tomas Mikolov published a paper that fundamentally transformed natural language processing. The paper introduced Word2Vec—a family of neural network models that could learn to represent words as dense, low-dimensional vectors in a continuous vector space. What made this remarkable wasn't just technical elegance; it was the stunning discovery that the resulting vectors captured semantic relationships through vector arithmetic.
king - man + woman ≈ queen
This simple equation demonstrated something profound: the neural network had learned that the relationship between 'king' and 'man' parallels the relationship between 'queen' and 'woman.' The model hadn't been explicitly taught about gender or royalty—it discovered these concepts by analyzing billions of words in context.
By the end of this page, you will understand the complete Word2Vec architecture—both Skip-gram and CBOW variants—from intuition through mathematical formulation to training optimization. You'll grasp why these models work, how to train them efficiently, and when to choose each architecture. This knowledge forms the foundation for all modern word embedding techniques.
Before Word2Vec, the dominant paradigm for representing words in computational systems was one-hot encoding and its extension, the bag-of-words model. While these approaches enabled significant progress in information retrieval and text classification, they suffer from fundamental limitations that Word2Vec elegantly addresses.
| Property | One-Hot / Bag-of-Words | Word2Vec Embeddings |
|---|---|---|
| Dimensionality | V (vocabulary size: 10K-1M+) | d (typically 100-300) |
| Vector Density | Sparse (one 1, rest 0s) | Dense (all dimensions meaningful) |
| Similarity Computation | All word pairs equally distant | Similar words → similar vectors |
| Semantic Relationships | Not captured | Captured through linear relationships |
| Memory Requirements | O(V) per word | O(d) per word, d << V |
| Generalization | No generalization between words | Transfers knowledge between similar words |
The curse of dimensionality in word spaces:
Consider a vocabulary of 100,000 words represented as one-hot vectors. Each word occupies a unique axis in a 100,000-dimensional space. In this representation, every pair of distinct words is orthogonal: the dot product between any two different word vectors is zero, their cosine similarity is zero, and all pairwise Euclidean distances equal √2. 'Happy' is exactly as far from 'joyful' as it is from 'spreadsheet.'
This orthogonality destroys any hope of generalization. A model trained to classify 'happy' as positive sentiment must relearn this from scratch for 'joyful,' 'elated,' 'content,' and every other synonym.
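A quick numeric check makes the problem concrete. This is a minimal sketch; the ten-word vocabulary and the `cosine` helper are made up purely for illustration:

```python
import numpy as np

# Toy vocabulary: one-hot encoding gives every word its own axis
vocab = ["happy", "joyful", "elated", "sad", "cat", "dog",
         "algorithm", "theory", "king", "queen"]
V = len(vocab)
one_hot = np.eye(V)  # Row i is the one-hot vector for vocab[i]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

happy, joyful, cat = one_hot[0], one_hot[1], one_hot[4]
print(cosine(happy, joyful))           # 0.0 -- synonyms look unrelated
print(cosine(happy, cat))              # 0.0 -- same as every other pair
print(np.linalg.norm(happy - joyful))  # 1.414... identical for all pairs
```

The numbers are the whole story: nothing in the representation distinguishes a synonym pair from two unrelated words.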
Word2Vec succeeds because it operationalizes the distributional hypothesis from linguistics: 'You shall know a word by the company it keeps' (J.R. Firth, 1957). Words appearing in similar contexts tend to have similar meanings. 'Cat' and 'dog' both appear near 'pet,' 'feed,' 'veterinarian,' while 'algorithm' and 'theory' appear near 'computer,' 'mathematical,' 'prove.'
The key insight of distributed representations:
Instead of dedicating one dimension per word (one-hot), Word2Vec distributes each word's meaning across many dimensions, and each dimension participates in representing many words. This sharing creates dense, compact representations in which similar words occupy nearby regions of the space, so what the model learns about one word transfers to its neighbors.
The Skip-gram model is generally the stronger of Word2Vec's two architectures, particularly for rare words and semantic tasks. Its fundamental task is deceptively simple: given a center word, predict the surrounding context words. This seemingly trivial objective forces the model to learn rich semantic representations.
The training objective:
Given a sequence of training words w₁, w₂, ..., w_T, the Skip-gram model maximizes the average log probability:
$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \leq j \leq c, j \neq 0} \log P(w_{t+j} | w_t)$$
where:

- T is the number of words in the training corpus
- c is the context window size
- P(w_{t+j} | w_t) is the probability of observing context word w_{t+j} given center word w_t
```python
# Intuition: Skip-gram training pairs
# For the sentence: "The quick brown fox jumps over the lazy dog"
# With center word "fox" and window size c=2

sentence = "The quick brown fox jumps over the lazy dog".split()
center_word = "fox"
center_idx = sentence.index(center_word)  # 3
window_size = 2

# Skip-gram generates training pairs: (center_word, context_word)
training_pairs = []
for j in range(-window_size, window_size + 1):
    if j != 0:  # Skip the center word itself
        context_idx = center_idx + j
        if 0 <= context_idx < len(sentence):
            context_word = sentence[context_idx]
            training_pairs.append((center_word, context_word))

print("Skip-gram training pairs for center word 'fox':")
for center, context in training_pairs:
    print(f"  ({center}, {context})")

# Output:
# Skip-gram training pairs for center word 'fox':
#   (fox, quick)
#   (fox, brown)
#   (fox, jumps)
#   (fox, over)

# The model learns: given "fox", predict likely neighbors
# This forces "fox" to encode information about what appears nearby
```

The neural network structure:
Skip-gram uses a simple two-layer neural network: a one-hot input of size V is projected (with no nonlinearity) onto a hidden layer of dimension d, which is then projected onto an output layer of size V and passed through a softmax.

Crucially, there are two weight matrices: the input (embedding) matrix W, whose row v_w for each word serves as its center-word vector, and the output (context) matrix W', which holds a second vector v'_w per word used to score it as a candidate context word.
After training, we typically use only W as our word embeddings, though averaging W and W' sometimes yields better results.
The softmax probability:
The basic Skip-gram formulation uses a softmax to define the probability of observing a context word given a center word:
$$P(w_O | w_I) = \frac{\exp({v'_{w_O}}^\top v_{w_I})}{\sum_{w=1}^{V} \exp({v'_w}^\top v_{w_I})}$$
where:

- w_I is the input (center) word and w_O the output (context) word
- v_{w_I} is the center word's vector from the input matrix W
- v'_{w_O} and v'_w are context vectors from the output matrix W'
- V is the vocabulary size
The computational problem:
This softmax requires summing over the entire vocabulary (often 100K+ words) for every training example. With billions of training examples, this is computationally prohibitive. This motivates the efficiency techniques we'll cover shortly.
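To make the cost concrete, here is a minimal NumPy sketch of this naive forward pass; the vocabulary size, dimension, and random initialization are illustrative assumptions, not the original implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 50_000, 100                              # illustrative vocabulary size and dimension

W = rng.normal(scale=0.01, size=(V, d))         # input matrix: row w is v_w (center vectors)
W_prime = rng.normal(scale=0.01, size=(V, d))   # output matrix: row w is v'_w (context vectors)

def skipgram_softmax(center_idx):
    """P(every word | center word) under the naive full softmax."""
    v_center = W[center_idx]                    # d-dimensional center vector
    scores = W_prime @ v_center                 # V dot products: the O(V) bottleneck
    scores -= scores.max()                      # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()        # normalize over the entire vocabulary

probs = skipgram_softmax(center_idx=42)
print(probs.shape)   # (50000,) -- one probability per vocabulary word
print(probs.sum())   # ~1.0
```

Every training pair pays for all V dot products plus the normalizing sum, which is exactly the work that negative sampling and hierarchical softmax avoid.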
Having separate input (W) and output (W') matrices prevents a degenerate solution where the model simply memorizes word co-occurrences. With a single matrix, the optimization could collapse to a trivial solution. The separation forces the model to learn generalizable representations that predict context words through a compressed bottleneck.
CBOW (Continuous Bag-of-Words) inverts the Skip-gram objective: instead of predicting context words from a center word, CBOW predicts the center word from its surrounding context. This architectural flip has significant implications for training efficiency and the nature of learned representations.
The training objective:
Given a sequence of training words, CBOW maximizes:
$$\frac{1}{T} \sum_{t=1}^{T} \log P(w_t | w_{t-c}, ..., w_{t-1}, w_{t+1}, ..., w_{t+c})$$
The model takes all context words within window size c and predicts the center word.
```python
# Intuition: CBOW training examples
# For the sentence: "The quick brown fox jumps over the lazy dog"
# With target word "fox" and window size c=2

sentence = "The quick brown fox jumps over the lazy dog".split()
target_word = "fox"
target_idx = sentence.index(target_word)  # 3
window_size = 2

# CBOW generates: (context_words, target_word)
context_words = []
for j in range(-window_size, window_size + 1):
    if j != 0:  # Skip the target word itself
        context_idx = target_idx + j
        if 0 <= context_idx < len(sentence):
            context_words.append(sentence[context_idx])

print("CBOW training example for target word 'fox':")
print(f"  Context: {context_words}")
print(f"  Target: {target_word}")
print(f"  Training pair: ({context_words}, '{target_word}')")

# Output:
# CBOW training example for target word 'fox':
#   Context: ['quick', 'brown', 'jumps', 'over']
#   Target: fox
#   Training pair: (['quick', 'brown', 'jumps', 'over'], 'fox')

# The model learns: given surrounding words, predict the center
# Context words are averaged (bag-of-words) before prediction
```

The neural network structure:
CBOW also uses a two-layer network, but with a key difference in how context is processed: every context word is looked up in the input matrix W, the resulting vectors are averaged into a single hidden representation, and that average is scored against the output matrix W' to predict the center word.
The averaging step:
Given context words {w_{t-c}, ..., w_{t+c}} (excluding wₜ), CBOW computes:
$$\hat{v} = \frac{1}{2c} \sum_{-c \leq j \leq c, j \neq 0} v_{w_{t+j}}$$
This averaged vector then predicts the center word via softmax:
$$P(w_t | \text{context}) = \frac{\exp({v'_{w_t}}^\top \hat{v})}{\sum_{w=1}^{V} \exp({v'_w}^\top \hat{v})}$$
The averaging operation means CBOW treats context as an unordered bag—the words 'dog bites man' and 'man bites dog' produce identical context representations. This is a deliberate simplification that trades positional information for training efficiency. The model learns from word co-occurrence patterns, not sequential structure.
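A minimal sketch of the CBOW forward pass (shapes and random initialization are illustrative assumptions) makes the averaging step, and its order-invariance, explicit:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10_000, 100
W = rng.normal(scale=0.01, size=(V, d))         # input matrix: context-word vectors
W_prime = rng.normal(scale=0.01, size=(V, d))   # output matrix: center-word scoring vectors

def cbow_probs(context_ids):
    """P(center word | context) from the averaged context representation."""
    v_hat = W[context_ids].mean(axis=0)         # average the 2c context vectors -> (d,)
    scores = W_prime @ v_hat                    # score every candidate center word
    scores -= scores.max()                      # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

context = [17, 256, 903, 4501]                  # arbitrary context word indices
p_forward = cbow_probs(context)
p_reversed = cbow_probs(context[::-1])
print(np.allclose(p_forward, p_reversed))       # True: word order never enters the model
```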
The choice between Skip-gram and CBOW depends on your data characteristics, computational constraints, and downstream task requirements. Each architecture makes different tradeoffs.
| Property | Skip-gram | CBOW |
|---|---|---|
| Training time (relative) | Slower (10x more updates) | Faster (1 update per window) |
| Rare word quality | Excellent | Poor to moderate |
| Frequent word quality | Good | Excellent |
| Semantic accuracy | Higher (analogies, similarities) | Moderate |
| Syntactic accuracy | Moderate | Higher (part-of-speech, morphology) |
| Recommended corpus size | Small to medium (< 1B words) | Large (> 1B words) |
| Context utilization | Each context word separately | All context words averaged together |
The gradient update perspective:
The key difference lies in how training updates flow:
Skip-gram: For each center word occurrence, generates 2c separate predictions. Each context word gets a direct gradient signal. A rare word appearing in context still receives meaningful updates.
CBOW: For each center word occurrence, generates 1 prediction from averaged context. Rare context words get diluted updates (their contribution is averaged with frequent words).
This explains why Skip-gram better represents rare words—they receive undiluted gradient updates when they appear in either center or context position.
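A toy count over the example sentence makes the difference in gradient traffic concrete; the window size and sentence are illustrative:

```python
# Skip-gram makes up to 2c predictions per position; CBOW makes one from the averaged context
sentence = "The quick brown fox jumps over the lazy dog".split()
c = 2

skipgram_updates = 0
cbow_updates = 0
for t in range(len(sentence)):
    context = [t + j for j in range(-c, c + 1)
               if j != 0 and 0 <= t + j < len(sentence)]
    skipgram_updates += len(context)   # one (center, context) prediction per neighbor
    cbow_updates += 1                  # one averaged prediction per position

print(skipgram_updates)  # 30 -- each context word receives its own gradient signal
print(cbow_updates)      # 9  -- rare context words are diluted inside the average
```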
For most NLP applications, Skip-gram with negative sampling (covered next) is the default choice. It produces high-quality embeddings for both frequent and rare words, and modern implementations are fast enough for large corpora. Use CBOW only when training speed is critical and your vocabulary has few rare words of importance.
The naive softmax formulation requires computing a sum over the entire vocabulary for every training example—clearly impractical for real-world vocabularies. Negative Sampling is the key optimization that made Word2Vec practical at scale.
The core idea:
Instead of predicting the probability of a context word among all vocabulary words, negative sampling reframes the task as binary classification: distinguish real (center, context) pairs from fake ones.
For each real pair (w_I, w_O), we generate k 'negative' examples by sampling random words that didn't appear in the context. The model learns to assign high probability to real pairs and low probability to fake pairs.
The negative sampling objective:
For a (center word, context word) pair, maximize:
$$\log \sigma({v'_{w_O}}^\top v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[\log \sigma(-{v'_{w_i}}^\top v_{w_I})\right]$$
where:

- σ(x) = 1 / (1 + e^{−x}) is the sigmoid function
- v_{w_I} is the input vector of the center word, and v'_{w_O} the output vector of the observed context word
- k is the number of negative samples (typically 5-20)
- P_n(w) is the noise distribution from which the negative words w_i are drawn
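The sketch below evaluates this objective (negated, as a loss) for a single training pair with k sampled negatives; the dimension and random vectors are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(v_center, v_context_out, v_negatives_out):
    """-log sigma(v'_O . v_I) - sum_i log sigma(-v'_i . v_I) for one (center, context) pair."""
    positive_term = np.log(sigmoid(v_context_out @ v_center))
    negative_term = np.sum(np.log(sigmoid(-(v_negatives_out @ v_center))))
    return -(positive_term + negative_term)

rng = np.random.default_rng(0)
d, k = 100, 5
v_center = rng.normal(scale=0.1, size=d)               # v_{w_I}, a row of W
v_context_out = rng.normal(scale=0.1, size=d)          # v'_{w_O}, a row of W'
v_negatives_out = rng.normal(scale=0.1, size=(k, d))   # k rows of W' drawn from P_n(w)

print(negative_sampling_loss(v_center, v_context_out, v_negatives_out))
# Only k + 1 dot products per training pair, instead of V under the full softmax
```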
The noise distribution:
The original Word2Vec paper uses a modified unigram distribution:
$$P_n(w) = \frac{f(w)^{0.75}}{\sum_{w'} f(w')^{0.75}}$$
The 3/4 power (0.75) flattens the distribution, giving rare words more chance of being sampled as negatives. This prevents the model from simply learning to predict against very frequent words.
```python
import numpy as np
from collections import Counter

def build_negative_sampling_distribution(word_counts, power=0.75):
    """
    Build the noise distribution for negative sampling.
    Uses the 3/4 power to smooth frequencies.
    """
    words = list(word_counts.keys())
    frequencies = np.array([word_counts[w] for w in words])

    # Apply power to flatten distribution
    adjusted_freqs = frequencies ** power

    # Normalize to probability distribution
    sampling_probs = adjusted_freqs / adjusted_freqs.sum()

    return words, sampling_probs

def sample_negatives(positive_word, words, sampling_probs, k=5):
    """
    Sample k negative words, excluding the positive context word.
    """
    negatives = []
    while len(negatives) < k:
        idx = np.random.choice(len(words), p=sampling_probs)
        if words[idx] != positive_word:
            negatives.append(words[idx])
    return negatives

# Example usage
word_counts = Counter({
    'the': 1000000, 'a': 800000, 'is': 600000,
    'cat': 5000, 'dog': 4500, 'elephant': 100,
    'algorithm': 2000, 'neural': 1500, 'network': 3000
})

words, probs = build_negative_sampling_distribution(word_counts)

print("Sampling distribution comparison:")
print(f"{'Word':<12} {'Count':<10} {'Raw P(w)':<12} {'Adjusted P(w)':<12}")
print("-" * 46)
for w in ['the', 'cat', 'elephant']:
    raw_prob = word_counts[w] / sum(word_counts.values())
    adj_prob = probs[words.index(w)]
    print(f"{w:<12} {word_counts[w]:<10} {raw_prob:<12.6f} {adj_prob:<12.6f}")

# Output shows how rare words get boosted probability
```

| Word | Frequency | Unigram P(w) | Adjusted P(w)^0.75 | Relative Change |
|---|---|---|---|---|
| 'the' | 1,000,000 | 41.7% | 28.2% | ↓ 32% |
| 'cat' | 5,000 | 0.21% | 0.45% | ↑ 114% |
| 'elephant' | 100 | 0.004% | 0.028% | ↑ 600% |
Negative sampling can be understood as an inexpensive substitute for the full softmax objective. It is mathematically motivated by Noise Contrastive Estimation (NCE), which avoids computing the expensive normalization term by training a classifier to distinguish observed data from noise samples; Word2Vec's formulation is a simplified variant of NCE optimized for embedding quality rather than density estimation.
Hierarchical Softmax is an alternative approximation to the full softmax that reduces the computational complexity from O(V) to O(log V). It organizes the vocabulary into a binary tree structure, where each leaf node represents a word.
The core idea:
Instead of computing probabilities for all V words, hierarchical softmax represents each word as a path from the root to a leaf in a binary tree. The probability of a word is the product of probabilities at each branch point along its path:
$$P(w | w_I) = \prod_{j=1}^{L(w)} \sigma\left([\![ n(w,j+1) = \text{left-child}(n(w,j)) ]\!] \cdot {v'_{n(w,j)}}^\top v_{w_I}\right)$$
where:

- L(w) is the number of branch points (internal nodes) on the path from the root to w's leaf
- n(w, j) is the j-th node on that path, with n(w, 1) the root and n(w, L(w)+1) the leaf for w
- [[x]] evaluates to 1 if x is true and −1 otherwise, flipping the sigmoid for right branches (since σ(−x) = 1 − σ(x))
- v'_{n(w,j)} is the vector attached to internal node n(w, j), and v_{w_I} is the input vector of the center word
Building the Huffman tree:
Word2Vec uses a Huffman tree, in which frequent words have shorter paths. This is optimal because Huffman coding minimizes the expected code length under the word-frequency distribution: the words that appear most often in the corpus require the fewest branch decisions, so the average per-token cost of evaluating these products is as small as possible.
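Here is a minimal sketch of how one word's probability would be assembled from its path; the path vectors and branch directions are made-up placeholders, whereas a real implementation derives them from the Huffman tree:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hierarchical_softmax_prob(v_center, path_node_vectors, path_directions):
    """
    Probability of one word as a product over its branch decisions.
    path_node_vectors: (L, d) vectors of the internal nodes on the word's path.
    path_directions:   (L,) entries, +1 for a left branch and -1 for a right branch.
    """
    scores = path_node_vectors @ v_center              # one dot product per branch point
    return float(np.prod(sigmoid(path_directions * scores)))

rng = np.random.default_rng(0)
d = 100
v_center = rng.normal(scale=0.1, size=d)
path_vectors = rng.normal(scale=0.1, size=(3, d))      # a frequent word: only 3 branch points
path_directions = np.array([+1, -1, +1])

print(hierarchical_softmax_prob(v_center, path_vectors, path_directions))
# Because sigma(x) + sigma(-x) = 1 at every branch, these probabilities sum
# to exactly 1 over all leaves, giving a properly normalized distribution.
```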
Comparison with negative sampling:
| Property | Negative Sampling | Hierarchical Softmax |
|---|---|---|
| Complexity per example | O(k·d), k ≈ 5-20 | O(log V · d) |
| Best for | Skip-gram, large corpora | CBOW, smaller vocabularies |
| Representation | Direct word vectors | Internal-node vectors + per-word tree paths |
| Quality on rare words | Better (direct updates) | Slightly worse (path-based) |
| Implementation | Simple | More complex |
Negative sampling is almost always preferred in practice. It's simpler to implement, typically produces slightly better embeddings, and the constant factor (k ≈ 5-20) is often smaller than log V for reasonable vocabularies. Use hierarchical softmax only when you need a proper probability distribution over words (negative sampling doesn't normalize).
Extremely common words like 'the,' 'a,' and 'is' present two problems for Word2Vec training: they carry little information about the words around them (nearly every word co-occurs with 'the'), and they dominate the stream of training pairs, consuming most of the computation while contributing little to embedding quality.
The subsampling technique:
Word2Vec probabilistically discards frequent words during training. Each word w is kept with probability:
$$P(\text{keep } w) = \sqrt{\frac{t}{f(w)}} + \frac{t}{f(w)}$$
where:

- t is the subsampling threshold (typically 10⁻⁵)
- f(w) is the frequency of w as a fraction of total corpus tokens
Words with frequency < t are always kept. Words with frequency >> t are heavily downsampled.
```python
import numpy as np

def compute_keep_probability(word_freq, threshold=1e-5):
    """
    Compute probability of keeping a word during training.
    Frequent words are heavily downsampled.
    """
    if word_freq < threshold:
        return 1.0  # Always keep rare words

    # Original Word2Vec formula
    prob = np.sqrt(threshold / word_freq) + threshold / word_freq
    return min(prob, 1.0)

# Example: Effect on common English words
word_frequencies = {
    'the': 0.07,            # ~7% of words
    'is': 0.01,             # ~1% of words
    'algorithm': 0.0001,    # 0.01%
    'eigenvalue': 0.000001  # 0.0001%
}

print("Subsampling effect (threshold = 1e-5):")
print(f"{'Word':<15} {'Frequency':<12} {'Keep Prob':<12} {'Effect'}")
print("-" * 55)

for word, freq in word_frequencies.items():
    prob = compute_keep_probability(freq)
    if prob < 0.05:
        effect = "Heavily downsampled"
    elif prob < 0.5:
        effect = "Moderately downsampled"
    elif prob < 1.0:
        effect = "Slightly downsampled"
    else:
        effect = "Always kept"
    print(f"{word:<15} {freq:<12.6f} {prob:<12.4f} {effect}")

# Output:
# Word            Frequency    Keep Prob    Effect
# -------------------------------------------------------
# the             0.070000     0.0121       Heavily downsampled
# is              0.010000     0.0326       Heavily downsampled
# algorithm       0.000100     0.4162       Moderately downsampled
# eigenvalue      0.000001     1.0000       Always kept
```

Subsampling has a dual benefit: (1) it speeds up training by processing fewer tokens, and (2) it widens effective context windows. When 'the' is removed from 'the quick brown fox,' the effective window now connects 'quick' more directly to words that were previously too far away. This strengthens meaningful word associations.
Training high-quality Word2Vec embeddings requires careful attention to hyperparameters. Each parameter involves tradeoffs between quality, speed, and memory.
| Parameter | Typical Range | Effect | Recommendation |
|---|---|---|---|
| Embedding dimension (d) | 50-300 | Higher = more expressive but slower; diminishing returns past ~300 | 100-200 for most tasks; 300 for demanding semantic tasks |
| Window size (c) | 2-10 | Larger = more syntactic; smaller = more semantic | 5 for balanced; 2-3 for semantic; 8-10 for syntactic |
| Negative samples (k) | 5-20 | More = stabler gradients but slower; diminishing returns past ~15 | 5-10 for large data; 15-20 for small data |
| Subsampling threshold (t) | 10^-5 to 10^-3 | Smaller = more aggressive downsampling of frequent words | 10^-5 for general use |
| Min word count | 5-100 | Words below threshold are excluded from vocabulary | 5-10 for large corpora; higher for small |
| Learning rate (α) | 0.01-0.05 | Higher = faster but less stable; typically decays during training | 0.025 initial, decaying to 0.0001 |
| Training epochs | 1-20 | More = better but diminishing returns; large corpora need fewer | 5 for large data; 10-20 for small |
```python
from gensim.models import Word2Vec
from gensim.models.callbacks import CallbackAny2Vec

class LossCallback(CallbackAny2Vec):
    """Track training progress."""
    def __init__(self):
        self.epoch = 0

    def on_epoch_end(self, model):
        self.epoch += 1
        print(f"Epoch {self.epoch} completed")

# Assume 'sentences' is an iterable of tokenized sentences
# sentences = [["the", "quick", "brown", "fox"], ["jumped", "over", "lazy", "dog"], ...]

# Skip-gram with negative sampling (recommended defaults)
sg_model = Word2Vec(
    sentences=sentences,
    vector_size=200,    # Embedding dimension
    window=5,           # Context window size
    min_count=5,        # Ignore words with < 5 occurrences
    sg=1,               # 1 for Skip-gram, 0 for CBOW
    negative=10,        # Number of negative samples
    sample=1e-5,        # Subsampling threshold
    epochs=5,           # Training epochs
    workers=4,          # Parallel threads
    callbacks=[LossCallback()]
)

# CBOW with hierarchical softmax (alternative)
cbow_model = Word2Vec(
    sentences=sentences,
    vector_size=200,
    window=5,
    min_count=5,
    sg=0,               # CBOW
    hs=1,               # Hierarchical softmax
    negative=0,         # Disable negative sampling
    epochs=5,
    workers=4
)

# Access trained embeddings
king_vector = sg_model.wv['king']  # 200-dimensional numpy array

# Semantic similarity
similarity = sg_model.wv.similarity('king', 'queen')
print(f"Similarity(king, queen) = {similarity:.4f}")

# Analogy: king - man + woman ≈ ?
result = sg_model.wv.most_similar(
    positive=['king', 'woman'],
    negative=['man'],
    topn=5
)
print("king - man + woman ≈", result[0][0])  # Should be 'queen'

# Save and load models
sg_model.save("word2vec_skipgram.model")
loaded_model = Word2Vec.load("word2vec_skipgram.model")
```

Vocabulary issues: Words not in the training vocabulary get no embedding. For out-of-vocabulary words, consider FastText (covered later) which handles subwords.
Overfitting on small data: With fewer than ~1 million tokens, consider using pre-trained embeddings and fine-tuning rather than training from scratch.
Non-determinism: Word2Vec uses random initialization and random negative sampling. Set seeds for reproducibility if needed.
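Two of these pitfalls can be handled directly with gensim. The snippet below is a sketch: the pre-trained model name comes from the gensim-data catalog, and fully deterministic runs additionally require a fixed PYTHONHASHSEED environment variable:

```python
import gensim.downloader as api
from gensim.models import Word2Vec

# Small corpus: start from pre-trained vectors rather than training from scratch.
# "glove-wiki-gigaword-100" is one of the vector sets distributed via gensim-data.
pretrained = api.load("glove-wiki-gigaword-100")   # returns a KeyedVectors object
print(pretrained.most_similar("king", topn=3))

# Reproducible training: fix the seed and use a single worker thread,
# since multi-threaded scheduling makes update order nondeterministic.
toy_sentences = [["the", "quick", "brown", "fox"], ["the", "lazy", "dog"]]
model = Word2Vec(toy_sentences, vector_size=50, min_count=1, sg=1,
                 seed=42, workers=1, epochs=5)
```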
What exactly does Word2Vec learn? The embedding space exhibits remarkable structure that emerges purely from co-occurrence statistics:
Linear substructures:
Word2Vec embeddings capture semantic and syntactic relationships as approximately linear transformations. The classic examples:
```python
import numpy as np

def analogy(model, a, b, c, topn=5):
    """
    Solve analogy: a is to b as c is to ?
    Returns the words whose vectors are closest to (b - a + c).
    """
    try:
        # Compute the target vector
        # We want: a:b :: c:?
        # So: ? = b - a + c
        result = model.wv.most_similar(
            positive=[b, c],
            negative=[a],
            topn=topn
        )
        return result
    except KeyError as e:
        return f"Word not in vocabulary: {e}"

# Semantic analogies
print("Semantic Analogies:")
print(f"king:queen :: man:? → {analogy(model, 'king', 'queen', 'man')[0][0]}")
print(f"France:Paris :: Italy:? → {analogy(model, 'France', 'Paris', 'Italy')[0][0]}")
print(f"big:bigger :: small:? → {analogy(model, 'big', 'bigger', 'small')[0][0]}")

# Syntactic analogies
print("Syntactic Analogies:")
print(f"walk:walked :: run:? → {analogy(model, 'walk', 'walked', 'run')[0][0]}")
print(f"slow:slowly :: quick:? → {analogy(model, 'slow', 'slowly', 'quick')[0][0]}")

# Explore the neighborhood of a word
def explore_neighborhood(model, word, topn=10):
    """Show the nearest neighbors of a word in embedding space."""
    neighbors = model.wv.most_similar(word, topn=topn)
    print(f"Nearest neighbors of '{word}':")
    for neighbor, similarity in neighbors:
        print(f"  {neighbor}: {similarity:.4f}")

explore_neighborhood(model, 'python')
# Expected output might include: java, ruby, perl, programming, ...
```

Why linear relationships emerge:
The linear structure isn't explicitly encoded; it emerges from the training objective. When Skip-gram maximizes log P(context | center) using dot products, it implicitly learns that words with similar context distributions must receive similar vectors, and that a consistent difference between two words' contexts shows up as a consistent offset between their vectors.
If 'king' and 'queen' differ primarily in the gender of associated contexts (he/she, his/her), while 'man' and 'woman' share this same pattern, then the vectors will naturally organize such that (king - queen) ≈ (man - woman).
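You can check this offset structure directly on a trained model; the sketch below assumes `model` is a trained Word2Vec instance (such as sg_model from the earlier gensim example) and that all four words made it into the vocabulary:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# If gender is encoded as a consistent direction, the two difference
# vectors should point roughly the same way in embedding space.
offset_royal = model.wv['king'] - model.wv['queen']
offset_plain = model.wv['man'] - model.wv['woman']
print(f"cosine(king - queen, man - woman) = {cosine(offset_royal, offset_plain):.3f}")
# A value well above zero reflects the parallel-offset structure described above
```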
While compelling, the analogy evaluations have caveats: (1) They work for some relationships better than others, (2) They're sensitive to how the evaluation set is constructed, (3) They measure a narrow aspect of semantic knowledge. Don't over-interpret analogy accuracy as a complete measure of embedding quality.
Word2Vec fundamentally changed how we represent words in machine learning. Let's consolidate the key concepts:

- Skip-gram predicts context words from a center word; it trains more slowly but represents rare words well and excels at semantic tasks.
- CBOW predicts a center word from its averaged context; it trains faster and favors frequent words and syntactic regularities.
- Negative sampling replaces the full softmax with binary classification against k sampled noise words, drawn from a unigram distribution raised to the 3/4 power.
- Hierarchical softmax reduces the per-example cost to O(log V) using a Huffman tree and yields a properly normalized distribution.
- Subsampling of frequent words speeds up training and widens effective context windows.
- The learned space exhibits approximately linear semantic and syntactic structure, enabling analogies through vector arithmetic.
What's next:
Now that we understand how Word2Vec produces individual word embeddings, the next pages explore how to use these embeddings for representing documents and sentences. We'll cover Average Word2Vec (simple averaging) and TF-IDF weighted Word2Vec (weighted averaging that respects word importance).
You now have a comprehensive understanding of Word2Vec's architecture, training procedures, and the remarkable properties of learned embeddings. These concepts form the foundation for all modern word embedding techniques and remain central to practical NLP applications.