Word2Vec learns word embeddings by predicting context words from center words (or vice versa)—a local, window-based approach. But there's a fundamentally different way to think about word relationships: through the global statistics of word co-occurrence.
GloVe (Global Vectors for Word Representation), developed by Pennington, Socher, and Manning at Stanford in 2014, takes this alternative path. Instead of sliding windows through text, GloVe first constructs a global co-occurrence matrix from the entire corpus, then learns embeddings that best explain these co-occurrence patterns.
Remarkably, both approaches produce embeddings with similar properties—words cluster by meaning, and semantic relationships emerge as vector arithmetic. But GloVe's global perspective offers different tradeoffs and insights.
By the end of this page, you will understand:
- GloVe's theoretical foundation in log-bilinear regression on co-occurrence counts
- the weighted least squares objective and its rationale
- how GloVe unifies count-based and prediction-based methods
- practical considerations for training and using GloVe embeddings
- when to choose GloVe over Word2Vec (and vice versa)
GloVe's foundation is the word co-occurrence matrix X, where X_{ij} counts how often word j appears in the context of word i. This matrix captures global corpus statistics.
Building the co-occurrence matrix:
For each word i in the vocabulary, we scan a context window around all its occurrences and increment X_{ij} for each context word j. Weights can be distance-dependent:
$$X_{ij} = \sum_{\text{occurrences of } i} \;\; \sum_{\substack{k=-\text{window} \\ k \neq 0}}^{\text{window}} \frac{1}{|k|} \cdot \mathbf{1}[\text{word at position } k \text{ is } j]$$
The 1/|k| weighting gives closer words higher weight, similar to Word2Vec's implicit weighting.
```python
import numpy as np
from collections import Counter
from scipy.sparse import lil_matrix, csr_matrix
from typing import List, Dict


def build_cooccurrence_matrix(
    corpus: List[List[str]],
    vocab: Dict[str, int],
    window_size: int = 10,
    distance_weighting: bool = True
) -> csr_matrix:
    """
    Build word co-occurrence matrix from corpus.

    Args:
        corpus: List of tokenized sentences/documents
        vocab: Word to index mapping
        window_size: Context window size on each side
        distance_weighting: Weight by 1/distance (like GloVe)

    Returns:
        Sparse co-occurrence matrix of shape (vocab_size, vocab_size)
    """
    vocab_size = len(vocab)

    # Use lil_matrix for efficient incremental updates
    cooc = lil_matrix((vocab_size, vocab_size), dtype=np.float64)

    for sentence in corpus:
        # Filter to words in vocabulary
        indices = [vocab[w] for w in sentence if w in vocab]

        for center_pos, center_idx in enumerate(indices):
            # Look at the context window around the center word
            for offset in range(-window_size, window_size + 1):
                if offset == 0:
                    continue

                context_pos = center_pos + offset
                if 0 <= context_pos < len(indices):
                    context_idx = indices[context_pos]

                    # Weight by distance
                    weight = 1.0 / abs(offset) if distance_weighting else 1.0
                    cooc[center_idx, context_idx] += weight

    return cooc.tocsr()


def analyze_cooccurrence_matrix(cooc: csr_matrix, vocab: Dict[str, int]):
    """Analyze properties of the co-occurrence matrix."""
    # Reverse vocabulary for lookup
    idx_to_word = {v: k for k, v in vocab.items()}

    # Basic statistics
    nnz = cooc.nnz
    density = nnz / (cooc.shape[0] ** 2)

    print("Co-occurrence matrix statistics:")
    print(f"  Shape: {cooc.shape}")
    print(f"  Non-zero entries: {nnz:,}")
    print(f"  Density: {density:.4%}")
    print(f"  Total co-occurrences: {cooc.sum():,.0f}")

    # Find most frequent co-occurrences
    cooc_dense = cooc.toarray()
    top_indices = np.unravel_index(
        np.argsort(cooc_dense.ravel())[-10:],
        cooc_dense.shape
    )

    print("Top co-occurrences:")
    for i, j in zip(top_indices[0][::-1], top_indices[1][::-1]):
        w1, w2 = idx_to_word.get(i, '?'), idx_to_word.get(j, '?')
        count = cooc_dense[i, j]
        print(f"  ({w1}, {w2}): {count:.2f}")


# Example usage
corpus = [
    "the quick brown fox jumps over the lazy dog".split(),
    "machine learning algorithms process large datasets".split(),
    "the neural network learns patterns from data".split(),
]

# Build vocabulary (normally from the full corpus)
all_words = [w for sent in corpus for w in sent]
word_counts = Counter(all_words)
vocab = {word: idx for idx, (word, _) in enumerate(word_counts.most_common())}

# Build matrix
cooc = build_cooccurrence_matrix(corpus, vocab, window_size=5)
analyze_cooccurrence_matrix(cooc, vocab)
```

The co-occurrence matrix X is typically: (1) symmetric if we don't distinguish direction (X_{ij} = X_{ji}), (2) sparse — most word pairs never co-occur, and (3) heavy-tailed — a few pairs (involving "the", "a", and the like) have very high counts, while most pairs have zero or small counts. GloVe's weighting function addresses this imbalance.
The brilliant insight behind GloVe is that ratios of co-occurrence probabilities encode semantic meaning more clearly than raw probabilities.
Define P(k|i) as the probability that word k appears in the context of word i:
$$P(k|i) = \frac{X_{ik}}{X_i}$$
where X_i = Σ_k X_{ik} is the total co-occurrence for word i.
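To connect the formula to the matrix built above, here is a minimal sketch (function name is mine) that estimates P(k|i) from the `cooc` and `vocab` objects constructed in the earlier example:

```python
from scipy.sparse import csr_matrix


def context_probability(cooc: csr_matrix, vocab: dict, word_i: str, word_k: str) -> float:
    """Estimate P(k | i) = X_ik / X_i from the co-occurrence matrix."""
    i, k = vocab[word_i], vocab[word_k]
    X_i = cooc[i].sum()  # total (weighted) co-occurrence mass for word i
    return cooc[i, k] / X_i if X_i > 0 else 0.0


# On the toy corpus these estimates are very noisy; on a large corpus they
# behave like the ice/steam example discussed next.
for k in ['quick', 'lazy', 'network']:
    print(f"P({k} | the) = {context_probability(cooc, vocab, 'the', k):.3f}")
```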
Consider the words "ice" and "steam", and compare how they relate to different context words:
| Probability | k = solid | k = gas | k = water | k = fashion |
|---|---|---|---|---|
| P(k|ice) | 1.9 × 10⁻⁴ | 6.6 × 10⁻⁵ | 3.0 × 10⁻³ | 1.7 × 10⁻⁵ |
| P(k|steam) | 2.2 × 10⁻⁵ | 7.8 × 10⁻⁴ | 2.2 × 10⁻³ | 1.8 × 10⁻⁵ |
| P(k|ice) / P(k|steam) | 8.9 (large) | 8.5 × 10⁻² (small) | 1.36 (~1) | 0.96 (~1) |
Interpreting the ratio: a large ratio (solid) picks out a word related to "ice" but not "steam"; a small ratio (gas) picks out a word related to "steam" but not "ice"; a ratio near 1 (water, fashion) indicates a word related to both or to neither. The ratio cancels the words' base frequencies and isolates exactly the dimension along which "ice" and "steam" differ.
The GloVe hypothesis:
If word embeddings capture meaning, then the ratio P(k|i)/P(k|j) should be expressible as some function of the word vectors:
$$F(w_i, w_j, \tilde{w}_k) = \frac{P(k|i)}{P(k|j)}$$
GloVe argues that F should translate differences between word vectors into ratios of co-occurrence probabilities, which forces F to be exponential:
$$F = \exp((w_i - w_j)^T \tilde{w}_k)$$
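A condensed sketch of how this leads to the training objective (following the structure of the original paper's derivation): taking logarithms, the ratio constraint is satisfied if each term obeys

$$w_i^T \tilde{w}_k = \log P(k|i) = \log X_{ik} - \log X_i$$

The term log X_i does not depend on k, so it can be absorbed into a bias b_i; adding a symmetric context bias b̃_k keeps word and context roles interchangeable, leaving

$$w_i^T \tilde{w}_k + b_i + \tilde{b}_k \approx \log X_{ik}$$

Penalizing the squared residual of this approximation, weighted by a function f of the raw count, gives exactly the objective introduced below.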
Raw co-occurrence counts are dominated by frequent words ('the', 'of', 'to'). Ratios normalize away these base rates, revealing the differential relationships that actually distinguish word meanings, much as the IDF factor in TF-IDF downweights ubiquitous terms to surface the informative ones.
Working through the mathematical derivation leads to GloVe's objective function—a weighted least squares regression on log co-occurrences:
$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$
where w_i and w̃_j are the word and context vectors, b_i and b̃_j are scalar bias terms, f is a weighting function that limits the influence of very frequent pairs, and V is the vocabulary size. The sum runs only over pairs with X_{ij} > 0, so the logarithm is always defined.
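As a concrete reading of the formula, here is a minimal sketch (function and variable names are mine) that evaluates J over the non-zero entries of a sparse co-occurrence matrix; the full training loop appears later on this page:

```python
import numpy as np
from scipy.sparse import csr_matrix


def glove_objective(W, W_tilde, b, b_tilde, cooc: csr_matrix,
                    x_max: float = 100.0, alpha: float = 0.75) -> float:
    """Vectorized evaluation of J over the non-zero entries of X."""
    coo = cooc.tocoo()
    i, j, x = coo.row, coo.col, coo.data
    f = np.where(x < x_max, (x / x_max) ** alpha, 1.0)             # weighting f(X_ij)
    pred = np.sum(W[i] * W_tilde[j], axis=1) + b[i] + b_tilde[j]   # w_i^T w̃_j + b_i + b̃_j
    residual = pred - np.log(x)
    return float(np.sum(f * residual ** 2))


# Random parameters on the toy matrix built earlier give a (large) starting loss
V, d = cooc.shape[0], 50
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, d))
W_tilde = rng.normal(scale=0.1, size=(V, d))
print(f"J at random initialization: {glove_objective(W, W_tilde, np.zeros(V), np.zeros(V), cooc):.2f}")
```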
Key components:
1. The log co-occurrence target (log X_{ij}):
We predict log counts, not raw counts. This is crucial because co-occurrence counts are heavy-tailed and span several orders of magnitude; taking logs compresses that range into something a dot product can fit, and it turns ratios of co-occurrence probabilities into differences of dot products, which is exactly the structure the derivation above requires.
2. The bilinear form (w_i^T w̃_j):
The model predicts log co-occurrence as a dot product of word and context vectors, so words with similar co-occurrence distributions are driven toward similar vectors.
3. Bias terms (b_i + b̃_j):
Biases absorb word-specific effects like overall frequency. Without biases, frequent words would need artificially large vectors.
```python
import numpy as np
import matplotlib.pyplot as plt


def glove_weighting_function(x: np.ndarray, x_max: float = 100, alpha: float = 0.75) -> np.ndarray:
    """
    GloVe weighting function f(x).

    Properties:
    - f(0) = 0: zero counts contribute nothing
    - f(x) is non-decreasing: more counts = more weight
    - f(x) is bounded: very frequent pairs don't dominate

    Args:
        x: Co-occurrence counts
        x_max: Saturation threshold
        alpha: Power parameter (the paper uses 3/4)

    Returns:
        Weights in [0, 1]
    """
    # f(x) = (x/x_max)^alpha if x < x_max, else 1
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)


# Visualize the weighting function
x = np.linspace(0, 150, 300)
y = glove_weighting_function(x)

plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.plot(x, y, 'b-', linewidth=2)
plt.axhline(y=1, color='r', linestyle='--', alpha=0.5, label='Maximum weight')
plt.axvline(x=100, color='g', linestyle='--', alpha=0.5, label='x_max')
plt.xlabel('Co-occurrence count X_ij')
plt.ylabel('Weight f(X_ij)')
plt.title('GloVe Weighting Function')
plt.legend()
plt.grid(True, alpha=0.3)

# Compare different alpha values
plt.subplot(1, 2, 2)
for alpha in [0.5, 0.75, 1.0]:
    y = glove_weighting_function(x, alpha=alpha)
    plt.plot(x, y, label=f'α = {alpha}')
plt.xlabel('Co-occurrence count X_ij')
plt.ylabel('Weight f(X_ij)')
plt.title('Effect of α parameter')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('glove_weighting.png', dpi=150)

# Why this weighting matters
print("Weighting function values at various counts:")
for count in [1, 5, 10, 50, 100, 500, 10000]:
    weight = glove_weighting_function(np.array([count]))[0]
    print(f"  f({count}) = {weight:.4f}")

# Output:
# f(1) = 0.0316
# f(5) = 0.1057
# f(10) = 0.1778
# f(50) = 0.5946
# f(100) = 1.0000
# f(500) = 1.0000
# f(10000) = 1.0000
```

Without the weighting function, extremely frequent pairs ('the'-'of') would dominate the loss, forcing the model to fit these perfectly at the expense of rare but informative pairs. The weighting function caps the influence of frequent pairs while still giving some weight to moderately frequent ones.
Training GloVe differs significantly from Word2Vec. Instead of stochastic updates from text windows, GloVe performs batch optimization on the co-occurrence matrix.
Training procedure (implemented in the sketch below):
1. Build the co-occurrence matrix X in a single pass over the corpus.
2. Initialize word vectors W, context vectors W̃, and the bias terms.
3. Repeatedly sweep over the non-zero entries of X in random order, updating parameters to reduce the weighted squared error (the reference implementation and the code below use AdaGrad).
4. After training, take W + W̃ (or just W) as the final embeddings.
```python
import numpy as np
from scipy.sparse import csr_matrix
from typing import Tuple


class GloVeTrainer:
    """
    GloVe training implementation.
    """

    def __init__(
        self,
        embedding_dim: int = 100,
        x_max: float = 100.0,
        alpha: float = 0.75,
        learning_rate: float = 0.05,
        epochs: int = 50
    ):
        self.embedding_dim = embedding_dim
        self.x_max = x_max
        self.alpha = alpha
        self.learning_rate = learning_rate
        self.epochs = epochs

        self.W = None        # Word embeddings
        self.W_tilde = None  # Context embeddings
        self.b = None        # Word bias
        self.b_tilde = None  # Context bias

    def _weighting_function(self, x: np.ndarray) -> np.ndarray:
        """GloVe weighting function."""
        return np.where(x < self.x_max, (x / self.x_max) ** self.alpha, 1.0)

    def _initialize_parameters(self, vocab_size: int):
        """Random initialization of parameters."""
        # Initialize uniformly in [-0.5/dim, 0.5/dim]
        bound = 0.5 / self.embedding_dim
        self.W = np.random.uniform(-bound, bound, (vocab_size, self.embedding_dim))
        self.W_tilde = np.random.uniform(-bound, bound, (vocab_size, self.embedding_dim))
        self.b = np.zeros(vocab_size)
        self.b_tilde = np.zeros(vocab_size)

        # AdaGrad accumulators
        self.grad_W = np.ones_like(self.W)
        self.grad_W_tilde = np.ones_like(self.W_tilde)
        self.grad_b = np.ones_like(self.b)
        self.grad_b_tilde = np.ones_like(self.b_tilde)

    def fit(self, cooc_matrix: csr_matrix) -> Tuple[np.ndarray, np.ndarray]:
        """
        Train GloVe embeddings on co-occurrence matrix.

        Args:
            cooc_matrix: Sparse co-occurrence matrix

        Returns:
            Tuple of (word_embeddings, context_embeddings)
        """
        vocab_size = cooc_matrix.shape[0]
        self._initialize_parameters(vocab_size)

        # Get non-zero entries
        rows, cols = cooc_matrix.nonzero()
        data = np.array(cooc_matrix[rows, cols]).flatten()

        # Pre-compute weights and log targets
        weights = self._weighting_function(data)
        log_cooc = np.log(data)

        n_samples = len(rows)

        for epoch in range(self.epochs):
            # Shuffle training samples
            perm = np.random.permutation(n_samples)
            total_loss = 0.0

            for idx in perm:
                i, j = rows[idx], cols[idx]
                weight = weights[idx]
                log_target = log_cooc[idx]

                # Compute prediction: w_i^T w̃_j + b_i + b̃_j
                prediction = np.dot(self.W[i], self.W_tilde[j]) + self.b[i] + self.b_tilde[j]

                # Compute error
                diff = prediction - log_target
                loss = weight * diff ** 2
                total_loss += loss

                # Compute gradients
                grad_common = weight * diff
                grad_w = grad_common * self.W_tilde[j]
                grad_w_tilde = grad_common * self.W[i]
                grad_b = grad_common
                grad_b_tilde = grad_common

                # AdaGrad accumulators
                self.grad_W[i] += grad_w ** 2
                self.grad_W_tilde[j] += grad_w_tilde ** 2
                self.grad_b[i] += grad_b ** 2
                self.grad_b_tilde[j] += grad_b_tilde ** 2

                # Update parameters
                self.W[i] -= self.learning_rate * grad_w / np.sqrt(self.grad_W[i])
                self.W_tilde[j] -= self.learning_rate * grad_w_tilde / np.sqrt(self.grad_W_tilde[j])
                self.b[i] -= self.learning_rate * grad_b / np.sqrt(self.grad_b[i])
                self.b_tilde[j] -= self.learning_rate * grad_b_tilde / np.sqrt(self.grad_b_tilde[j])

            avg_loss = total_loss / n_samples
            if (epoch + 1) % 10 == 0:
                print(f"Epoch {epoch + 1}/{self.epochs}, Loss: {avg_loss:.6f}")

        return self.W, self.W_tilde

    def get_embeddings(self, combine: bool = True) -> np.ndarray:
        """
        Get final word embeddings.

        Args:
            combine: If True, return W + W_tilde (recommended by the GloVe paper)
        """
        if combine:
            return self.W + self.W_tilde
        return self.W


# Example usage (with small vocabulary for demonstration)
# In practice, use the official GloVe C implementation for large corpora

trainer = GloVeTrainer(
    embedding_dim=50,
    learning_rate=0.05,
    epochs=100
)

# Train on the co-occurrence matrix built earlier (`cooc`)
W, W_tilde = trainer.fit(cooc)

# Get final embeddings
embeddings = trainer.get_embeddings(combine=True)
print(f"Learned embeddings shape: {embeddings.shape}")
```

For production use, the official GloVe C implementation is highly optimized; the Python implementation above is for understanding and runs orders of magnitude slower. Pre-trained GloVe vectors (Wikipedia, Common Crawl) are freely available and excellent for most applications.
GloVe and Word2Vec represent different philosophical approaches to learning word representations. Understanding their differences informs practical choices.
| Aspect | Word2Vec (Skip-gram) | GloVe |
|---|---|---|
| Approach | Prediction-based (local) | Count-based (global) |
| Training data | Stream of (word, context) pairs | Co-occurrence matrix |
| Objective | Maximize log probability of context | Minimize weighted MSE on log counts |
| Optimization | SGD on streaming pairs | SGD on matrix entries |
| Memory | Low (stream processing) | High (store full matrix) |
| Training time | Scales with corpus size | Scales with the number of non-zero entries in X (after one pass to build X) |
| Parallelization | Data parallelism (training pairs) | Easy parallelism (matrix rows independent) |
Despite the theoretical differences, empirical studies show GloVe and Word2Vec produce embeddings of comparable quality on most downstream tasks. The choice often comes down to practical factors: memory constraints favor Word2Vec; available pre-trained vectors may favor GloVe; need for incremental updates favors Word2Vec.
Stanford provides pre-trained GloVe vectors on several corpora. These are the most common starting points:
Available Pre-trained Vectors:
| Name | Corpus | Vocabulary | Dimensions | Size |
|---|---|---|---|---|
| glove.6B | Wikipedia + Gigaword | 400K | 50, 100, 200, 300 | 66MB-1GB |
| glove.42B | Common Crawl 42B tokens | 1.9M | 300 | 5.3GB |
| glove.840B | Common Crawl 840B tokens | 2.2M | 300 | 5.6GB |
| glove.twitter.27B | Twitter, 27B tokens | 1.2M | 25, 50, 100, 200 | 1.5GB |
```python
import numpy as np
from typing import Dict, Tuple

import gensim.downloader as api


# Option 1: Load via gensim (easiest)
def load_glove_via_gensim(name: str = 'glove-wiki-gigaword-100'):
    """
    Load GloVe vectors via gensim's downloader.

    Available models:
    - glove-wiki-gigaword-50
    - glove-wiki-gigaword-100
    - glove-wiki-gigaword-200
    - glove-wiki-gigaword-300
    - glove-twitter-25
    - glove-twitter-50
    - glove-twitter-100
    - glove-twitter-200
    """
    print(f"Loading {name}...")
    model = api.load(name)
    print(f"Loaded {len(model)} vectors of dimension {model.vector_size}")
    return model


# Option 2: Load from file directly (for raw GloVe files)
def load_glove_from_file(filepath: str) -> Tuple[Dict[str, np.ndarray], int]:
    """
    Load GloVe vectors from raw text file.

    Format: word dim1 dim2 dim3 ... dimN
    """
    word_vectors = {}
    embedding_dim = None

    print(f"Loading GloVe from {filepath}...")
    with open(filepath, 'r', encoding='utf-8') as f:
        for line_num, line in enumerate(f):
            values = line.rstrip().split(' ')
            word = values[0]

            try:
                vector = np.array([float(x) for x in values[1:]])
                if embedding_dim is None:
                    embedding_dim = len(vector)
                word_vectors[word] = vector
            except ValueError:
                print(f"Skipping malformed line {line_num}: {word}")
                continue

            if (line_num + 1) % 100000 == 0:
                print(f"  Loaded {line_num + 1:,} vectors...")

    print(f"Loaded {len(word_vectors):,} vectors of dimension {embedding_dim}")
    return word_vectors, embedding_dim


# Option 3: Convert to gensim format for compatibility
def convert_glove_to_gensim_format(glove_file: str, output_file: str):
    """
    Convert a raw GloVe file to gensim KeyedVectors format.

    Adds the header line required by gensim.
    """
    from gensim.models import KeyedVectors
    from gensim.scripts.glove2word2vec import glove2word2vec

    # Convert format
    temp_file = output_file + '.tmp'
    glove2word2vec(glove_file, temp_file)

    # Load and save in native format
    model = KeyedVectors.load_word2vec_format(temp_file, binary=False)
    model.save(output_file)

    import os
    os.remove(temp_file)

    return model


# Usage example
# Method 1: Via gensim downloader
glove = load_glove_via_gensim('glove-wiki-gigaword-100')

# Test the embeddings
print("Word similarity:")
print(f"similarity('king', 'queen') = {glove.similarity('king', 'queen'):.4f}")
print(f"similarity('cat', 'dog') = {glove.similarity('cat', 'dog'):.4f}")
print(f"similarity('cat', 'car') = {glove.similarity('cat', 'car'):.4f}")

print("Word analogies:")
result = glove.most_similar(positive=['king', 'woman'], negative=['man'], topn=3)
print(f"king - man + woman = {result}")

# Nearest neighbors
print("Nearest neighbors of 'python':")
for word, sim in glove.most_similar('python', topn=5):
    print(f"  {word}: {sim:.4f}")
```

For general NLP: use glove.6B (300d) or glove.840B — trained on clean, diverse text. For social media: use glove.twitter — it includes informal language, slang, and hashtags. For limited compute: use the 100d or 200d versions — often 90-95% as effective as 300d at half the size.
When using GloVe embeddings in applications, several practical factors affect performance:
```python
import numpy as np
from typing import Optional


class GloVeEmbedder:
    """
    Production-ready GloVe embedding wrapper with best practices.
    """

    def __init__(
        self,
        word_vectors,
        normalize: bool = True,
        lowercase: bool = True,
        oov_strategy: str = 'zero'  # 'zero', 'random', 'mean'
    ):
        self.word_vectors = word_vectors
        self.normalize = normalize
        self.lowercase = lowercase
        self.oov_strategy = oov_strategy
        self.dim = word_vectors.vector_size

        # Precompute mean vector for the 'mean' OOV strategy
        if oov_strategy == 'mean':
            sample_size = min(10000, len(word_vectors))
            sample_words = list(word_vectors.key_to_index.keys())[:sample_size]
            sample_vecs = [word_vectors[w] for w in sample_words]
            self.mean_vector = np.mean(sample_vecs, axis=0)
        else:
            self.mean_vector = None

        # Cache for normalized vectors
        if normalize:
            self._normalized_cache = {}

    def get_word_vector(self, word: str) -> np.ndarray:
        """Get embedding for a single word with OOV handling."""
        if self.lowercase:
            word = word.lower()

        if word in self.word_vectors:
            vec = self.word_vectors[word]
            if self.normalize:
                if word not in self._normalized_cache:
                    norm = np.linalg.norm(vec)
                    self._normalized_cache[word] = vec / norm if norm > 0 else vec
                return self._normalized_cache[word]
            return vec

        # OOV handling
        if self.oov_strategy == 'zero':
            return np.zeros(self.dim)
        elif self.oov_strategy == 'random':
            # Small random vector
            return np.random.randn(self.dim) * 0.01
        elif self.oov_strategy == 'mean':
            return self.mean_vector
        else:
            raise ValueError(f"Unknown OOV strategy: {self.oov_strategy}")

    def get_document_vector(
        self,
        words: list,
        weights: Optional[np.ndarray] = None
    ) -> np.ndarray:
        """Get average embedding for a document with optional weighting."""
        vectors = []
        actual_weights = []

        for i, word in enumerate(words):
            vec = self.get_word_vector(word)
            if not np.allclose(vec, 0):  # Skip zero (OOV) vectors
                vectors.append(vec)
                if weights is not None:
                    actual_weights.append(weights[i])

        if not vectors:
            return np.zeros(self.dim)

        vectors = np.array(vectors)

        if weights is not None and actual_weights:
            actual_weights = np.array(actual_weights)
            avg = np.average(vectors, weights=actual_weights, axis=0)
        else:
            avg = np.mean(vectors, axis=0)

        if self.normalize:
            norm = np.linalg.norm(avg)
            if norm > 0:
                avg = avg / norm

        return avg

    def get_coverage(self, words: list) -> float:
        """Compute vocabulary coverage for a list of words."""
        covered = sum(
            1 for w in words
            if (w.lower() if self.lowercase else w) in self.word_vectors
        )
        return covered / len(words) if words else 0.0


# Example usage
embedder = GloVeEmbedder(
    glove,
    normalize=True,
    lowercase=True,
    oov_strategy='zero'
)

# Single word
king_vec = embedder.get_word_vector('King')  # Will be lowercased
print(f"'King' vector norm: {np.linalg.norm(king_vec):.4f}")  # Should be 1.0

# Document
doc = "The neural network learns patterns from data".split()
doc_vec = embedder.get_document_vector(doc)
print(f"Document vector norm: {np.linalg.norm(doc_vec):.4f}")

# Coverage check
technical_doc = "BERT transformer attention mechanism self-supervised".split()
coverage = embedder.get_coverage(technical_doc)
print(f"Vocabulary coverage: {coverage:.1%}")
```

Several advanced techniques extend the basic GloVe approach:
1. Retrofitting to Lexical Resources:
Post-processing GloVe vectors to inject knowledge from external resources (WordNet, FrameNet). The retrofitting objective:
$$\Psi = \sum_{i=1}^{n} \left[ \alpha_i \|\hat{w}_i - w_i\|^2 + \sum_{j:(i,j)\in E} \beta_{ij} \|\hat{w}_i - \hat{w}_j\|^2 \right]$$
This pulls related words (from knowledge base edges E) closer together while staying near the original embeddings.
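A minimal sketch of the iterative retrofitting update (the closed-form per-word step from Faruqui et al.; the `lexicon` dictionary and the uniform α and β values are illustrative assumptions):

```python
import numpy as np


def retrofit(vectors: dict, lexicon: dict, iterations: int = 10,
             alpha: float = 1.0, beta: float = 1.0) -> dict:
    """Pull each vector toward its lexicon neighbours while staying near the original.

    vectors: word -> original embedding (np.ndarray)
    lexicon: word -> list of related words (the edges E, e.g. from WordNet)
    """
    new_vecs = {w: v.copy() for w, v in vectors.items()}
    for _ in range(iterations):
        for word, neighbours in lexicon.items():
            nbrs = [n for n in neighbours if n in new_vecs]
            if word not in new_vecs or not nbrs:
                continue
            # Closed-form minimizer of the retrofitting objective for this word
            numerator = alpha * vectors[word] + beta * sum(new_vecs[n] for n in nbrs)
            new_vecs[word] = numerator / (alpha + beta * len(nbrs))
    return new_vecs


# Toy usage with 3-d vectors (illustrative only)
vecs = {'happy': np.array([1.0, 0.0, 0.0]),
        'glad': np.array([0.0, 1.0, 0.0]),
        'sad': np.array([0.0, 0.0, 1.0])}
retro = retrofit(vecs, {'happy': ['glad'], 'glad': ['happy']})
print(retro['happy'])
```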
2. Debiasing:
Remove gender, racial, or other biases from embeddings. The classic approach identifies a "bias direction" and projects it out:
$$w_{\text{debiased}} = w - (w \cdot d) \cdot d$$
where d is the identified bias direction (e.g., from 'he'-'she' pairs).
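A minimal sketch of this projection, assuming `glove` is the gensim model loaded earlier; using a single he/she pair is a simplification of the original method, which averages several definitional pairs:

```python
import numpy as np


def debias(vec: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of vec along the (unit-normalized) bias direction."""
    d = direction / np.linalg.norm(direction)
    return vec - np.dot(vec, d) * d


# Bias direction from a single definitional pair (a simplification)
direction = glove['he'] - glove['she']
unit = direction / np.linalg.norm(direction)

for word in ['doctor', 'nurse', 'engineer']:
    before = np.dot(glove[word], unit)
    after = np.dot(debias(glove[word], direction), unit)
    print(f"{word}: projection onto he-she axis {before:+.3f} -> {after:+.3f}")
```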
3. Domain Adaptation:
Fine-tune pre-trained GloVe on a domain-specific corpus. Options include continuing training from the pre-trained vectors on a domain co-occurrence matrix, training domain vectors from scratch and combining them with the general-purpose ones, or learning a linear mapping that aligns the two spaces. A sketch of the first option follows.
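A minimal sketch of warm-started continued training, reusing `build_cooccurrence_matrix` and `GloVeTrainer` from earlier on this page; `domain_corpus` is a placeholder for your own tokenized documents, and the small subclass exists only because `fit()` as written re-initializes its parameters:

```python
from collections import Counter

# domain_corpus: a list of tokenized domain documents (placeholder)
word_counts = Counter(w for doc in domain_corpus for w in doc)
vocab = {w: i for i, (w, _) in enumerate(word_counts.most_common(20000))}

domain_cooc = build_cooccurrence_matrix(domain_corpus, vocab, window_size=10)


class WarmStartGloVeTrainer(GloVeTrainer):
    """Keeps warm-started parameters instead of re-initializing inside fit()."""
    def _initialize_parameters(self, vocab_size: int):
        if self.W is None:
            super()._initialize_parameters(vocab_size)


trainer = WarmStartGloVeTrainer(embedding_dim=100, epochs=25)
trainer._initialize_parameters(len(vocab))  # random init first

# Copy pre-trained vectors (the 100-d gensim model loaded earlier) where available
for word, idx in vocab.items():
    if word in glove:
        trainer.W[idx] = glove[word]
        trainer.W_tilde[idx] = glove[word]

W, W_tilde = trainer.fit(domain_cooc)
domain_embeddings = trainer.get_embeddings(combine=True)
```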
Word embeddings capture statistical patterns in training data, including societal biases. Famous examples: 'doctor' is closer to 'man', 'nurse' to 'woman'. For applications in hiring, lending, or other high-stakes domains, bias evaluation and mitigation is essential. See the Word Embedding Association Test (WEAT) for bias measurement.
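As a rough illustration of what WEAT measures, the sketch below computes a differential-association effect size from cosine similarities, again using the `glove` model loaded earlier; the word lists are illustrative, and the full test also reports a permutation-based p-value:

```python
import numpy as np


def cos(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


def association(w, A, B, vecs):
    """s(w, A, B): mean cosine similarity to attribute set A minus to set B."""
    return (np.mean([cos(vecs[w], vecs[a]) for a in A])
            - np.mean([cos(vecs[w], vecs[b]) for b in B]))


def weat_effect_size(X, Y, A, B, vecs):
    """Cohen's-d-style effect size over the two target sets X and Y."""
    s_X = [association(x, A, B, vecs) for x in X]
    s_Y = [association(y, A, B, vecs) for y in Y]
    return (np.mean(s_X) - np.mean(s_Y)) / np.std(s_X + s_Y)


# Illustrative word lists: career vs. family targets, male vs. female attributes
X = ['executive', 'management', 'salary', 'office']
Y = ['home', 'parents', 'children', 'family']
A = ['he', 'man', 'male', 'brother']
B = ['she', 'woman', 'female', 'sister']

print(f"WEAT effect size: {weat_effect_size(X, Y, A, B, glove):.3f}")
```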
We've explored GloVe, a count-based alternative to Word2Vec. The key insights: ratios of co-occurrence probabilities, not raw counts, carry the semantic signal; GloVe fits a log-bilinear model to the global co-occurrence matrix with a weighted least squares objective; the weighting function f(X_{ij}) keeps extremely frequent pairs from dominating the loss; the recommended final embeddings are W + W̃; and in practice GloVe and Word2Vec perform comparably, so the choice usually comes down to memory, available pre-trained vectors, and whether you need incremental updates.
What's next:
Both Word2Vec and GloVe treat each word as an atomic unit—they can't handle words not in their vocabulary. The next page covers FastText, which extends Word2Vec by representing words as bags of character n-grams. This enables embeddings for any word (even misspellings and neologisms) and often produces better representations for morphologically rich languages.
You now have a complete understanding of GloVe—from the theoretical foundation in co-occurrence ratios through the weighted least squares objective to practical usage of pre-trained vectors. Combined with Word2Vec knowledge, you can make informed choices about which embedding approach suits your specific application.