Word2Vec learns word embeddings by predicting context words from center words (or vice versa)—a local, window-based approach. But there's a fundamentally different way to think about word relationships: through the global statistics of word co-occurrence.
GloVe (Global Vectors for Word Representation), developed by Pennington, Socher, and Manning at Stanford in 2014, takes this alternative path. Instead of sliding windows through text, GloVe first constructs a global co-occurrence matrix from the entire corpus, then learns embeddings that best explain these co-occurrence patterns.
Remarkably, both approaches produce embeddings with similar properties—words cluster by meaning, and semantic relationships emerge as vector arithmetic. But GloVe's global perspective offers different tradeoffs and insights.
By the end of this page, you will understand:
- GloVe's theoretical foundation in log-bilinear regression on co-occurrence counts
- the weighted least squares objective and its rationale
- how GloVe unifies count-based and prediction-based methods
- practical considerations for training and using GloVe embeddings
- when to choose GloVe over Word2Vec (and vice versa)
GloVe's foundation is the word co-occurrence matrix X, where X_{ij} counts how often word j appears in the context of word i. This matrix captures global corpus statistics.
Building the co-occurrence matrix:
For each word i in the vocabulary, we scan a context window around all its occurrences and increment X_{ij} for each context word j. Weights can be distance-dependent:
$$X_{ij} = \sum_{\text{occurrences of } i} \;\; \sum_{\substack{k=-\text{window} \\ k \neq 0}}^{\text{window}} \frac{1}{|k|} \cdot \mathbf{1}[\text{word at position } k \text{ is } j]$$
The 1/|k| weighting gives closer words higher weight, similar to Word2Vec's implicit weighting.
```python
import numpy as np
from collections import Counter
from scipy.sparse import lil_matrix, csr_matrix
from typing import List, Dict


def build_cooccurrence_matrix(
    corpus: List[List[str]],
    vocab: Dict[str, int],
    window_size: int = 10,
    distance_weighting: bool = True
) -> csr_matrix:
    """
    Build word co-occurrence matrix from corpus.

    Args:
        corpus: List of tokenized sentences/documents
        vocab: Word to index mapping
        window_size: Context window size on each side
        distance_weighting: Weight by 1/distance (like GloVe)

    Returns:
        Sparse co-occurrence matrix of shape (vocab_size, vocab_size)
    """
    vocab_size = len(vocab)

    # Use lil_matrix for efficient incremental updates
    cooc = lil_matrix((vocab_size, vocab_size), dtype=np.float64)

    for sentence in corpus:
        # Filter to words in vocabulary
        indices = [vocab[w] for w in sentence if w in vocab]

        for center_pos, center_idx in enumerate(indices):
            # Look at the context window around the center word
            for offset in range(-window_size, window_size + 1):
                if offset == 0:
                    continue

                context_pos = center_pos + offset
                if 0 <= context_pos < len(indices):
                    context_idx = indices[context_pos]

                    # Weight by distance
                    weight = 1.0 / abs(offset) if distance_weighting else 1.0
                    cooc[center_idx, context_idx] += weight

    return cooc.tocsr()


def analyze_cooccurrence_matrix(cooc: csr_matrix, vocab: Dict[str, int]):
    """Analyze properties of the co-occurrence matrix."""
    # Reverse vocabulary for lookup
    idx_to_word = {v: k for k, v in vocab.items()}

    # Basic statistics
    nnz = cooc.nnz
    density = nnz / (cooc.shape[0] ** 2)

    print("Co-occurrence matrix statistics:")
    print(f"  Shape: {cooc.shape}")
    print(f"  Non-zero entries: {nnz:,}")
    print(f"  Density: {density:.4%}")
    print(f"  Total co-occurrences: {cooc.sum():,.0f}")

    # Find most frequent co-occurrences
    cooc_dense = cooc.toarray()
    top_indices = np.unravel_index(
        np.argsort(cooc_dense.ravel())[-10:],
        cooc_dense.shape
    )

    print("Top co-occurrences:")
    for i, j in zip(top_indices[0][::-1], top_indices[1][::-1]):
        w1, w2 = idx_to_word.get(i, '?'), idx_to_word.get(j, '?')
        count = cooc_dense[i, j]
        print(f"  ({w1}, {w2}): {count:.2f}")


# Example usage
corpus = [
    "the quick brown fox jumps over the lazy dog".split(),
    "machine learning algorithms process large datasets".split(),
    "the neural network learns patterns from data".split(),
]

# Build vocabulary (normally from the full corpus)
all_words = [w for sent in corpus for w in sent]
word_counts = Counter(all_words)
vocab = {word: idx for idx, (word, _) in enumerate(word_counts.most_common())}

# Build matrix
cooc = build_cooccurrence_matrix(corpus, vocab, window_size=5)
analyze_cooccurrence_matrix(cooc, vocab)
```

The co-occurrence matrix X is typically: (1) symmetric if we don't distinguish direction (X_{ij} = X_{ji}), (2) sparse — most word pairs never co-occur, and (3) heavy-tailed — a few pairs (involving "the", "a", and the like) have very high counts, while most pairs have zero or small counts. GloVe's weighting function addresses this imbalance.
The brilliant insight behind GloVe is that ratios of co-occurrence probabilities encode semantic meaning more clearly than raw probabilities.
Define P(k|i) as the probability that word k appears in the context of word i:
$$P(k|i) = \frac{X_{ik}}{X_i}$$
where X_i = Σ_k X_{ik} is the total co-occurrence for word i.
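To connect the formula to the matrix built above, here is a minimal sketch (function name is mine) that estimates P(k|i) from the `cooc` and `vocab` objects constructed in the earlier example:

```python
from scipy.sparse import csr_matrix


def context_probability(cooc: csr_matrix, vocab: dict, word_i: str, word_k: str) -> float:
    """Estimate P(k | i) = X_ik / X_i from the co-occurrence matrix."""
    i, k = vocab[word_i], vocab[word_k]
    X_i = cooc[i].sum()  # total (weighted) co-occurrence mass for word i
    return cooc[i, k] / X_i if X_i > 0 else 0.0


# On the toy corpus these estimates are very noisy; on a large corpus they
# behave like the ice/steam example discussed next.
for k in ['quick', 'lazy', 'network']:
    print(f"P({k} | the) = {context_probability(cooc, vocab, 'the', k):.3f}")
```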
Consider the words "ice" and "steam", and compare how they relate to different context words:
| Probability | k = solid | k = gas | k = water | k = fashion |
|---|---|---|---|---|
| P(k|ice) | 1.9 × 10⁻⁴ | 6.6 × 10⁻⁵ | 3.0 × 10⁻³ | 1.7 × 10⁻⁵ |
| P(k|steam) | 2.2 × 10⁻⁵ | 7.8 × 10⁻⁴ | 2.2 × 10⁻³ | 1.8 × 10⁻⁵ |
| P(k|ice) / P(k|steam) | 8.9 (large) | 8.5 × 10⁻² (small) | 1.36 (~1) | 0.96 (~1) |
Interpreting the ratio: a large ratio (solid) picks out a word related to "ice" but not "steam"; a small ratio (gas) picks out a word related to "steam" but not "ice"; a ratio near 1 (water, fashion) indicates a word related to both or to neither. The ratio cancels the words' base frequencies and isolates exactly the dimension along which "ice" and "steam" differ.
The GloVe hypothesis:
If word embeddings capture meaning, then the ratio P(k|i)/P(k|j) should be expressible as some function of the word vectors:
$$F(w_i, w_j, \tilde{w}_k) = \frac{P(k|i)}{P(k|j)}$$
GloVe argues that F should translate differences between word vectors into ratios of co-occurrence probabilities, which forces F to be exponential:
$$F = \exp((w_i - w_j)^T \tilde{w}_k)$$
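A condensed sketch of how this leads to the training objective (following the structure of the original paper's derivation): taking logarithms, the ratio constraint is satisfied if each term obeys

$$w_i^T \tilde{w}_k = \log P(k|i) = \log X_{ik} - \log X_i$$

The term log X_i does not depend on k, so it can be absorbed into a bias b_i; adding a symmetric context bias b̃_k keeps word and context roles interchangeable, leaving

$$w_i^T \tilde{w}_k + b_i + \tilde{b}_k \approx \log X_{ik}$$

Penalizing the squared residual of this approximation, weighted by a function f of the raw count, gives exactly the objective introduced below.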
Raw co-occurrence counts are dominated by frequent words ('the', 'of', 'to'). Ratios normalize away these base rates, revealing the differential relationships that actually distinguish word meanings, much as the IDF factor in TF-IDF downweights ubiquitous terms to surface the informative ones.
Working through the mathematical derivation leads to GloVe's objective function—a weighted least squares regression on log co-occurrences:
$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$
where w_i and w̃_j are the word and context vectors, b_i and b̃_j are scalar bias terms, f is a weighting function that limits the influence of very frequent pairs, and V is the vocabulary size. The sum runs only over pairs with X_{ij} > 0, so the logarithm is always defined.
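As a concrete reading of the formula, here is a minimal sketch (function and variable names are mine) that evaluates J over the non-zero entries of a sparse co-occurrence matrix; the full training loop appears later on this page:

```python
import numpy as np
from scipy.sparse import csr_matrix


def glove_objective(W, W_tilde, b, b_tilde, cooc: csr_matrix,
                    x_max: float = 100.0, alpha: float = 0.75) -> float:
    """Vectorized evaluation of J over the non-zero entries of X."""
    coo = cooc.tocoo()
    i, j, x = coo.row, coo.col, coo.data
    f = np.where(x < x_max, (x / x_max) ** alpha, 1.0)             # weighting f(X_ij)
    pred = np.sum(W[i] * W_tilde[j], axis=1) + b[i] + b_tilde[j]   # w_i^T w̃_j + b_i + b̃_j
    residual = pred - np.log(x)
    return float(np.sum(f * residual ** 2))


# Random parameters on the toy matrix built earlier give a (large) starting loss
V, d = cooc.shape[0], 50
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, d))
W_tilde = rng.normal(scale=0.1, size=(V, d))
print(f"J at random initialization: {glove_objective(W, W_tilde, np.zeros(V), np.zeros(V), cooc):.2f}")
```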
Key components:
1. The log co-occurrence target (log X_{ij}):
We predict log counts, not raw counts. This is crucial because co-occurrence counts are heavy-tailed and span several orders of magnitude; taking logs compresses that range into something a dot product can fit, and it turns ratios of co-occurrence probabilities into differences of dot products, which is exactly the structure the derivation above requires.
2. The bilinear form (w_i^T w̃_j):
The model predicts log co-occurrence as a dot product of word and context vectors, so words with similar co-occurrence distributions are driven toward similar vectors.
3. Bias terms (b_i + b̃_j):
Biases absorb word-specific effects like overall frequency. Without biases, frequent words would need artificially large vectors.
```python
import numpy as np
import matplotlib.pyplot as plt


def glove_weighting_function(x: np.ndarray, x_max: float = 100, alpha: float = 0.75) -> np.ndarray:
    """
    GloVe weighting function f(x).

    Properties:
    - f(0) = 0: zero counts contribute nothing
    - f(x) is non-decreasing: more counts = more weight
    - f(x) is bounded: very frequent pairs don't dominate

    Args:
        x: Co-occurrence counts
        x_max: Saturation threshold
        alpha: Power parameter (the paper uses 3/4)

    Returns:
        Weights in [0, 1]
    """
    # f(x) = (x/x_max)^alpha if x < x_max, else 1
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)


# Visualize the weighting function
x = np.linspace(0, 150, 300)
y = glove_weighting_function(x)

plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.plot(x, y, 'b-', linewidth=2)
plt.axhline(y=1, color='r', linestyle='--', alpha=0.5, label='Maximum weight')
plt.axvline(x=100, color='g', linestyle='--', alpha=0.5, label='x_max')
plt.xlabel('Co-occurrence count X_ij')
plt.ylabel('Weight f(X_ij)')
plt.title('GloVe Weighting Function')
plt.legend()
plt.grid(True, alpha=0.3)

# Compare different alpha values
plt.subplot(1, 2, 2)
for alpha in [0.5, 0.75, 1.0]:
    y = glove_weighting_function(x, alpha=alpha)
    plt.plot(x, y, label=f'α = {alpha}')
plt.xlabel('Co-occurrence count X_ij')
plt.ylabel('Weight f(X_ij)')
plt.title('Effect of α parameter')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('glove_weighting.png', dpi=150)

# Why this weighting matters
print("Weighting function values at various counts:")
for count in [1, 5, 10, 50, 100, 500, 10000]:
    weight = glove_weighting_function(np.array([count]))[0]
    print(f"  f({count}) = {weight:.4f}")

# Output:
# f(1) = 0.0316
# f(5) = 0.1057
# f(10) = 0.1778
# f(50) = 0.5946
# f(100) = 1.0000
# f(500) = 1.0000
# f(10000) = 1.0000
```

Without the weighting function, extremely frequent pairs ('the'-'of') would dominate the loss, forcing the model to fit these perfectly at the expense of rare but informative pairs. The weighting function caps the influence of frequent pairs while still giving some weight to moderately frequent ones.
Training GloVe differs significantly from Word2Vec. Instead of stochastic updates from text windows, GloVe performs batch optimization on the co-occurrence matrix.
Training procedure (implemented in the sketch below):
1. Build the co-occurrence matrix X in a single pass over the corpus.
2. Initialize word vectors W, context vectors W̃, and the bias terms.
3. Repeatedly sweep over the non-zero entries of X in random order, updating parameters to reduce the weighted squared error (the reference implementation and the code below use AdaGrad).
4. After training, take W + W̃ (or just W) as the final embeddings.
```python
import numpy as np
from scipy.sparse import csr_matrix
from typing import Tuple


class GloVeTrainer:
    """
    GloVe training implementation.
    """

    def __init__(
        self,
        embedding_dim: int = 100,
        x_max: float = 100.0,
        alpha: float = 0.75,
        learning_rate: float = 0.05,
        epochs: int = 50
    ):
        self.embedding_dim = embedding_dim
        self.x_max = x_max
        self.alpha = alpha
        self.learning_rate = learning_rate
        self.epochs = epochs

        self.W = None        # Word embeddings
        self.W_tilde = None  # Context embeddings
        self.b = None        # Word bias
        self.b_tilde = None  # Context bias

    def _weighting_function(self, x: np.ndarray) -> np.ndarray:
        """GloVe weighting function."""
        return np.where(x < self.x_max, (x / self.x_max) ** self.alpha, 1.0)

    def _initialize_parameters(self, vocab_size: int):
        """Random initialization of parameters."""
        # Initialize uniformly in [-0.5/dim, 0.5/dim]
        bound = 0.5 / self.embedding_dim
        self.W = np.random.uniform(-bound, bound, (vocab_size, self.embedding_dim))
        self.W_tilde = np.random.uniform(-bound, bound, (vocab_size, self.embedding_dim))
        self.b = np.zeros(vocab_size)
        self.b_tilde = np.zeros(vocab_size)

        # AdaGrad accumulators
        self.grad_W = np.ones_like(self.W)
        self.grad_W_tilde = np.ones_like(self.W_tilde)
        self.grad_b = np.ones_like(self.b)
        self.grad_b_tilde = np.ones_like(self.b_tilde)

    def fit(self, cooc_matrix: csr_matrix) -> Tuple[np.ndarray, np.ndarray]:
        """
        Train GloVe embeddings on co-occurrence matrix.

        Args:
            cooc_matrix: Sparse co-occurrence matrix

        Returns:
            Tuple of (word_embeddings, context_embeddings)
        """
        vocab_size = cooc_matrix.shape[0]
        self._initialize_parameters(vocab_size)

        # Get non-zero entries
        rows, cols = cooc_matrix.nonzero()
        data = np.array(cooc_matrix[rows, cols]).flatten()

        # Pre-compute weights and log targets
        weights = self._weighting_function(data)
        log_cooc = np.log(data)

        n_samples = len(rows)

        for epoch in range(self.epochs):
            # Shuffle training samples
            perm = np.random.permutation(n_samples)
            total_loss = 0.0

            for idx in perm:
                i, j = rows[idx], cols[idx]
                weight = weights[idx]
                log_target = log_cooc[idx]

                # Compute prediction: w_i^T w̃_j + b_i + b̃_j
                prediction = np.dot(self.W[i], self.W_tilde[j]) + self.b[i] + self.b_tilde[j]

                # Compute error
                diff = prediction - log_target
                loss = weight * diff ** 2
                total_loss += loss

                # Compute gradients
                grad_common = weight * diff
                grad_w = grad_common * self.W_tilde[j]
                grad_w_tilde = grad_common * self.W[i]
                grad_b = grad_common
                grad_b_tilde = grad_common

                # AdaGrad accumulators
                self.grad_W[i] += grad_w ** 2
                self.grad_W_tilde[j] += grad_w_tilde ** 2
                self.grad_b[i] += grad_b ** 2
                self.grad_b_tilde[j] += grad_b_tilde ** 2

                # Update parameters
                self.W[i] -= self.learning_rate * grad_w / np.sqrt(self.grad_W[i])
                self.W_tilde[j] -= self.learning_rate * grad_w_tilde / np.sqrt(self.grad_W_tilde[j])
                self.b[i] -= self.learning_rate * grad_b / np.sqrt(self.grad_b[i])
                self.b_tilde[j] -= self.learning_rate * grad_b_tilde / np.sqrt(self.grad_b_tilde[j])

            avg_loss = total_loss / n_samples
            if (epoch + 1) % 10 == 0:
                print(f"Epoch {epoch + 1}/{self.epochs}, Loss: {avg_loss:.6f}")

        return self.W, self.W_tilde

    def get_embeddings(self, combine: bool = True) -> np.ndarray:
        """
        Get final word embeddings.

        Args:
            combine: If True, return W + W_tilde (recommended by the GloVe paper)
        """
        if combine:
            return self.W + self.W_tilde
        return self.W


# Example usage (with small vocabulary for demonstration)
# In practice, use the official GloVe C implementation for large corpora

trainer = GloVeTrainer(
    embedding_dim=50,
    learning_rate=0.05,
    epochs=100
)

# Train on the co-occurrence matrix built earlier (`cooc`)
W, W_tilde = trainer.fit(cooc)

# Get final embeddings
embeddings = trainer.get_embeddings(combine=True)
print(f"Learned embeddings shape: {embeddings.shape}")
```

For production use, the official GloVe C implementation is highly optimized; the Python implementation above is for understanding and runs orders of magnitude slower. Pre-trained GloVe vectors (Wikipedia, Common Crawl) are freely available and excellent for most applications.
GloVe and Word2Vec represent different philosophical approaches to learning word representations. Understanding their differences informs practical choices.
| Aspect | Word2Vec (Skip-gram) | GloVe |
|---|---|---|
| Approach | Prediction-based (local) | Count-based (global) |
| Training data | Stream of (word, context) pairs | Co-occurrence matrix |
| Objective | Maximize log probability of context | Minimize weighted MSE on log counts |
| Optimization | SGD on streaming pairs | SGD on matrix entries |
| Memory | Low (stream processing) | High (store full matrix) |
| Training time | Scales with corpus size | Scales with the number of non-zero entries in X (after one pass to build X) |
| Parallelization | Data parallelism (training pairs) | Easy parallelism (matrix rows independent) |
Despite the theoretical differences, empirical studies show GloVe and Word2Vec produce embeddings of comparable quality on most downstream tasks. The choice often comes down to practical factors: memory constraints favor Word2Vec; available pre-trained vectors may favor GloVe; need for incremental updates favors Word2Vec.
Stanford provides pre-trained GloVe vectors on several corpora. These are the most common starting points:
Available Pre-trained Vectors:
| Name | Corpus | Vocabulary | Dimensions | Size |
|---|---|---|---|---|
| glove.6B | Wikipedia + Gigaword | 400K | 50, 100, 200, 300 | 66MB-1GB |
| glove.42B | Common Crawl 42B tokens | 1.9M | 300 | 5.3GB |
| glove.840B | Common Crawl 840B tokens | 2.2M | 300 | 5.6GB |
| glove.twitter.27B | Twitter, 27B tokens | 1.2M | 25, 50, 100, 200 | 1.5GB |
```python
import numpy as np
from typing import Dict, Tuple

import gensim.downloader as api


# Option 1: Load via gensim (easiest)
def load_glove_via_gensim(name: str = 'glove-wiki-gigaword-100'):
    """
    Load GloVe vectors via gensim's downloader.

    Available models:
    - glove-wiki-gigaword-50
    - glove-wiki-gigaword-100
    - glove-wiki-gigaword-200
    - glove-wiki-gigaword-300
    - glove-twitter-25
    - glove-twitter-50
    - glove-twitter-100
    - glove-twitter-200
    """
    print(f"Loading {name}...")
    model = api.load(name)
    print(f"Loaded {len(model)} vectors of dimension {model.vector_size}")
    return model


# Option 2: Load from file directly (for raw GloVe files)
def load_glove_from_file(filepath: str) -> Tuple[Dict[str, np.ndarray], int]:
    """
    Load GloVe vectors from raw text file.

    Format: word dim1 dim2 dim3 ... dimN
    """
    word_vectors = {}
    embedding_dim = None

    print(f"Loading GloVe from {filepath}...")
    with open(filepath, 'r', encoding='utf-8') as f:
        for line_num, line in enumerate(f):
            values = line.rstrip().split(' ')
            word = values[0]

            try:
                vector = np.array([float(x) for x in values[1:]])
                if embedding_dim is None:
                    embedding_dim = len(vector)
                word_vectors[word] = vector
            except ValueError:
                print(f"Skipping malformed line {line_num}: {word}")
                continue

            if (line_num + 1) % 100000 == 0:
                print(f"  Loaded {line_num + 1:,} vectors...")

    print(f"Loaded {len(word_vectors):,} vectors of dimension {embedding_dim}")
    return word_vectors, embedding_dim


# Option 3: Convert to gensim format for compatibility
def convert_glove_to_gensim_format(glove_file: str, output_file: str):
    """
    Convert a raw GloVe file to gensim KeyedVectors format.

    Adds the header line required by gensim.
    """
    from gensim.models import KeyedVectors
    from gensim.scripts.glove2word2vec import glove2word2vec

    # Convert format
    temp_file = output_file + '.tmp'
    glove2word2vec(glove_file, temp_file)

    # Load and save in native format
    model = KeyedVectors.load_word2vec_format(temp_file, binary=False)
    model.save(output_file)

    import os
    os.remove(temp_file)

    return model


# Usage example
# Method 1: Via gensim downloader
glove = load_glove_via_gensim('glove-wiki-gigaword-100')

# Test the embeddings
print("Word similarity:")
print(f"similarity('king', 'queen') = {glove.similarity('king', 'queen'):.4f}")
print(f"similarity('cat', 'dog') = {glove.similarity('cat', 'dog'):.4f}")
print(f"similarity('cat', 'car') = {glove.similarity('cat', 'car'):.4f}")

print("Word analogies:")
result = glove.most_similar(positive=['king', 'woman'], negative=['man'], topn=3)
print(f"king - man + woman = {result}")

# Nearest neighbors
print("Nearest neighbors of 'python':")
for word, sim in glove.most_similar('python', topn=5):
    print(f"  {word}: {sim:.4f}")
```

For general NLP: use glove.6B (300d) or glove.840B — trained on clean, diverse text. For social media: use glove.twitter — it includes informal language, slang, and hashtags. For limited compute: use the 100d or 200d versions — often 90-95% as effective as 300d at half the size.
When using GloVe embeddings in applications, several practical factors affect performance:
```python
import numpy as np
from typing import Optional


class GloVeEmbedder:
    """
    Production-ready GloVe embedding wrapper with best practices.
    """

    def __init__(
        self,
        word_vectors,
        normalize: bool = True,
        lowercase: bool = True,
        oov_strategy: str = 'zero'  # 'zero', 'random', 'mean'
    ):
        self.word_vectors = word_vectors
        self.normalize = normalize
        self.lowercase = lowercase
        self.oov_strategy = oov_strategy
        self.dim = word_vectors.vector_size

        # Precompute mean vector for the 'mean' OOV strategy
        if oov_strategy == 'mean':
            sample_size = min(10000, len(word_vectors))
            sample_words = list(word_vectors.key_to_index.keys())[:sample_size]
            sample_vecs = [word_vectors[w] for w in sample_words]
            self.mean_vector = np.mean(sample_vecs, axis=0)
        else:
            self.mean_vector = None

        # Cache for normalized vectors
        if normalize:
            self._normalized_cache = {}

    def get_word_vector(self, word: str) -> np.ndarray:
        """Get embedding for a single word with OOV handling."""
        if self.lowercase:
            word = word.lower()

        if word in self.word_vectors:
            vec = self.word_vectors[word]
            if self.normalize:
                if word not in self._normalized_cache:
                    norm = np.linalg.norm(vec)
                    self._normalized_cache[word] = vec / norm if norm > 0 else vec
                return self._normalized_cache[word]
            return vec

        # OOV handling
        if self.oov_strategy == 'zero':
            return np.zeros(self.dim)
        elif self.oov_strategy == 'random':
            # Small random vector
            return np.random.randn(self.dim) * 0.01
        elif self.oov_strategy == 'mean':
            return self.mean_vector
        else:
            raise ValueError(f"Unknown OOV strategy: {self.oov_strategy}")

    def get_document_vector(
        self,
        words: list,
        weights: Optional[np.ndarray] = None
    ) -> np.ndarray:
        """Get average embedding for a document with optional weighting."""
        vectors = []
        actual_weights = []

        for i, word in enumerate(words):
            vec = self.get_word_vector(word)
            if not np.allclose(vec, 0):  # Skip zero (OOV) vectors
                vectors.append(vec)
                if weights is not None:
                    actual_weights.append(weights[i])

        if not vectors:
            return np.zeros(self.dim)

        vectors = np.array(vectors)

        if weights is not None and actual_weights:
            actual_weights = np.array(actual_weights)
            avg = np.average(vectors, weights=actual_weights, axis=0)
        else:
            avg = np.mean(vectors, axis=0)

        if self.normalize:
            norm = np.linalg.norm(avg)
            if norm > 0:
                avg = avg / norm

        return avg

    def get_coverage(self, words: list) -> float:
        """Compute vocabulary coverage for a list of words."""
        covered = sum(
            1 for w in words
            if (w.lower() if self.lowercase else w) in self.word_vectors
        )
        return covered / len(words) if words else 0.0


# Example usage
embedder = GloVeEmbedder(
    glove,
    normalize=True,
    lowercase=True,
    oov_strategy='zero'
)

# Single word
king_vec = embedder.get_word_vector('King')  # Will be lowercased
print(f"'King' vector norm: {np.linalg.norm(king_vec):.4f}")  # Should be 1.0

# Document
doc = "The neural network learns patterns from data".split()
doc_vec = embedder.get_document_vector(doc)
print(f"Document vector norm: {np.linalg.norm(doc_vec):.4f}")

# Coverage check
technical_doc = "BERT transformer attention mechanism self-supervised".split()
coverage = embedder.get_coverage(technical_doc)
print(f"Vocabulary coverage: {coverage:.1%}")
```

Several advanced techniques extend the basic GloVe approach:
1. Retrofitting to Lexical Resources:
Post-processing GloVe vectors to inject knowledge from external resources (WordNet, FrameNet). The retrofitting objective:
$$\Psi = \sum_{i=1}^{n} \left[ \alpha_i \|\hat{w}_i - w_i\|^2 + \sum_{j:(i,j)\in E} \beta_{ij} \|\hat{w}_i - \hat{w}_j\|^2 \right]$$
This pulls related words (from knowledge base edges E) closer together while staying near the original embeddings.
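A minimal sketch of the iterative retrofitting update (the closed-form per-word step from Faruqui et al.; the `lexicon` dictionary and the uniform α and β values are illustrative assumptions):

```python
import numpy as np


def retrofit(vectors: dict, lexicon: dict, iterations: int = 10,
             alpha: float = 1.0, beta: float = 1.0) -> dict:
    """Pull each vector toward its lexicon neighbours while staying near the original.

    vectors: word -> original embedding (np.ndarray)
    lexicon: word -> list of related words (the edges E, e.g. from WordNet)
    """
    new_vecs = {w: v.copy() for w, v in vectors.items()}
    for _ in range(iterations):
        for word, neighbours in lexicon.items():
            nbrs = [n for n in neighbours if n in new_vecs]
            if word not in new_vecs or not nbrs:
                continue
            # Closed-form minimizer of the retrofitting objective for this word
            numerator = alpha * vectors[word] + beta * sum(new_vecs[n] for n in nbrs)
            new_vecs[word] = numerator / (alpha + beta * len(nbrs))
    return new_vecs


# Toy usage with 3-d vectors (illustrative only)
vecs = {'happy': np.array([1.0, 0.0, 0.0]),
        'glad': np.array([0.0, 1.0, 0.0]),
        'sad': np.array([0.0, 0.0, 1.0])}
retro = retrofit(vecs, {'happy': ['glad'], 'glad': ['happy']})
print(retro['happy'])
```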
2. Debiasing:
Remove gender, racial, or other biases from embeddings. The classic approach identifies a "bias direction" and projects it out:
$$w_{\text{debiased}} = w - (w \cdot d) \cdot d$$
where d is the identified bias direction (e.g., from 'he'-'she' pairs).
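A minimal sketch of this projection, assuming `glove` is the gensim model loaded earlier; using a single he/she pair is a simplification of the original method, which averages several definitional pairs:

```python
import numpy as np


def debias(vec: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of vec along the (unit-normalized) bias direction."""
    d = direction / np.linalg.norm(direction)
    return vec - np.dot(vec, d) * d


# Bias direction from a single definitional pair (a simplification)
direction = glove['he'] - glove['she']
unit = direction / np.linalg.norm(direction)

for word in ['doctor', 'nurse', 'engineer']:
    before = np.dot(glove[word], unit)
    after = np.dot(debias(glove[word], direction), unit)
    print(f"{word}: projection onto he-she axis {before:+.3f} -> {after:+.3f}")
```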
3. Domain Adaptation:
Fine-tune pre-trained GloVe on a domain-specific corpus. Options include continuing training from the pre-trained vectors on a domain co-occurrence matrix, training domain vectors from scratch and combining them with the general-purpose ones, or learning a linear mapping that aligns the two spaces. A sketch of the first option follows.
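A minimal sketch of warm-started continued training, reusing `build_cooccurrence_matrix` and `GloVeTrainer` from earlier on this page; `domain_corpus` is a placeholder for your own tokenized documents, and the small subclass exists only because `fit()` as written re-initializes its parameters:

```python
from collections import Counter

# domain_corpus: a list of tokenized domain documents (placeholder)
word_counts = Counter(w for doc in domain_corpus for w in doc)
vocab = {w: i for i, (w, _) in enumerate(word_counts.most_common(20000))}

domain_cooc = build_cooccurrence_matrix(domain_corpus, vocab, window_size=10)


class WarmStartGloVeTrainer(GloVeTrainer):
    """Keeps warm-started parameters instead of re-initializing inside fit()."""
    def _initialize_parameters(self, vocab_size: int):
        if self.W is None:
            super()._initialize_parameters(vocab_size)


trainer = WarmStartGloVeTrainer(embedding_dim=100, epochs=25)
trainer._initialize_parameters(len(vocab))  # random init first

# Copy pre-trained vectors (the 100-d gensim model loaded earlier) where available
for word, idx in vocab.items():
    if word in glove:
        trainer.W[idx] = glove[word]
        trainer.W_tilde[idx] = glove[word]

W, W_tilde = trainer.fit(domain_cooc)
domain_embeddings = trainer.get_embeddings(combine=True)
```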
Word embeddings capture statistical patterns in training data, including societal biases. Famous examples: 'doctor' is closer to 'man', 'nurse' to 'woman'. For applications in hiring, lending, or other high-stakes domains, bias evaluation and mitigation is essential. See the Word Embedding Association Test (WEAT) for bias measurement.
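As a rough illustration of what WEAT measures, the sketch below computes a differential-association effect size from cosine similarities, again using the `glove` model loaded earlier; the word lists are illustrative, and the full test also reports a permutation-based p-value:

```python
import numpy as np


def cos(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


def association(w, A, B, vecs):
    """s(w, A, B): mean cosine similarity to attribute set A minus to set B."""
    return (np.mean([cos(vecs[w], vecs[a]) for a in A])
            - np.mean([cos(vecs[w], vecs[b]) for b in B]))


def weat_effect_size(X, Y, A, B, vecs):
    """Cohen's-d-style effect size over the two target sets X and Y."""
    s_X = [association(x, A, B, vecs) for x in X]
    s_Y = [association(y, A, B, vecs) for y in Y]
    return (np.mean(s_X) - np.mean(s_Y)) / np.std(s_X + s_Y)


# Illustrative word lists: career vs. family targets, male vs. female attributes
X = ['executive', 'management', 'salary', 'office']
Y = ['home', 'parents', 'children', 'family']
A = ['he', 'man', 'male', 'brother']
B = ['she', 'woman', 'female', 'sister']

print(f"WEAT effect size: {weat_effect_size(X, Y, A, B, glove):.3f}")
```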
We've explored GloVe, a count-based alternative to Word2Vec. The key insights: ratios of co-occurrence probabilities, not raw counts, carry the semantic signal; GloVe fits a log-bilinear model to the global co-occurrence matrix with a weighted least squares objective; the weighting function f(X_{ij}) keeps extremely frequent pairs from dominating the loss; the recommended final embeddings are W + W̃; and in practice GloVe and Word2Vec perform comparably, so the choice usually comes down to memory, available pre-trained vectors, and whether you need incremental updates.
What's next:
Both Word2Vec and GloVe treat each word as an atomic unit—they can't handle words not in their vocabulary. The next page covers FastText, which extends Word2Vec by representing words as bags of character n-grams. This enables embeddings for any word (even misspellings and neologisms) and often produces better representations for morphologically rich languages.
You now have a complete understanding of GloVe—from the theoretical foundation in co-occurrence ratios through the weighted least squares objective to practical usage of pre-trained vectors. Combined with Word2Vec knowledge, you can make informed choices about which embedding approach suits your specific application.