Word2Vec gives us magnificent representations for individual words—but real-world NLP tasks rarely operate at the word level. We classify documents, summarize articles, compare sentences, and cluster reviews. The fundamental challenge becomes: how do we combine individual word embeddings into a meaningful representation for an entire text?
The simplest answer—and often a surprisingly effective one—is to simply average the word vectors. This technique, known as Average Word2Vec or Mean Word Embedding, serves as both a powerful baseline and a foundation for understanding more sophisticated composition methods.
By the end of this page, you will understand why averaging works (and when it doesn't), implement robust averaging with proper handling of edge cases, explore the theoretical justification from the perspective of centroids and Fisher vectors, and learn when average Word2Vec outperforms more complex methods. This simple technique often serves as the strongest baseline you'll encounter.
The core idea is elegantly simple: represent a document as the centroid of its constituent word vectors. Given a document D containing words w₁, w₂, ..., wₙ, the document embedding is:
$$\mathbf{v}_D = \frac{1}{n} \sum_{i=1}^{n} \mathbf{v}_{w_i}$$

where $\mathbf{v}_{w_i}$ is the Word2Vec embedding of word $w_i$.
Geometric interpretation:
In the embedding space, this centroid sits at the "center of mass" of all words in the document. Documents about similar topics will have similar centroids because they contain similar words. A document about "dogs, puppies, veterinarians, and pet food" will have a centroid near the 'pet' region of the embedding space.
```python
import numpy as np
from typing import List, Optional


class AverageWord2Vec:
    """
    Document embedding via averaging word vectors.
    Handles out-of-vocabulary words and empty documents gracefully.
    """

    def __init__(self, word2vec_model, embedding_dim: int = 200):
        """
        Initialize with a trained Word2Vec model.

        Args:
            word2vec_model: A trained gensim Word2Vec model or KeyedVectors
            embedding_dim: Dimension of word embeddings (for zero vector fallback)
        """
        self.model = word2vec_model
        self.embedding_dim = embedding_dim
        # Handle both full model and KeyedVectors
        self.vectors = word2vec_model.wv if hasattr(word2vec_model, 'wv') else word2vec_model

    def get_document_vector(self, words: List[str]) -> np.ndarray:
        """
        Compute the average word vector for a document.

        Args:
            words: List of tokens in the document

        Returns:
            Average embedding as numpy array of shape (embedding_dim,)
        """
        # Collect vectors for in-vocabulary words
        word_vectors = []
        for word in words:
            if word in self.vectors:
                word_vectors.append(self.vectors[word])

        if len(word_vectors) == 0:
            # No known words: return zero vector
            # Alternative: return None and handle downstream
            return np.zeros(self.embedding_dim)

        # Compute centroid (average)
        return np.mean(word_vectors, axis=0)

    def transform(self, documents: List[List[str]]) -> np.ndarray:
        """
        Transform a list of documents to embedding matrix.

        Args:
            documents: List of tokenized documents

        Returns:
            Matrix of shape (n_documents, embedding_dim)
        """
        return np.array([self.get_document_vector(doc) for doc in documents])


# Example usage
from gensim.models import Word2Vec

# Assume we have a trained model
# model = Word2Vec.load("pretrained_word2vec.model")

# Sample documents (tokenized)
documents = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["dogs", "are", "loyal", "companions", "and", "pets"],
    ["machine", "learning", "algorithms", "process", "data"],
]

# Create embedder
embedder = AverageWord2Vec(model, embedding_dim=200)

# Get document embeddings
doc_embeddings = embedder.transform(documents)
print(f"Document embedding matrix shape: {doc_embeddings.shape}")
# Output: Document embedding matrix shape: (3, 200)

# Compare documents via cosine similarity
from sklearn.metrics.pairwise import cosine_similarity

similarities = cosine_similarity(doc_embeddings)
print("Document similarity matrix:")
print(similarities.round(3))
```

Averaging works because Word2Vec embeddings are designed such that similar words have similar vectors. When averaging over a document, words that reinforce the main topic contribute vectors in similar directions, strengthening the signal. Words that are merely syntactic scaffolding ('the', 'is', 'and') contribute in varied directions that tend to cancel out, effectively reducing noise.
Production systems must handle edge cases gracefully. Average Word2Vec encounters several common issues that require careful treatment.
| Edge Case | Problem | Strategy | Trade-offs |
|---|---|---|---|
| Out-of-vocabulary (OOV) words | Word not in Word2Vec vocabulary | Skip OOV words; optionally log warnings | Loses information from rare/new words |
| All words are OOV | Document produces empty average | Return zero vector; or None with downstream handling | Zero vector is not semantically meaningful |
| Empty document | No words to average | Return zero vector; or reject as invalid input | Should rarely occur in practice |
| Very short documents | Few words → high variance embedding | Consider length normalization or minimum length threshold | Short docs may not have stable representations |
| Repeated words | Same word contributes multiple times | Either allow (captures emphasis) or de-duplicate | Repetition may indicate importance |
```python
import numpy as np
from typing import List, Optional, Tuple
from collections import Counter
import logging

logger = logging.getLogger(__name__)


class RobustAverageWord2Vec:
    """
    Production-ready Average Word2Vec with comprehensive edge case handling.
    """

    def __init__(
        self,
        word2vec_model,
        embedding_dim: int = 200,
        min_words: int = 1,
        oov_strategy: str = 'skip',  # 'skip', 'zero', 'random'
        normalize: bool = True,
        deduplicate: bool = False
    ):
        """
        Args:
            word2vec_model: Trained word vectors
            embedding_dim: Embedding dimension
            min_words: Minimum in-vocabulary words required
            oov_strategy: How to handle OOV words
            normalize: Whether to L2-normalize final vectors
            deduplicate: Whether to use each word only once
        """
        self.model = word2vec_model
        self.embedding_dim = embedding_dim
        self.min_words = min_words
        self.oov_strategy = oov_strategy
        self.normalize = normalize
        self.deduplicate = deduplicate
        self.vectors = word2vec_model.wv if hasattr(word2vec_model, 'wv') else word2vec_model

        # Statistics tracking
        self.stats = {
            'total_docs': 0,
            'empty_docs': 0,
            'below_threshold_docs': 0,
            'total_words': 0,
            'oov_words': 0
        }

    def get_document_vector(
        self,
        words: List[str],
        return_coverage: bool = False
    ) -> Tuple[np.ndarray, Optional[float]]:
        """
        Compute robust average embedding with coverage statistics.

        Returns:
            Tuple of (embedding, coverage_ratio)
            coverage_ratio is fraction of words found in vocabulary
        """
        self.stats['total_docs'] += 1

        if len(words) == 0:
            self.stats['empty_docs'] += 1
            vec = np.zeros(self.embedding_dim)
            return (vec, 0.0) if return_coverage else vec

        # Optional deduplication
        if self.deduplicate:
            words = list(set(words))

        self.stats['total_words'] += len(words)

        # Collect vectors and track OOV
        word_vectors = []
        for word in words:
            if word in self.vectors:
                word_vectors.append(self.vectors[word])
            else:
                self.stats['oov_words'] += 1
                if self.oov_strategy == 'zero':
                    word_vectors.append(np.zeros(self.embedding_dim))
                elif self.oov_strategy == 'random':
                    # Small random vector (not recommended but possible)
                    word_vectors.append(np.random.randn(self.embedding_dim) * 0.01)
                # 'skip' strategy: don't add anything

        coverage = len(word_vectors) / len(words) if words else 0.0

        # Check minimum word threshold
        if len(word_vectors) < self.min_words:
            self.stats['below_threshold_docs'] += 1
            logger.warning(
                f"Document has only {len(word_vectors)} known words "
                f"(threshold: {self.min_words})"
            )
            vec = np.zeros(self.embedding_dim)
            return (vec, coverage) if return_coverage else vec

        # Compute average
        avg_vector = np.mean(word_vectors, axis=0)

        # Optional normalization
        if self.normalize:
            norm = np.linalg.norm(avg_vector)
            if norm > 0:
                avg_vector = avg_vector / norm

        return (avg_vector, coverage) if return_coverage else avg_vector

    def get_stats_summary(self) -> dict:
        """Return summary statistics from processing."""
        if self.stats['total_docs'] == 0:
            return self.stats
        return {
            **self.stats,
            'empty_doc_rate': self.stats['empty_docs'] / self.stats['total_docs'],
            'below_threshold_rate': self.stats['below_threshold_docs'] / self.stats['total_docs'],
            'oov_rate': self.stats['oov_words'] / self.stats['total_words'] if self.stats['total_words'] > 0 else 0
        }


# Usage example with logging
embedder = RobustAverageWord2Vec(
    model,
    normalize=True,
    min_words=3,
    oov_strategy='skip'
)

# Process documents
for doc in documents:
    vec, coverage = embedder.get_document_vector(doc, return_coverage=True)
    print(f"Coverage: {coverage:.1%}, Vector norm: {np.linalg.norm(vec):.4f}")

# Check processing statistics
print(embedder.get_stats_summary())
```

L2-normalizing the final document vectors is strongly recommended. Normalization ensures that document similarity depends on direction (semantic content) rather than magnitude (often correlated with document length). Cosine similarity between normalized vectors reduces to a simple dot product, which is faster to compute.
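As a quick sanity check of that last point, here is a minimal sketch (with two toy vectors standing in for averaged document embeddings, values chosen arbitrarily) showing that once vectors are L2-normalized, cosine similarity is just a dot product:

```python
import numpy as np

# Toy stand-ins for two averaged document vectors (arbitrary values)
a = np.array([0.4, -1.2, 0.7, 2.0])
b = np.array([0.1, -0.9, 1.1, 1.5])

def cosine(u, v):
    """Cosine similarity computed from the raw, unnormalized vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# L2-normalize once up front...
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)

# ...and similarity reduces to a plain dot product
print(np.isclose(cosine(a, b), np.dot(a_hat, b_hat)))  # True
```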
A critical question for Average Word2Vec: should we remove stop words before averaging? The answer is nuanced and depends on how your Word2Vec model was trained.
Arguments for removing stop words:

- Stop words carry little topical content; including them pulls the centroid toward a generic region of the space and dilutes the document's semantic signal, especially for short documents.
- Removal is cheap with a fixed list and makes the centroid depend mostly on content words.

Arguments against removing stop words:

- Word2Vec trained with frequent-word subsampling already downweights very common words, so removal often changes little.
- Some "stop words" are task-relevant (for example, 'not' and 'no' in sentiment analysis), and aggressive lists can discard them.
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# NLTK stop words (or use your preferred list)
try:
    from nltk.corpus import stopwords
    STOP_WORDS = set(stopwords.words('english'))
except Exception:
    # Minimal fallback if NLTK or its stopwords corpus is unavailable
    STOP_WORDS = {'the', 'a', 'an', 'is', 'are', 'was', 'were', 'be', 'been', 'being',
                  'have', 'has', 'had', 'do', 'does', 'did', 'will', 'would', 'could',
                  'should', 'may', 'might', 'must', 'shall', 'to', 'of', 'in', 'for',
                  'on', 'with', 'at', 'by', 'from', 'as', 'into', 'through', 'during',
                  'before', 'after', 'above', 'below', 'between', 'under', 'again',
                  'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why',
                  'how', 'all', 'each', 'few', 'more', 'most', 'other', 'some', 'such',
                  'no', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very',
                  'can', 'just', 'now', 'and', 'but', 'or', 'because', 'until', 'while',
                  'this', 'that', 'these', 'those', 'am', 'it', 'its', 'itself',
                  'they', 'them', 'their'}


def average_with_stop_word_options(
    words: list,
    word_vectors,
    remove_stop_words: bool = True,
    custom_stop_words: set = None
) -> np.ndarray:
    """
    Compute average with optional stop word removal.
    """
    stop_words = custom_stop_words if custom_stop_words else STOP_WORDS

    if remove_stop_words:
        words = [w for w in words if w.lower() not in stop_words]

    vectors = [word_vectors[w] for w in words if w in word_vectors]

    if not vectors:
        return np.zeros(word_vectors.vector_size)

    return np.mean(vectors, axis=0)


# Empirical comparison
def compare_stop_word_strategies(documents, word_vectors, labels=None):
    """
    Compare document embeddings with and without stop word removal.
    """
    embeddings_with_stops = []
    embeddings_without_stops = []

    for doc in documents:
        embeddings_with_stops.append(
            average_with_stop_word_options(doc, word_vectors, remove_stop_words=False)
        )
        embeddings_without_stops.append(
            average_with_stop_word_options(doc, word_vectors, remove_stop_words=True)
        )

    with_stops = np.array(embeddings_with_stops)
    without_stops = np.array(embeddings_without_stops)

    # Compare similarity structures
    sim_with = cosine_similarity(with_stops)
    sim_without = cosine_similarity(without_stops)

    print("Similarity matrix WITH stop words:")
    print(sim_with.round(3))
    print("Similarity matrix WITHOUT stop words:")
    print(sim_without.round(3))

    # How different are the similarity rankings?
    from scipy.stats import spearmanr
    # Flatten upper triangles (excluding diagonal)
    upper_with = sim_with[np.triu_indices_from(sim_with, k=1)]
    upper_without = sim_without[np.triu_indices_from(sim_without, k=1)]
    correlation, pvalue = spearmanr(upper_with, upper_without)
    print(f"Rank correlation between strategies: {correlation:.4f} (p={pvalue:.4f})")

    return with_stops, without_stops


# Example documents
docs = [
    "The machine learning algorithm processes the data efficiently".split(),
    "This neural network model is trained on images".split(),
    "The cat sat on the mat in the house".split(),
]

with_stops, without_stops = compare_stop_word_strategies(docs, word_vectors)
```

If using pre-trained embeddings (Google News Word2Vec, GloVe, etc.) that were NOT trained with subsampling, removing stop words generally helps. If using embeddings trained with subsampling (standard gensim Word2Vec), the improvement from stop word removal is smaller but still often beneficial. Always validate empirically on your specific task.
The effectiveness of simple averaging surprises many practitioners. Several theoretical perspectives explain why this simple approach captures meaningful document semantics:
Perspective 1: Maximum Likelihood under Isotropy
Arora et al. (2017) showed that under certain assumptions about the word embedding space, the average embedding is the maximum likelihood estimator for the document's "topic" vector. Specifically, if:

- each document is generated by a slowly drifting latent discourse vector $c$,
- words are emitted by a log-linear model, $P(w \mid c) \propto \exp(\langle \mathbf{v}_w, c \rangle)$, and
- the word vectors are approximately isotropically distributed,
Then the average embedding approximates the latent discourse vector.
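To sketch why (following the log-linear production model from that work, and treating the partition function as approximately constant under isotropy), the document log-likelihood given a discourse vector $c$ is

$$\log P(w_1, \dots, w_n \mid c) \approx \sum_{i=1}^{n} \langle \mathbf{v}_{w_i}, c \rangle + \text{const} = \Big\langle \sum_{i=1}^{n} \mathbf{v}_{w_i},\; c \Big\rangle + \text{const},$$

which is maximized over unit-norm $c$ by $c \propto \sum_{i} \mathbf{v}_{w_i}$, i.e. by the (rescaled) average of the word vectors.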
Perspective 2: First-Order Fisher Vectors
From the perspective of Fisher kernel methods, averaging is the simplest case of Fisher vector aggregation. The Fisher vector describes how a sample differs from a background model. For word embeddings with a uniform prior, the first-order Fisher term reduces to the average.
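As a sketch of that reduction (assuming an isotropic Gaussian background model $\mathcal{N}(\mu, \sigma^2 I)$ over word vectors), the averaged first-order Fisher score with respect to the mean is

$$\frac{1}{n} \sum_{i=1}^{n} \nabla_{\mu} \log \mathcal{N}(\mathbf{v}_{w_i}; \mu, \sigma^2 I) = \frac{1}{\sigma^2} \left( \frac{1}{n} \sum_{i=1}^{n} \mathbf{v}_{w_i} - \mu \right),$$

which, up to centering by $\mu$ and a constant scale, is exactly the average word vector.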
Perspective 3: Bag-of-Embeddings as Kernel
The inner product between two average embeddings equals the average pairwise word similarity:
$$\langle \bar{\mathbf{v}}_A, \bar{\mathbf{v}}_B \rangle = \frac{1}{|A||B|} \sum_{w_i \in A} \sum_{w_j \in B} \langle \mathbf{v}_{w_i}, \mathbf{v}_{w_j} \rangle$$
This is equivalent to a "bag-of-embeddings" kernel that measures cumulative word-level similarity.
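This identity is easy to verify numerically; here is a minimal sketch with random matrices standing in for the word vectors of two documents:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for the word vectors of two documents A and B
A = rng.normal(size=(6, 50))   # 6 "words", 50-dimensional embeddings
B = rng.normal(size=(9, 50))   # 9 "words"

# Inner product of the two average embeddings...
lhs = np.dot(A.mean(axis=0), B.mean(axis=0))

# ...equals the average of all pairwise word-level inner products
rhs = (A @ B.T).mean()

print(np.isclose(lhs, rhs))  # True
```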
Arora et al.'s theoretical analysis led to SIF (Smooth Inverse Frequency) weighting, a principled improvement to simple averaging. SIF downweights frequent words using smooth inverse frequency weights and removes the first principal component (capturing common discourse). This often improves over simple averaging by 2-5% on semantic similarity benchmarks.
```python
import numpy as np
from collections import Counter
from sklearn.decomposition import PCA


class SIFEmbeddings:
    """
    Smooth Inverse Frequency embeddings (Arora et al., 2017).
    A principled improvement over simple averaging.
    """

    def __init__(self, word_vectors, word_frequencies: dict, a: float = 1e-3):
        """
        Args:
            word_vectors: Word embedding dictionary or KeyedVectors
            word_frequencies: Dictionary mapping words to their frequencies (probabilities)
            a: SIF hyperparameter (typically 1e-3 to 1e-4)
        """
        self.word_vectors = word_vectors
        self.word_freq = word_frequencies
        self.a = a
        self.embedding_dim = word_vectors.vector_size
        self.principal_component = None

    def _sif_weight(self, word: str) -> float:
        """Compute SIF weight for a word."""
        freq = self.word_freq.get(word, 1e-9)  # Small default for unknown
        return self.a / (self.a + freq)

    def _weighted_average(self, words: list) -> np.ndarray:
        """Compute SIF-weighted average for a document."""
        weighted_sum = np.zeros(self.embedding_dim)
        total_weight = 0

        for word in words:
            if word in self.word_vectors:
                weight = self._sif_weight(word)
                weighted_sum += weight * self.word_vectors[word]
                total_weight += weight

        if total_weight > 0:
            return weighted_sum / total_weight
        return weighted_sum

    def fit(self, documents: list):
        """
        Fit the model by computing the first principal component.
        This captures the common discourse vector to be removed.
        """
        # Compute weighted averages for all documents
        embeddings = np.array([self._weighted_average(doc) for doc in documents])

        # Remove zero vectors
        non_zero_mask = np.any(embeddings != 0, axis=1)
        if non_zero_mask.sum() > 0:
            non_zero_embeddings = embeddings[non_zero_mask]
            # Compute first principal component
            pca = PCA(n_components=1)
            pca.fit(non_zero_embeddings)
            self.principal_component = pca.components_[0]

        return self

    def transform(self, documents: list) -> np.ndarray:
        """
        Transform documents to SIF embeddings.
        """
        embeddings = np.array([self._weighted_average(doc) for doc in documents])

        # Remove common component
        if self.principal_component is not None:
            for i in range(len(embeddings)):
                projection = np.dot(embeddings[i], self.principal_component)
                embeddings[i] = embeddings[i] - projection * self.principal_component

        return embeddings

    def fit_transform(self, documents: list) -> np.ndarray:
        """Fit and transform in one step."""
        return self.fit(documents).transform(documents)


# Example usage
# First, compute word frequencies from your corpus
def compute_word_frequencies(corpus: list) -> dict:
    """Compute word probability distribution from corpus."""
    word_counts = Counter()
    total = 0
    for doc in corpus:
        word_counts.update(doc)
        total += len(doc)
    return {word: count / total for word, count in word_counts.items()}


word_freq = compute_word_frequencies(training_corpus)

sif = SIFEmbeddings(word_vectors, word_freq, a=1e-3)
doc_embeddings = sif.fit_transform(documents)
```

Average Word2Vec and Bag-of-Words represent different philosophies for document representation. Understanding their relative strengths informs when to use each.
| Property | Average Word2Vec | Bag of Words |
|---|---|---|
| Dimensionality | Fixed (e.g., 200-300) | Vocabulary size (10K-1M+) |
| Density | Dense (all dimensions used) | Very sparse (mostly zeros) |
| Similarity measure | Captures semantic similarity | Captures term overlap only |
| OOV handling | Unknown words ignored | Unknown words become new dimensions |
| Training required | Need pre-trained Word2Vec | No pre-training needed |
| Memory per document | O(d) for d dimensions | O(V) or sparse O(nnz) |
| Interpretability | Dimensions have no clear meaning | Each dimension = specific word |
| Semantic generalization | Yes (synonyms similar) | No (synonyms orthogonal) |
In practice, you can concatenate Average Word2Vec with TF-IDF features to get both semantic and keyword-based signals. Many competitive systems use this hybrid approach, letting the downstream model learn to weight both types of information appropriately.
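A minimal sketch of one way to build that hybrid feature matrix; here `doc_embedder` is a hypothetical callable that wraps any of the averaging implementations on this page and maps a raw string to a 1-D vector:

```python
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer

def hybrid_features(raw_texts, doc_embedder):
    """Concatenate sparse TF-IDF features with dense averaged embeddings.

    doc_embedder: callable mapping a raw string to a 1-D numpy vector,
    e.g. a wrapper around one of the averaging classes above (assumption).
    """
    tfidf = TfidfVectorizer()
    X_tfidf = tfidf.fit_transform(raw_texts)                    # (n_docs, |V|), sparse
    X_dense = np.vstack([doc_embedder(t) for t in raw_texts])   # (n_docs, d), dense
    # Keep the TF-IDF block sparse and append the dense embedding block
    return hstack([X_tfidf, csr_matrix(X_dense)]).tocsr()
```

In practice it is worth scaling or normalizing the two blocks so that neither dominates the downstream model.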
Simple averaging can be extended in several ways. Each variation addresses a specific limitation:
1. Summing instead of Averaging
For some applications, the sum (rather than mean) works better:
$$\mathbf{v}_D^{\text{sum}} = \sum_{i=1}^{n} \mathbf{v}_{w_i}$$
The sum preserves information about document length—longer documents have larger magnitude embeddings. This is useful when document length is informative (e.g., longer reviews may be more detailed).
2. Max Pooling
Take the element-wise maximum across word vectors:
$$\mathbf{v}_D^{\text{max}}[j] = \max_{i} \mathbf{v}_{w_i}[j]$$
Max pooling captures "peak" semantic features—if any word strongly activates a dimension, it's preserved. This can capture rare but important words better than averaging.
```python
import numpy as np


def get_embeddings_matrix(words: list, word_vectors) -> np.ndarray:
    """Get matrix of embeddings for words in vocabulary."""
    vectors = [word_vectors[w] for w in words if w in word_vectors]
    if not vectors:
        return None
    return np.array(vectors)


def average_pooling(words: list, word_vectors) -> np.ndarray:
    """Standard mean pooling."""
    matrix = get_embeddings_matrix(words, word_vectors)
    if matrix is None:
        return np.zeros(word_vectors.vector_size)
    return np.mean(matrix, axis=0)


def sum_pooling(words: list, word_vectors) -> np.ndarray:
    """Sum pooling (preserves document length)."""
    matrix = get_embeddings_matrix(words, word_vectors)
    if matrix is None:
        return np.zeros(word_vectors.vector_size)
    return np.sum(matrix, axis=0)


def max_pooling(words: list, word_vectors) -> np.ndarray:
    """Element-wise max pooling."""
    matrix = get_embeddings_matrix(words, word_vectors)
    if matrix is None:
        return np.zeros(word_vectors.vector_size)
    return np.max(matrix, axis=0)


def min_pooling(words: list, word_vectors) -> np.ndarray:
    """Element-wise min pooling."""
    matrix = get_embeddings_matrix(words, word_vectors)
    if matrix is None:
        return np.zeros(word_vectors.vector_size)
    return np.min(matrix, axis=0)


def concatenated_pooling(words: list, word_vectors) -> np.ndarray:
    """Concatenate multiple pooling strategies."""
    avg = average_pooling(words, word_vectors)
    max_pool = max_pooling(words, word_vectors)
    min_pool = min_pooling(words, word_vectors)
    # Concatenate for richer representation (3x embedding_dim)
    return np.concatenate([avg, max_pool, min_pool])


def hierarchical_average(words: list, word_vectors, chunk_size: int = 10) -> np.ndarray:
    """
    Hierarchical averaging: first average within chunks, then average chunks.
    Can capture some local structure.
    """
    matrix = get_embeddings_matrix(words, word_vectors)
    if matrix is None:
        return np.zeros(word_vectors.vector_size)

    # Split into chunks and average each
    n_chunks = max(1, len(matrix) // chunk_size)
    chunks = np.array_split(matrix, n_chunks)
    chunk_averages = [np.mean(chunk, axis=0) for chunk in chunks]

    # Average the chunk averages
    return np.mean(chunk_averages, axis=0)


# Comparison
doc = "the quick brown fox jumps over the lazy dog".split()

print("Pooling comparison:")
print(f"Average pooling norm: {np.linalg.norm(average_pooling(doc, wv)):.4f}")
print(f"Sum pooling norm: {np.linalg.norm(sum_pooling(doc, wv)):.4f}")
print(f"Max pooling norm: {np.linalg.norm(max_pooling(doc, wv)):.4f}")
print(f"Concatenated dim: {len(concatenated_pooling(doc, wv))}")
```

A generalization is power mean embeddings, computed element-wise as

$$\mathbf{v}_D^{(p)} = \left( \frac{1}{n} \sum_{i=1}^{n} \mathbf{v}_{w_i}^{\,p} \right)^{1/p}$$

When $p = 1$ this is the standard average; as $p \to \infty$ it approaches max pooling, and as $p \to -\infty$ it approaches min pooling. Concatenating power means for multiple $p$ values (e.g., $p = 1$, $p = 3$, $p = \infty$) can capture complementary information.
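A minimal sketch of that idea follows. The function names are illustrative rather than part of any library, `word_vectors` is the same KeyedVectors-style object used above, and non-odd, non-integer exponents are deliberately excluded because powers of negative coordinates are ill-defined there:

```python
import numpy as np

def power_mean_embedding(vectors: np.ndarray, p: float) -> np.ndarray:
    """Element-wise generalized power mean over a (n_words, dim) matrix.

    p = 1 recovers the ordinary average, p -> +inf the element-wise max,
    and p -> -inf the element-wise min. For other exponents this sketch
    restricts itself to odd integers and uses a sign-preserving root.
    """
    if p == 1:
        return vectors.mean(axis=0)
    if np.isinf(p):
        return vectors.max(axis=0) if p > 0 else vectors.min(axis=0)
    if int(p) != p or int(p) % 2 == 0:
        raise ValueError("This sketch supports p = 1, odd integers, or +/-inf only")
    m = np.mean(vectors ** p, axis=0)
    return np.sign(m) * np.abs(m) ** (1.0 / p)

def concatenated_power_means(words, word_vectors, ps=(1, 3, np.inf)):
    """Concatenate power means for several p values (here p = 1, 3, +inf)."""
    matrix = np.array([word_vectors[w] for w in words if w in word_vectors])
    if matrix.size == 0:
        return np.zeros(word_vectors.vector_size * len(ps))
    return np.concatenate([power_mean_embedding(matrix, p) for p in ps])
```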
Average Word2Vec embeddings serve as practical features for a wide range of downstream tasks: text classification, clustering, semantic similarity, and semantic search. The example below wraps the averaging step in a sklearn-compatible transformer, plugs it into a classification pipeline, and reuses it for a simple nearest-neighbor semantic search index.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np


class AverageWord2VecTransformer(BaseEstimator, TransformerMixin):
    """
    Sklearn-compatible transformer for Average Word2Vec.
    Can be used in Pipeline with other sklearn components.
    """

    def __init__(self, word_vectors, normalize=True, remove_stop_words=True):
        self.word_vectors = word_vectors
        self.normalize = normalize
        self.remove_stop_words = remove_stop_words
        self.dim = word_vectors.vector_size
        self.stop_words = set(['the', 'a', 'an', 'is', 'are', 'was', 'of', 'to', 'in', 'for'])

    def fit(self, X, y=None):
        return self  # Nothing to fit

    def transform(self, X):
        """Transform list of tokenized documents to embeddings."""
        embeddings = []
        for doc in X:
            words = doc if isinstance(doc, list) else doc.split()

            if self.remove_stop_words:
                words = [w for w in words if w.lower() not in self.stop_words]

            vectors = [self.word_vectors[w] for w in words if w in self.word_vectors]

            if vectors:
                avg = np.mean(vectors, axis=0)
                if self.normalize:
                    norm = np.linalg.norm(avg)
                    if norm > 0:
                        avg = avg / norm
                embeddings.append(avg)
            else:
                embeddings.append(np.zeros(self.dim))

        return np.array(embeddings)


# Complete classification pipeline
def create_classification_pipeline(word_vectors):
    """Create a complete text classification pipeline."""
    return Pipeline([
        ('embedding', AverageWord2VecTransformer(word_vectors)),
        ('classifier', LogisticRegression(max_iter=1000, C=1.0))
    ])


# Usage
pipeline = create_classification_pipeline(word_vectors)

# X is a list of tokenized documents or space-separated strings
# y is the label array
scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='accuracy')
print(f"Cross-validation accuracy: {scores.mean():.3f} (+/- {scores.std()*2:.3f})")

# Fit and predict
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)


# For semantic search: create index of document embeddings
from sklearn.neighbors import NearestNeighbors


class SemanticSearch:
    def __init__(self, word_vectors, n_neighbors=5):
        self.transformer = AverageWord2VecTransformer(word_vectors)
        self.nn = NearestNeighbors(n_neighbors=n_neighbors, metric='cosine')
        self.documents = None

    def index(self, documents):
        """Build search index from documents."""
        self.documents = documents
        embeddings = self.transformer.transform(documents)
        self.nn.fit(embeddings)
        return self

    def search(self, query, k=5):
        """Find k most similar documents to query."""
        query_embedding = self.transformer.transform([query])
        distances, indices = self.nn.kneighbors(query_embedding, n_neighbors=k)

        results = []
        for dist, idx in zip(distances[0], indices[0]):
            results.append({
                'document': self.documents[idx],
                'similarity': 1 - dist  # cosine distance to similarity
            })
        return results
```

Despite its effectiveness, Average Word2Vec has fundamental limitations that more sophisticated methods address:
| Limitation | Alternative Approach | When to Switch |
|---|---|---|
| Word order matters | Sentence embeddings (BERT, USE) | Sentiment, intent, paraphrase detection |
| Negation/composition | Contextualized embeddings | Fine-grained semantic understanding |
| Unequal word importance | TF-IDF weighted averaging | Information retrieval, document similarity |
| Polysemy (multiple meanings) | Contextualized embeddings (ELMo, BERT) | Word sense disambiguation |
| OOV words | FastText (subword embeddings) | Noisy text, morphologically rich languages |
Despite these limitations, Average Word2Vec is an excellent baseline. Before investing in complex methods like BERT, establish performance with averaged embeddings. You may be surprised how often the simple baseline is competitive, or how often it remains the best choice under the computational and latency constraints of production systems.
We've comprehensively explored the simple yet powerful technique of averaging word vectors. Let's consolidate the key insights:

- Averaging represents a document as the centroid of its word vectors: a fixed-length, dense embedding that captures topical similarity.
- Robust implementations handle OOV words, empty and very short documents, and L2 normalization explicitly, and track vocabulary coverage.
- The approach has theoretical grounding (maximum likelihood under isotropy, Fisher vectors, bag-of-embeddings kernels), and SIF weighting is a principled refinement.
- Variations such as sum, max, min, and power mean pooling trade off different properties; plain averaging remains the default.
- Averaging ignores word order, word importance, and polysemy; weighted or contextualized methods address these when they matter.
What's next:
Simple averaging treats all words equally—but some words are clearly more important than others. The next page covers TF-IDF weighted Word2Vec, which uses term frequency-inverse document frequency weights to emphasize informative words while downweighting common ones. This principled weighting often significantly improves document representation quality.
You now have a complete understanding of Average Word2Vec—from basic implementation through robust production handling to theoretical foundations. This simple technique remains a powerful tool in any NLP practitioner's toolkit and serves as the foundation for understanding more sophisticated composition methods.