Throughout this module, we've built deep understanding of Bag of Words—its elegance, mathematical foundations, and computational efficiency. Yet BoW has fundamental limitations that no amount of clever engineering can overcome.
These limitations aren't edge cases or rare scenarios. They manifest constantly in real-world NLP tasks, and understanding them precisely is crucial for two reasons:
Knowing when BoW is inappropriate — Some tasks fundamentally require capabilities BoW cannot provide. Using BoW anyway wastes time and produces unreliable systems.
Understanding why alternatives emerged — Word embeddings, recurrent networks, attention mechanisms, and transformers all address specific BoW failures. Understanding what's broken helps you understand what's fixed.
This page catalogs BoW's failure modes rigorously, with concrete examples demonstrating each limitation's real-world impact.
By the end of this page, you will understand: (1) The fundamental information loss in BoW's bag assumption, (2) How loss of word order breaks semantics and syntax, (3) Why BoW cannot capture semantic similarity, (4) The curse of dimensionality in text, (5) Context blindness and its consequences, and (6) Which techniques address each limitation.
The defining characteristic of Bag of Words—treating documents as unordered collections—is simultaneously its greatest strength and most severe limitation.
The Problem:
BoW produces identical representations for sentences with completely different meanings. Consider "The dog bit the man" vs. "The man bit the dog", "I love not hating you" vs. "I hate not loving you", and "The cat chased the mouse" vs. "The mouse chased the cat".
All three pairs have identical BoW vectors because they contain exactly the same words. Yet their meanings are opposite or entirely different.
Why This Matters:
Word order encodes crucial linguistic information: who did what to whom (subject vs. object roles), the scope of negation, and which words modify which. All of this is lost in the bag, as the demonstration below shows:
```python
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

def demonstrate_word_order_blindness():
    """
    Show how BoW fails to distinguish sentences with different meanings.
    """
    # Pairs with opposite/different meanings but identical BoW
    sentence_pairs = [
        ("The dog bit the man", "The man bit the dog"),
        ("I love not hating you", "I hate not loving you"),
        ("The movie was not very good", "The very good movie was not"),
        ("She sold him the car", "He sold her the car"),
        ("The cat chased the mouse", "The mouse chased the cat"),
    ]

    vectorizer = CountVectorizer()

    print("Word Order Blindness Demonstration")
    print("=" * 70)

    for s1, s2 in sentence_pairs:
        X = vectorizer.fit_transform([s1, s2])

        # Check if vectors are identical
        v1 = X[0].toarray().flatten()
        v2 = X[1].toarray().flatten()
        identical = np.array_equal(v1, v2)

        print(f"\n'{s1}'")
        print(f"'{s2}'")
        print(f"Identical BoW vectors: {identical}")

        if identical:
            vocab = vectorizer.get_feature_names_out()
            print(f"Both map to: {dict(zip(vocab, v1))}")

demonstrate_word_order_blindness()

# Demonstrate impact on classification
print("\n" + "=" * 70)
print("Impact on Sentiment Classification")
print("=" * 70)

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Training data
train_texts = [
    "I love this product",
    "This is wonderful",
    "Absolutely amazing experience",
    "I hate this product",
    "This is terrible",
    "Absolutely awful experience",
]
train_labels = [1, 1, 1, 0, 0, 0]  # 1=positive, 0=negative

# Test cases that confuse BoW
test_texts = [
    "I don't love this product",   # Should be negative
    "This is not wonderful",       # Should be negative
    "I don't hate this product",   # Should be positive (double negative)
]

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', LogisticRegression())
])

pipeline.fit(train_texts, train_labels)

print("\nTest cases with negation:")
for text in test_texts:
    pred = pipeline.predict([text])[0]
    label = "Positive" if pred == 1 else "Negative"
    print(f"  '{text}' → Predicted: {label}")

print("\n⚠️ BoW ignores negation structure, causing misclassification")
```

"This movie is not good" contains the word 'good', so BoW will weight it as positive. The negation 'not' is just another word—its grammatical function of reversing meaning is lost. N-grams (e.g., 'not_good') partially address this, but increase dimensionality exponentially.
In BoW, every word is an independent dimension. There's no notion that "happy" and "joyful" are similar, or that "dog" is more related to "cat" than to "democracy."
Mathematical Statement:
In BoW's vector space, all vocabulary terms are orthogonal:
$$\vec{v}_{\text{happy}} \cdot \vec{v}_{\text{joyful}} = 0$$
$$\vec{v}_{\text{dog}} \cdot \vec{v}_{\text{cat}} = 0$$
$$\vec{v}_{\text{dog}} \cdot \vec{v}_{\text{democracy}} = 0$$
Every word is equally distant from every other word. This is a catastrophically poor model of language.
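A minimal sketch of this orthogonality, using a toy three-word vocabulary with one-hot axes (the words and vectors are illustrative, not drawn from a real corpus):

```python
import numpy as np

# One axis per vocabulary term: "happy", "joyful", "dog" (toy example)
v_happy  = np.array([1, 0, 0])
v_joyful = np.array([0, 1, 0])
v_dog    = np.array([0, 0, 1])

# Every pair of distinct word axes is orthogonal: the dot product is 0
print(np.dot(v_happy, v_joyful))  # 0: near-synonyms look unrelated
print(np.dot(v_happy, v_dog))     # 0: exactly as "unrelated" as the pair above
```

Nothing in the representation can express that the first pair is near-synonymous while the second is not.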
Real-World Consequences:
Search failures: Searching for "inexpensive hotels" won't find "cheap hotels" or "budget accommodations" unless those exact words appear.
Vocabulary mismatch: If training data uses "automobile" and test data uses "car", there's zero signal—they're completely different dimensions.
Paraphrase blindness: "The quick brown fox" and "The fast brown fox" are semantically identical, yet their BoW cosine similarity is only 0.75 (they share 3 of their 4 words).
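A quick check of that number with a plain CountVectorizer (a sketch; the exact value depends on vectorizer settings such as TF-IDF weighting):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

X = CountVectorizer().fit_transform(
    ["The quick brown fox", "The fast brown fox"]
)
# 3 of the 4 words overlap, so cosine similarity is 3/4 = 0.75, not 1.0
print(cosine_similarity(X[0], X[1])[0, 0])  # 0.75
```

The fuller demonstration below extends this to sentence pairs and a search example.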
```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def demonstrate_semantic_blindness():
    """
    Show how BoW fails to capture semantic similarity.
    """
    # Semantically similar sentence pairs
    similar_pairs = [
        ("The car is fast", "The automobile is quick"),
        ("I am happy", "I am joyful"),
        ("The child is playing", "The kid is having fun"),
        ("Good morning", "Hello"),
    ]

    # Semantically different but lexically similar
    different_pairs = [
        ("The bank is by the river", "The bank approved the loan"),
        ("I saw a bat flying", "He swung the bat"),
        ("The spring is beautiful", "I installed a new spring"),
    ]

    vectorizer = TfidfVectorizer()

    print("Semantic Similarity Failure in BoW")
    print("=" * 70)

    print("\n--- Semantically SIMILAR sentences ---")
    for s1, s2 in similar_pairs:
        X = vectorizer.fit_transform([s1, s2])
        sim = cosine_similarity(X[0], X[1])[0, 0]
        print(f"'{s1}'")
        print(f"'{s2}'")
        print(f"BoW similarity: {sim:.3f} (should be HIGH)")
        print()

    print("--- Lexically similar but semantically DIFFERENT ---")
    for s1, s2 in different_pairs:
        # These share words but have different meanings
        words1 = set(s1.lower().split())
        words2 = set(s2.lower().split())
        shared = words1 & words2

        X = vectorizer.fit_transform([s1, s2])
        sim = cosine_similarity(X[0], X[1])[0, 0]
        print(f"'{s1}'")
        print(f"'{s2}'")
        print(f"Shared words: {shared}")
        print(f"BoW similarity: {sim:.3f} (should be LOW - different meanings)")
        print()

demonstrate_semantic_blindness()

# Demonstrate search failure
print("\n" + "=" * 70)
print("Search Failure Due to Vocabulary Mismatch")
print("=" * 70)

documents = [
    "Affordable cars for sale",
    "Luxury automobiles at great prices",
    "Budget-friendly vehicles available",
    "Premium sedans with financing options",
]
query = "cheap cars"

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)
query_vec = vectorizer.transform([query])

similarities = cosine_similarity(query_vec, X)[0]

print(f"\nQuery: '{query}'")
print("\nDocument rankings (BoW similarity):")
for idx in np.argsort(similarities)[::-1]:
    print(f"  {similarities[idx]:.3f}: '{documents[idx]}'")

print("\n⚠️ Best match is the only doc containing 'cars' - other relevant")
print("   docs using 'automobiles', 'vehicles' are not found!")
```

Word2Vec, GloVe, FastText, and contextual embeddings (BERT) learn dense vector representations where similar words have similar vectors. In these spaces, cos(happy, joyful) ≈ 0.8 because they appear in similar contexts during training.
BoW creates one dimension per vocabulary term. With vocabularies of 50,000-500,000 terms (or millions with n-grams), this creates severe challenges.
The Curse of Dimensionality:
In high-dimensional spaces:
Data becomes sparse: Points are far apart; neighborhoods are empty. The density of data in the space decreases exponentially with dimensionality.
Distance metrics break down: As dimensionality grows, pairwise distances concentrate around their mean, so all points look roughly equidistant and nearest-neighbor comparisons lose discriminative power.
Overfitting becomes easy: With 100,000 features and only 10,000 samples, a linear model can perfectly separate almost any labeling, even a random one, but it won't generalize (a sketch follows this list).
Computation scales poorly: Operations on 100K-dimensional vectors are expensive. Memory grows linearly with dimensionality.
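To make the overfitting point concrete, here is a small sketch on synthetic random data (not real text; the sizes are scaled down so it runs quickly):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# 200 "documents" with 20,000 features and purely random labels
X = rng.random((200, 20_000))
y = rng.integers(0, 2, size=200)

# Train on the first half, evaluate on the held-out second half
clf = LogisticRegression(max_iter=1000).fit(X[:100], y[:100])

print("Train accuracy:", clf.score(X[:100], y[:100]))  # typically 1.0: the noise is memorized
print("Test accuracy: ", clf.score(X[100:], y[100:]))  # near chance (~0.5): nothing generalizes
```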
| Representation | Typical Dimensions | Density | Implications |
|---|---|---|---|
| BoW (unigrams) | 50,000 - 200,000 | 0.1% - 1% | Sparse but high-dim; curse applies |
| BoW (bigrams) | 500,000 - 5,000,000 | 0.01% - 0.1% | Explosion; most features useless |
| BoW (trigrams) | 10,000,000+ | < 0.001% | Computationally prohibitive |
| Word Embeddings | 100 - 300 | 100% (dense) | Low-dim, dense; curse avoided |
| BERT embeddings | 768 | 100% (dense) | Rich semantics in manageable dims |
```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import euclidean_distances, cosine_similarity
from scipy.stats import describe

def analyze_dimensionality_curse(n_docs: int = 1000, vocab_size: int = 50000):
    """
    Demonstrate how high dimensionality affects distance metrics.
    """
    np.random.seed(42)

    # Simulate sparse BoW vectors
    density = 0.01  # 1% non-zero (typical for text)

    # Generate random sparse vectors
    X = np.zeros((n_docs, vocab_size))
    for i in range(n_docs):
        n_nonzero = int(vocab_size * density)
        indices = np.random.choice(vocab_size, n_nonzero, replace=False)
        X[i, indices] = np.random.rand(n_nonzero)

    # Normalize to unit length (common for text)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)

    print(f"Simulated BoW: {n_docs} docs × {vocab_size} features")
    print(f"Density: {density:.1%}")
    print()

    # Analyze pairwise distances:
    # sample random pairs to avoid the full O(n²) computation
    n_pairs = 500
    idx1 = np.random.choice(n_docs, n_pairs)
    idx2 = np.random.choice(n_docs, n_pairs)

    # Cosine similarities
    cos_sims = [cosine_similarity(X[i:i+1], X[j:j+1])[0, 0]
                for i, j in zip(idx1, idx2)]

    # Euclidean distances
    euc_dists = [euclidean_distances(X[i:i+1], X[j:j+1])[0, 0]
                 for i, j in zip(idx1, idx2)]

    print("=== Distance Concentration (Curse of Dimensionality) ===")
    print(f"\nCosine Similarity statistics ({n_pairs} random pairs):")
    stats = describe(cos_sims)
    print(f"  Mean: {stats.mean:.4f}")
    print(f"  Std:  {np.sqrt(stats.variance):.4f}")
    print(f"  Min:  {stats.minmax[0]:.4f}")
    print(f"  Max:  {stats.minmax[1]:.4f}")

    print(f"\nEuclidean Distance statistics:")
    stats = describe(euc_dists)
    print(f"  Mean: {stats.mean:.4f}")
    print(f"  Std:  {np.sqrt(stats.variance):.4f}")
    print(f"  Min:  {stats.minmax[0]:.4f}")
    print(f"  Max:  {stats.minmax[1]:.4f}")

    # The key insight: distances concentrate around the mean
    cos_range = max(cos_sims) - min(cos_sims)
    cos_relative_range = cos_range / np.mean(cos_sims) if np.mean(cos_sims) > 0 else 0

    print(f"\n⚠️ Range of similarities: {cos_range:.4f}")
    print(f"   Most pairs have similar similarity scores!")
    print(f"   This makes nearest-neighbor search unreliable.")

    return X

X = analyze_dimensionality_curse()

# Demonstrate that dimensionality reduction helps
print("\n" + "=" * 60)
print("Dimensionality Reduction (TruncatedSVD)")
print("=" * 60)

svd = TruncatedSVD(n_components=100, random_state=42)
X_reduced = svd.fit_transform(X)

print(f"Original: {X.shape}")
print(f"Reduced:  {X_reduced.shape}")
print(f"Explained variance: {svd.explained_variance_ratio_.sum():.2%}")

# Distance stats after reduction
n_pairs = 500
idx1 = np.random.choice(X_reduced.shape[0], n_pairs)
idx2 = np.random.choice(X_reduced.shape[0], n_pairs)

cos_sims_reduced = [cosine_similarity(X_reduced[i:i+1], X_reduced[j:j+1])[0, 0]
                    for i, j in zip(idx1, idx2)]

print(f"\nCosine similarity after reduction:")
print(f"  Range: {max(cos_sims_reduced) - min(cos_sims_reduced):.4f}")
print(f"  (Wider range = more discriminative distances)")
```

In BoW, each word type has exactly one representation, regardless of context. This ignores polysemy (words with multiple meanings) and context-dependent interpretation.
Polysemy Examples:
BoW assigns identical feature weights to "bank" in "I deposited money at the bank" (a financial institution) and "We walked along the river bank" (a riverside).
The word's context—which determines its meaning—is invisible.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def demonstrate_polysemy_failure():
    """
    Show how BoW conflates different meanings of the same word.
    """
    # Sentences with polysemous words
    polysemy_examples = {
        "bank": [
            "I deposited money at the bank",
            "We walked along the river bank",
            "The bank approved my loan application",
            "Fish live near the bank of the stream",
        ],
        "apple": [
            "I ate a delicious apple for lunch",
            "Apple released a new iPhone today",
            "The apple fell from the tree",
            "Apple stock price increased significantly",
        ],
        "python": [
            "I killed a python in my backyard",
            "I wrote a Python script for data analysis",
            "The python constricted its prey",
            "Python is my favorite programming language",
        ],
    }

    vectorizer = TfidfVectorizer()

    print("Context Blindness: Polysemy Failure")
    print("=" * 70)

    for word, sentences in polysemy_examples.items():
        print(f"\n--- '{word}' in different contexts ---")

        X = vectorizer.fit_transform(sentences)

        # Compute pairwise similarities
        sims = cosine_similarity(X)

        # Group by actual meaning: sentences 0 and 2 share one sense,
        # sentences 1 and 3 share the other
        same_meaning = (sims[0, 2] + sims[1, 3]) / 2
        diff_meaning = (sims[0, 1] + sims[0, 3] + sims[2, 1] + sims[2, 3]) / 4

        for i, sent in enumerate(sentences):
            print(f"  [{i}] {sent}")

        print(f"\n  Similarity between SAME meanings (0-2, 1-3): {same_meaning:.3f}")
        print(f"  Similarity between DIFFERENT meanings: {diff_meaning:.3f}")
        print(f"  ⚠️ BoW cannot distinguish: '{word}' = '{word}' always")

demonstrate_polysemy_failure()

# Demonstrate how context would help
print("\n" + "=" * 70)
print("What Context Would Tell Us")
print("=" * 70)

# If we could use context words as features for the TARGET word
context_features = {
    "bank (financial)": ["deposit", "loan", "money", "account", "savings"],
    "bank (river)": ["river", "stream", "water", "fish", "shore"],
    "apple (fruit)": ["eat", "fruit", "tree", "red", "delicious"],
    "apple (company)": ["iPhone", "stock", "release", "technology", "CEO"],
}

print("\nContext words that would disambiguate:")
for sense, context in context_features.items():
    print(f"  {sense}: {context}")

print("\n💡 Contextual embeddings (BERT, ELMo) use surrounding words")
print("   to produce different vectors for the SAME word in different contexts.")
```

Static embeddings (Word2Vec, GloVe) also have this problem—'bank' has one vector. Contextual embeddings (ELMo, BERT, GPT) compute word vectors dynamically based on surrounding context, producing different embeddings for 'bank' in different sentences. This is a major advancement over BoW.
BoW's one-feature-per-word design leads to severe feature sparsity: most features are zero for most documents, and many features are rarely useful.
The Long Tail Problem:
Word frequencies follow Zipf's Law: the frequency of the r-th most common word is roughly proportional to 1/r.
This means a handful of very frequent words account for most of the tokens in any corpus, while the vast majority of vocabulary terms appear only once or twice, far too rarely to learn reliable feature weights from.
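A rough sketch of the rank-frequency pattern, using the same 20 Newsgroups data analyzed in the fuller script below (whitespace tokenization is deliberately crude):

```python
from collections import Counter
from sklearn.datasets import fetch_20newsgroups

texts = fetch_20newsgroups(subset='train',
                           remove=('headers', 'footers', 'quotes')).data
counts = Counter(word for doc in texts for word in doc.lower().split())

# Under Zipf's Law, frequency(rank r) ≈ frequency(rank 1) / r,
# so rank × frequency stays roughly constant for the top-ranked words.
for rank, (word, freq) in enumerate(counts.most_common(10), start=1):
    print(f"rank {rank:2d}  {word!r:12} freq={freq:7d}  rank*freq={rank * freq:8d}")
```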
```python
import numpy as np
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups

# Load real text data
print("Loading 20 Newsgroups dataset...")
newsgroups = fetch_20newsgroups(subset='train',
                                remove=('headers', 'footers', 'quotes'))
texts = newsgroups.data

print(f"Documents: {len(texts)}")

# Analyze vocabulary statistics
vectorizer = CountVectorizer(min_df=1, max_df=1.0)
X = vectorizer.fit_transform(texts)
vocab = vectorizer.get_feature_names_out()

print(f"Vocabulary size: {len(vocab):,}")

# Document frequency for each term
doc_freq = np.array((X > 0).sum(axis=0)).flatten()

print("\n=== Feature Frequency Distribution ===")
print("\nDocument Frequency Buckets:")

buckets = [
    (1, 1, "Appears in 1 doc (hapax legomena)"),
    (2, 5, "Appears in 2-5 docs"),
    (6, 20, "Appears in 6-20 docs"),
    (21, 100, "Appears in 21-100 docs"),
    (101, 500, "Appears in 101-500 docs"),
    (501, 1000, "Appears in 501-1000 docs"),
    (1001, len(texts), "Appears in 1000+ docs"),
]

for low, high, desc in buckets:
    count = np.sum((doc_freq >= low) & (doc_freq <= high))
    pct = 100 * count / len(vocab)
    print(f"  {desc:40s}: {count:6,} ({pct:5.1f}%)")

# What fraction of features are "useful"?
# Define useful as: appears in 2+ docs AND < 50% of docs
useful_mask = (doc_freq >= 2) & (doc_freq <= len(texts) * 0.5)
n_useful = np.sum(useful_mask)

print(f"\n'Useful' features (df >= 2 and df <= 50%): {n_useful:,} "
      f"({100*n_useful/len(vocab):.1f}%)")

# Sparsity of the matrix
total_entries = X.shape[0] * X.shape[1]
nnz = X.nnz
sparsity = 1 - (nnz / total_entries)

print(f"\n=== Matrix Sparsity ===")
print(f"Total entries: {total_entries:,}")
print(f"Non-zero entries: {nnz:,}")
print(f"Sparsity: {sparsity:.4%}")
print(f"Average non-zeros per document: {nnz/X.shape[0]:.1f}")

# Data requirements: how many samples to reliably learn a feature?
print(f"\n=== Data Requirements ===")
print("To reliably learn from a feature, need sufficient positive examples.")
print("Features appearing in < 10 docs are essentially noise:")
rare_features = np.sum(doc_freq < 10)
print(f"  Features with df < 10: {rare_features:,} ({100*rare_features/len(vocab):.1f}%)")
```

In BoW, each unique word creates a feature, but most features are useless: either too rare (insufficient data) or too common (non-discriminative). This is wasteful compared to dense embeddings where every dimension carries information.
Given its fundamental limitations, certain NLP tasks are impossible or severely degraded with BoW representations.
| Task | Why BoW Fails | What's Needed | Modern Solution |
|---|---|---|---|
| Machine Translation | Word order is meaning; can't generate sequences | Sequence-to-sequence modeling | Transformer encoder-decoder |
| Question Answering | Can't locate answer in context; no span extraction | Reading comprehension, attention | BERT + span prediction |
| Named Entity Recognition | No position, no sequence; entities are contextual | Sequential labeling, context | BiLSTM-CRF, BERT |
| Coreference Resolution | Can't track entities across sentences | Document-level context | Neural coref models |
| Text Generation | No sequentiality; can't produce ordered output | Language modeling | GPT, LLMs |
| Semantic Similarity (nuanced) | Synonyms are orthogonal; paraphrases don't match | Semantic embeddings | Sentence-BERT, SimCSE |
| Sarcasm/Irony Detection | Surface words contradict meaning; no pragmatics | Context, world knowledge | Large language models |
Where BoW Still Works:
Despite these limitations, BoW remains effective for:
Topic Classification: When topics have distinct vocabularies (sports vs. politics vs. technology), word presence is sufficient.
Spam Detection: Spam has characteristic words ("free", "winner", "click") that appear regardless of order.
Authorship Attribution: Authors have distinctive vocabulary patterns that BoW captures well.
Keyword-based Information Retrieval: When exact word matching is acceptable, BoW works.
Fast Baselines: BoW + logistic regression is the standard baseline for any text classification task.
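A minimal version of that baseline, as a sketch on two topically distinct 20 Newsgroups categories (the category names and parameters are illustrative choices, not prescribed by this module):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Two topics whose vocabularies barely overlap: ideal territory for BoW
cats = ['rec.sport.hockey', 'sci.space']
train = fetch_20newsgroups(subset='train', categories=cats,
                           remove=('headers', 'footers', 'quotes'))
test = fetch_20newsgroups(subset='test', categories=cats,
                          remove=('headers', 'footers', 'quotes'))

baseline = Pipeline([
    ('tfidf', TfidfVectorizer(min_df=2)),
    ('clf', LogisticRegression(max_iter=1000)),
])
baseline.fit(train.data, train.target)

# Trains in seconds and typically scores well above 90% accuracy
print("Test accuracy:", baseline.score(test.data, test.target))
```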
The key insight: BoW fails when structure matters—word order, semantic relationships, context-dependent meaning. It succeeds when vocabulary distribution alone distinguishes classes.
Various techniques have been developed to address BoW's limitations, ranging from extensions within the BoW framework to fundamentally different representations.
```python
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfVectorizer
)
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
import numpy as np

# Sample documents for comparison
documents = [
    "The movie was not good at all",
    "I really liked the movie, it was great",
    "The film was terrible, I hated it",
    "An excellent cinema experience overall",
]

print("Comparing BoW and Its Improvements")
print("=" * 70)

# 1. Basic BoW (Unigrams)
print("\n1. Basic BoW (Unigrams):")
bow = CountVectorizer()
X_bow = bow.fit_transform(documents)
print(f"   Features: {len(bow.get_feature_names_out())}")
print(f"   Sample vocab: {list(bow.get_feature_names_out())[:10]}")

# 2. BoW with Bigrams
print("\n2. BoW with Bigrams (1,2-grams):")
bow_bigram = CountVectorizer(ngram_range=(1, 2))
X_bigram = bow_bigram.fit_transform(documents)
print(f"   Features: {len(bow_bigram.get_feature_names_out())}")
bigram_features = [f for f in bow_bigram.get_feature_names_out() if ' ' in f]
print(f"   Sample bigrams: {bigram_features[:10]}")

# 3. TF-IDF
print("\n3. TF-IDF Weighting:")
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(documents)
print(f"   Features: {len(tfidf.get_feature_names_out())}")
print(f"   Now 'the' has lower weight than 'excellent'")

# 4. LSA (Latent Semantic Analysis)
print("\n4. LSA (TF-IDF + SVD):")
# Note: with only 4 documents, at most 4 components can actually be
# recovered, regardless of the n_components requested.
lsa = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('svd', TruncatedSVD(n_components=10, random_state=42))
])
X_lsa = lsa.fit_transform(documents)
print(f"   Reduced to: {X_lsa.shape[1]} semantic dimensions")
print(f"   Dense representation (all dimensions used)")

# Show limitation progression
print("\n" + "=" * 70)
print("Limitation vs. Solution Matrix")
print("=" * 70)

solutions = [
    ("Word Order", "N-grams", "Bigrams capture local order; 'not good' ≠ 'good not'"),
    ("Common Words", "TF-IDF", "IDF downweights words appearing in many docs"),
    ("Synonyms", "LSA", "SVD finds latent dimensions grouping similar words"),
    ("High Dimensions", "LSA/SVD", "Reduces to 100-500 useful dimensions"),
    ("OOV Words", "FastText", "Subword embeddings handle never-seen words"),
    ("Context", "BERT", "Same word, different context → different vector"),
    ("Full Semantics", "LLMs", "Transformers model complete linguistic structure"),
]

for limitation, solution, how in solutions:
    print(f"\n{limitation}:")
    print(f"  → Solution: {solution}")
    print(f"  → How: {how}")
```

We've rigorously analyzed the limitations of Bag of Words—not to dismiss the technique, but to understand exactly when it's appropriate and when to reach for more sophisticated methods. The key insights: BoW discards word order, treats every word as orthogonal to every other, explodes in dimensionality, and is blind to context; yet it remains a strong, cheap baseline wherever vocabulary distribution alone separates the classes.
Module Conclusion:
This module has taken you from the foundations of Bag of Words through term frequency, vocabulary construction, sparse representations, and finally to understanding BoW's limitations. You now have a working command of the full pipeline: the bag assumption and its mathematical formulation, term-frequency weighting, vocabulary construction choices, sparse matrix storage, and the failure modes catalogued on this page.
The subsequent modules on TF-IDF, n-grams, and preprocessing build directly on this foundation. The word embedding and transformer modules will show you the modern alternatives that address each limitation we've catalogued here.
Congratulations! You've completed Module 7: Text Feature Engineering - Basics. You now have comprehensive, world-class understanding of Bag of Words representation—from mathematical foundations through implementation to critical analysis of limitations. This knowledge forms the bedrock for all text ML work, whether you're using classical techniques or contextualizing why modern deep learning approaches were developed.