Consider two movie reviews:
"This film is not good."
"This film is good."
With unigram features alone, both documents contain nearly identical word sets—differing only by the presence of "not". A classifier must learn that "not" dramatically changes the meaning, but without context, it has no way to know that "not" modifies "good" rather than some other word.
Bigrams solve this problem by capturing consecutive word pairs. The first review generates the bigram "not good", which directly encodes the negation. The second generates "is good"—a distinctly positive signal. This simple extension from single tokens to pairs of tokens unlocks a new dimension of textual understanding.
Bigrams represent the first step beyond the bag-of-words assumption, introducing local context while maintaining the efficiency and interpretability that made unigrams powerful.
By the end of this page, you will be able to extract bigrams, explain the mathematical foundation of pairwise token representations, implement efficient bigram computation, analyze the vocabulary explosion problem, and apply strategies for managing bigram feature spaces in production systems.
A bigram is an ordered pair of two consecutive tokens in a sequence. Where unigrams treat each token independently, bigrams preserve the sequential relationship between adjacent tokens, capturing local syntactic and semantic patterns.
Formal Definition:
Given a text sequence T = [t₁, t₂, ..., tₖ], the set of bigrams B(T) is:
B(T) = {(t₁, t₂), (t₂, t₃), ..., (tₖ₋₁, tₖ)}
For a sequence of k tokens, there are exactly k-1 bigrams.
Example:
Consider the sentence: "The quick brown fox"
| Position | Token 1 | Token 2 | Bigram |
|---|---|---|---|
| 1-2 | the | quick | "the quick" |
| 2-3 | quick | brown | "quick brown" |
| 3-4 | brown | fox | "brown fox" |
Note that bigrams are ordered—("quick", "brown") is different from ("brown", "quick"). This ordering is precisely what enables bigrams to capture context that unigrams miss.
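The definition above can be sketched in a few lines. This is a minimal illustration (the helper name `bigrams` is ours, not from a library):

```python
# Minimal sketch: extract ordered bigrams from a token list.
def bigrams(tokens):
    # Pair each token with its successor; a k-token sequence yields k-1 pairs.
    return list(zip(tokens, tokens[1:]))

print(bigrams(["the", "quick", "brown", "fox"]))
# → [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```

Because `zip` stops at the shorter sequence, a single-token input correctly yields no bigrams.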
Bigrams can be understood through the lens of Markov chains. A bigram model makes the Markov assumption that the probability of each word depends only on the immediately preceding word: P(wₙ | w₁, w₂, ..., wₙ₋₁) ≈ P(wₙ | wₙ₋₁). This first-order Markov assumption is the foundation of bigram language models and informs why bigrams capture "local" but not "global" context.
Bigram Representation Formats:
Bigrams can be represented in several ways, each with tradeoffs:
- ("the", "quick") - an explicit tuple; easy to understand
- "the_quick" or "the quick" - a joined string; can reuse the same infrastructure as unigrams
- (0, 1) - integer IDs; compact, but requires a vocabulary mapping
- {"the": {"quick": count}} - a nested dictionary; efficient for sparse storage

The joined string format is most common in practice because it allows reusing unigram vectorization infrastructure—treating each bigram as a single "word" in a larger vocabulary.
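The four formats can be produced from the same token list, as in this small sketch (variable names are ours, for illustration):

```python
from collections import defaultdict

tokens = ["the", "quick", "brown"]
pairs = list(zip(tokens, tokens[1:]))            # tuple format: explicit
joined = [f"{a}_{b}" for a, b in pairs]          # joined-string format
vocab = {t: i for i, t in enumerate(dict.fromkeys(tokens))}
ids = [(vocab[a], vocab[b]) for a, b in pairs]   # integer-ID format
nested = defaultdict(dict)                       # nested-dict sparse storage
for a, b in pairs:
    nested[a][b] = nested[a].get(b, 0) + 1

print(joined)  # ['the_quick', 'quick_brown']
print(ids)     # [(0, 1), (1, 2)]
```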
Bigrams address several fundamental limitations of unigram representations. Understanding these improvements clarifies when bigrams add value and when the added complexity isn't justified.
1. Negation Handling
Unigrams treat "good" the same whether it appears as "very good" or "not good". Bigrams distinguish these cases:
| Text | Unigrams | Bigrams |
|---|---|---|
| "not good" | {not, good} | {not_good} |
| "very good" | {very, good} | {very_good} |
| "not bad" | {not, bad} | {not_bad} |
The bigram "not_good" becomes a distinct feature that the model can learn to associate with negative sentiment.
2. Phrase Detection
Many concepts are expressed as multi-word phrases where individual words have different meanings:
| Phrase | Individual Words | Phrase Meaning |
|---|---|---|
| "machine learning" | machine, learning | ML field |
| "hot dog" | hot, dog | food item |
| "New York" | new, york | city name |
| "kick the bucket" | kick, the, bucket | idiom for dying |
Bigrams like "machine_learning" or "hot_dog" capture these compound concepts as single features.
Bigrams often represent the optimal tradeoff between capturing context and maintaining manageable feature spaces. They capture most common phrases and modifier patterns while avoiding the vocabulary explosion of higher-order n-grams. Research consistently shows that unigrams+bigrams outperform either alone, with diminishing returns for trigrams and beyond in many tasks.
Understanding the mathematics of bigram representations enables rigorous reasoning about feature spaces, vocabulary sizes, and computational complexity.
Bigram Vocabulary Size:
For a unigram vocabulary V with |V| = m unique tokens, the theoretical maximum bigram vocabulary is:
|V_bigram| ≤ m²
However, the actual observed bigram vocabulary is much smaller, because syntax rules out most word pairs, word frequencies follow a Zipfian distribution (so most rare-rare combinations never co-occur), and any finite corpus samples only a fraction of the possible pairs.
In practice, observed bigram vocabulary size follows:
|V_bigram_observed| ≈ k × m^α
where α ≈ 1.3-1.5 for natural language corpora.
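To get a feel for the gap between the theoretical bound and the empirical estimate, consider a quick calculation (the values of k and α here are illustrative picks within the range quoted above, not measured constants):

```python
# Rough illustration: theoretical vs. empirically observed bigram vocabulary.
# k and alpha are illustrative values, assumed within the range quoted in the text.
m = 50_000           # unigram vocabulary size
alpha, k = 1.4, 1.0  # corpus-dependent constants (assumed)

theoretical = m ** 2
observed = k * m ** alpha
print(f"theoretical max: {theoretical:,.0f}")  # 2,500,000,000
print(f"observed (est.): {observed:,.0f}")     # a few million
```

The observed vocabulary is roughly three orders of magnitude smaller than the theoretical maximum, which is why bigram features remain practical at all.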
Probability Estimation:
Bigram probabilities are estimated using maximum likelihood:
P(wₙ | wₙ₋₁) = count(wₙ₋₁, wₙ) / count(wₙ₋₁)
This leads to the data sparsity problem: most valid bigrams are never observed in the training corpus, giving them probability 0 under MLE.
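The MLE estimate and its sparsity problem fit in a few lines. This toy sketch uses a six-token corpus (the helper name `p_mle` is ours):

```python
from collections import Counter

tokens = "the cat sat on the mat".split()
unigram = Counter(tokens)
bigram = Counter(zip(tokens, tokens[1:]))

def p_mle(w_prev, w):
    # P(w | w_prev) = count(w_prev, w) / count(w_prev)
    return bigram[(w_prev, w)] / unigram[w_prev]

print(p_mle("the", "cat"))  # 0.5  ("the" occurs twice, once followed by "cat")
print(p_mle("the", "sat"))  # 0.0  (perfectly valid English, but unseen → zero under MLE)
```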
| Aspect | Unigrams | Bigrams | Implication |
|---|---|---|---|
| Vocabulary Size | O(m) | O(m²) theoretical, O(m^1.4) observed | Memory usage increases significantly |
| Sparsity | High | Very High | Even sparser matrices require efficient storage |
| Feature Count per Doc | n | n − 1 | Nearly identical per-document feature counts; the difference is vocabulary size |
| Extraction Time | O(n) | O(n) | Both linear in document length |
| Interpretability | High | High | Both human-readable |
Smoothing for Probability Estimation:
To handle zero-probability bigrams, several smoothing techniques exist:
Laplace (Add-1) Smoothing:
P(wₙ | wₙ₋₁) = (count(wₙ₋₁, wₙ) + 1) / (count(wₙ₋₁) + |V|)
Add-k Smoothing:
P(wₙ | wₙ₋₁) = (count(wₙ₋₁, wₙ) + k) / (count(wₙ₋₁) + k×|V|)
Interpolation (Jelinek-Mercer):
P(wₙ | wₙ₋₁) = λ × P_bigram(wₙ | wₙ₋₁) + (1-λ) × P_unigram(wₙ)
Kneser-Ney Smoothing: More sophisticated approach using continuation probability—the probability that a word appears as a novel continuation.
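Two of the formulas above can be sketched directly against toy counts (helper names `p_laplace` and `p_interp` are ours; the corpus is a six-token example, not real data):

```python
from collections import Counter

tokens = "the cat sat on the mat".split()
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
V = len(uni)     # vocabulary size (5 unique tokens here)
N = len(tokens)  # total tokens

def p_laplace(w_prev, w):
    # Add-1 smoothing: every possible bigram gets a pseudo-count of 1.
    return (bi[(w_prev, w)] + 1) / (uni[w_prev] + V)

def p_interp(w_prev, w, lam=0.7):
    # Jelinek-Mercer: mix the bigram MLE with the unigram probability.
    p_bi = bi[(w_prev, w)] / uni[w_prev] if uni[w_prev] else 0.0
    return lam * p_bi + (1 - lam) * uni[w] / N

print(p_laplace("the", "sat"))  # nonzero (1/7) even though the bigram is unseen
print(p_interp("the", "sat"))   # nonzero via the unigram back-off term
```

Both estimators assign nonzero probability to the unseen bigram "the sat", which is exactly the behavior MLE lacks.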
For feature engineering (rather than language modeling), smoothing is less critical since we typically use occurrence counts rather than probabilities.
Efficient bigram extraction requires careful attention to memory and computation. Let's explore both from-scratch implementations and production-ready library usage.
```python
from typing import List, Dict, Tuple
from collections import Counter

from scipy.sparse import csr_matrix


class BigramExtractor:
    """Comprehensive bigram extraction with multiple strategies."""

    @staticmethod
    def extract_bigrams(tokens: List[str], separator: str = "_") -> List[str]:
        """Extract bigrams as joined strings.

        Time Complexity: O(n) where n = number of tokens
        Space Complexity: O(n) for the output list

        Args:
            tokens: List of tokens
            separator: Character(s) used to join bigram components

        Returns:
            List of bigram strings
        """
        if len(tokens) < 2:
            return []
        return [
            f"{tokens[i]}{separator}{tokens[i + 1]}"
            for i in range(len(tokens) - 1)
        ]

    @staticmethod
    def extract_bigrams_with_position(
        tokens: List[str]
    ) -> List[Tuple[str, str, int]]:
        """Extract bigrams with position information.

        Useful for analysis and debugging.

        Returns:
            List of (token1, token2, position) tuples
        """
        return [(tokens[i], tokens[i + 1], i) for i in range(len(tokens) - 1)]

    @staticmethod
    def extract_skip_bigrams(
        tokens: List[str],
        max_skip: int = 2,
        separator: str = "_"
    ) -> List[str]:
        """Extract bigrams with skips (non-consecutive pairs).

        Skip-grams can capture patterns like "not ... good" where the
        two words are separated by intervening tokens.

        Args:
            tokens: List of tokens
            max_skip: Maximum number of tokens to skip between the pair
            separator: Join character

        Returns:
            List of skip-bigram strings
        """
        bigrams = []
        n = len(tokens)
        for i in range(n):
            # Regular bigram (skip=0) plus skip-bigrams up to max_skip
            for skip in range(max_skip + 1):
                j = i + 1 + skip
                if j < n:
                    bigrams.append(f"{tokens[i]}{separator}{tokens[j]}")
        return bigrams


class BigramVectorizer:
    """Complete bigram vectorizer with vocabulary management."""

    def __init__(
        self,
        max_vocab_size: int = None,
        min_df: int = 1,
        include_unigrams: bool = True,
        separator: str = "_"
    ):
        self.max_vocab_size = max_vocab_size
        self.min_df = min_df
        self.include_unigrams = include_unigrams
        self.separator = separator
        self.vocabulary_: Dict[str, int] = {}
        self.doc_freq_: Dict[str, int] = {}

    def _tokenize(self, text: str) -> List[str]:
        """Simple whitespace tokenization."""
        return text.lower().split()

    def _extract_features(self, tokens: List[str]) -> List[str]:
        """Extract all features (unigrams and/or bigrams)."""
        features = []
        if self.include_unigrams:
            features.extend(tokens)
        # Add bigrams
        features.extend(
            BigramExtractor.extract_bigrams(tokens, self.separator)
        )
        return features

    def fit(self, documents: List[str]) -> 'BigramVectorizer':
        """Build the vocabulary from a corpus."""
        doc_freq = Counter()
        term_freq = Counter()

        for doc in documents:
            tokens = self._tokenize(doc)
            features = self._extract_features(tokens)
            # Document frequencies count each feature once per document
            doc_freq.update(set(features))
            # Term frequencies count every occurrence
            term_freq.update(features)

        # Filter by minimum document frequency
        candidates = [
            term for term, freq in doc_freq.items() if freq >= self.min_df
        ]

        # Sort by corpus frequency, most frequent first
        sorted_terms = sorted(
            candidates, key=lambda t: term_freq[t], reverse=True
        )

        # Limit vocabulary size
        if self.max_vocab_size:
            sorted_terms = sorted_terms[:self.max_vocab_size]

        # Build vocabulary mapping
        self.vocabulary_ = {term: idx for idx, term in enumerate(sorted_terms)}
        self.doc_freq_ = {term: doc_freq[term] for term in sorted_terms}
        return self

    def transform(self, documents: List[str]) -> csr_matrix:
        """Transform documents to sparse count vectors."""
        rows, cols, data = [], [], []
        for doc_idx, doc in enumerate(documents):
            tokens = self._tokenize(doc)
            features = self._extract_features(tokens)
            # Count features in this document
            for feature, count in Counter(features).items():
                if feature in self.vocabulary_:
                    rows.append(doc_idx)
                    cols.append(self.vocabulary_[feature])
                    data.append(count)
        return csr_matrix(
            (data, (rows, cols)),
            shape=(len(documents), len(self.vocabulary_))
        )

    def analyze_vocabulary_composition(self) -> Dict:
        """Analyze the unigram/bigram composition of the vocabulary."""
        unigram_count = 0
        bigram_count = 0
        for term in self.vocabulary_:
            if self.separator in term:
                bigram_count += 1
            else:
                unigram_count += 1
        total = len(self.vocabulary_)
        return {
            'total_vocabulary': total,
            'unigram_count': unigram_count,
            'bigram_count': bigram_count,
            'unigram_ratio': unigram_count / total if total else 0,
            'bigram_ratio': bigram_count / total if total else 0,
        }


def demonstrate_bigram_extraction():
    """Demonstrate bigram extraction and analysis."""
    # Sample corpus with sentiment patterns
    corpus = [
        "This movie was not good at all",
        "I really loved this amazing film",
        "The acting was terrible and boring",
        "What a wonderful and beautiful story",
        "Not impressed by this disappointing movie",
        "Absolutely fantastic performance by the cast",
        "The worst film I have ever seen",
        "A truly remarkable cinematic experience",
    ]
    labels = [0, 1, 0, 1, 0, 1, 0, 1]  # 0 = negative, 1 = positive

    # Compare unigram-only vs. unigram+bigram vocabularies
    print("=" * 60)
    print("VOCABULARY ANALYSIS")
    print("=" * 60)

    # Unigram only: override feature extraction to skip bigrams
    unigram_vec = BigramVectorizer(include_unigrams=True, min_df=1)
    unigram_vec._extract_features = lambda tokens: tokens
    unigram_vec.fit(corpus)
    print(f"Unigram-only vocabulary size: {len(unigram_vec.vocabulary_)}")

    # Unigram + bigram
    bigram_vec = BigramVectorizer(include_unigrams=True, min_df=1)
    bigram_vec.fit(corpus)
    composition = bigram_vec.analyze_vocabulary_composition()
    print("Unigram+Bigram vocabulary:")
    for key, value in composition.items():
        if isinstance(value, float):
            print(f"  {key}: {value:.2%}")
        else:
            print(f"  {key}: {value}")

    # Show sentiment-relevant bigrams
    print("\n" + "=" * 60)
    print("SENTIMENT-RELEVANT BIGRAMS")
    print("=" * 60)
    for doc, label in zip(corpus, labels):
        tokens = doc.lower().split()
        bigrams = BigramExtractor.extract_bigrams(tokens)
        sentiment = "Positive" if label == 1 else "Negative"
        print(f"[{sentiment}] '{doc}'")
        print(f"  Bigrams: {bigrams}")

    # Skip-bigram demonstration
    print("\n" + "=" * 60)
    print("SKIP-BIGRAM DEMONSTRATION")
    print("=" * 60)
    text = "not at all good"
    tokens = text.split()
    print(f"Text: '{text}'")
    print(f"Regular bigrams: {BigramExtractor.extract_bigrams(tokens)}")
    print(f"Skip-bigrams (k=2): {BigramExtractor.extract_skip_bigrams(tokens, max_skip=2)}")
    # Note: "not_good" appears in the skip-bigrams, capturing negation despite the gap


if __name__ == "__main__":
    demonstrate_bigram_extraction()
```

scikit-learn's CountVectorizer natively supports n-gram extraction through the ngram_range parameter. Understanding its configuration is essential for production deployments.
```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
import numpy as np


def comprehensive_bigram_example():
    """Complete bigram implementation with scikit-learn."""
    # Sentiment analysis dataset
    texts = [
        "This movie was not good at all",
        "I really loved this amazing film incredible",
        "The acting was terrible and boring throughout",
        "What a wonderful and beautiful story telling",
        "Not impressed by this disappointing movie waste",
        "Absolutely fantastic performance by the cast brilliant",
        "The worst film I have ever seen awful",
        "A truly remarkable cinematic experience outstanding",
        "This was not bad actually quite enjoyable",
        "The plot was confusing and the pacing slow",
    ]
    labels = [0, 1, 0, 1, 0, 1, 0, 1, 1, 0]

    # Configuration options for n-gram ranges
    configurations = [
        ('Unigrams only', (1, 1)),
        ('Bigrams only', (2, 2)),
        ('Unigrams + Bigrams', (1, 2)),
    ]

    print("=" * 60)
    print("N-GRAM RANGE COMPARISON")
    print("=" * 60)

    for name, ngram_range in configurations:
        vectorizer = CountVectorizer(
            ngram_range=ngram_range,
            min_df=1,
            lowercase=True,
        )
        X = vectorizer.fit_transform(texts)
        vocab_size = len(vectorizer.vocabulary_)

        # Show sample features
        feature_names = vectorizer.get_feature_names_out()
        print(f"{name}:")
        print(f"  Vocabulary size: {vocab_size}")
        print(f"  Matrix shape: {X.shape}")
        print(f"  Sample features: {list(feature_names[:10])}")

        # If bigrams are included, show some bigram features
        if ngram_range[1] >= 2:
            bigram_features = [f for f in feature_names if ' ' in f]
            print(f"  Sample bigrams: {bigram_features[:10]}")

    # Build classification pipelines
    print("\n" + "=" * 60)
    print("CLASSIFICATION COMPARISON")
    print("=" * 60)

    for name, ngram_range in configurations:
        pipeline = Pipeline([
            ('vectorizer', CountVectorizer(
                ngram_range=ngram_range,
                min_df=1,
                lowercase=True,
                stop_words='english',
            )),
            ('classifier', LogisticRegression(random_state=42, max_iter=1000)),
        ])

        # Training-set accuracy only: the dataset is too small for
        # meaningful cross-validation
        pipeline.fit(texts, labels)
        train_accuracy = pipeline.score(texts, labels)
        print(f"{name}:")
        print(f"  Training accuracy: {train_accuracy:.2%}")

        # Analyze feature importance for unigram+bigram
        if ngram_range == (1, 2):
            feature_names = pipeline.named_steps['vectorizer'].get_feature_names_out()
            coefficients = pipeline.named_steps['classifier'].coef_[0]
            # Top positive and negative features
            indices = np.argsort(coefficients)
            print("  Top negative indicators:")
            for idx in indices[:5]:
                print(f"    '{feature_names[idx]}': {coefficients[idx]:.3f}")
            print("  Top positive indicators:")
            for idx in indices[-5:]:
                print(f"    '{feature_names[idx]}': {coefficients[idx]:.3f}")


def vocabulary_explosion_analysis():
    """Demonstrate vocabulary growth with n-gram order."""
    print("\n" + "=" * 60)
    print("VOCABULARY EXPLOSION ANALYSIS")
    print("=" * 60)

    # Generate a larger corpus for more realistic statistics
    base_texts = [
        "machine learning is transforming how we build software",
        "deep learning neural networks are powerful models",
        "natural language processing enables text understanding",
        "computer vision systems can analyze images effectively",
        "reinforcement learning agents learn from rewards",
    ] * 20

    results = []
    for n in range(1, 5):
        vectorizer = CountVectorizer(ngram_range=(1, n), min_df=1)
        vectorizer.fit(base_texts)
        vocab_size = len(vectorizer.vocabulary_)

        # Count n-grams by order (order = number of spaces + 1)
        feature_names = vectorizer.get_feature_names_out()
        ngram_counts = {i: 0 for i in range(1, n + 1)}
        for feature in feature_names:
            order = feature.count(' ') + 1
            if order <= n:
                ngram_counts[order] += 1

        results.append({
            'max_n': n,
            'total_vocab': vocab_size,
            'breakdown': ngram_counts,
        })
        print(f"Up to {n}-grams:")
        print(f"  Total vocabulary: {vocab_size}")
        print(f"  Breakdown: {ngram_counts}")

    # Show growth rate
    print("Vocabulary Growth:")
    for i in range(1, len(results)):
        prev = results[i - 1]['total_vocab']
        curr = results[i]['total_vocab']
        growth = (curr - prev) / prev * 100
        print(f"  {results[i-1]['max_n']}-gram → {results[i]['max_n']}-gram: +{growth:.1f}%")


def tfidf_with_bigrams():
    """Demonstrate TF-IDF weighted bigrams."""
    print("\n" + "=" * 60)
    print("TF-IDF WITH BIGRAMS")
    print("=" * 60)

    documents = [
        "machine learning is a subset of artificial intelligence",
        "deep learning uses neural networks with many layers",
        "natural language processing handles text and speech",
        "machine learning algorithms learn patterns from data",
        "artificial intelligence includes machine learning and more",
    ]

    # TF-IDF with unigrams and bigrams
    tfidf = TfidfVectorizer(
        ngram_range=(1, 2),
        min_df=1,
        use_idf=True,
        smooth_idf=True,
        sublinear_tf=True,  # Use 1 + log(tf)
    )
    X = tfidf.fit_transform(documents)
    feature_names = tfidf.get_feature_names_out()

    # Show the top-weighted terms for each document
    for doc_idx, doc in enumerate(documents):
        print(f"Document {doc_idx + 1}: '{doc[:50]}...'")
        # Get non-zero features for this document
        doc_vector = X[doc_idx].toarray()[0]
        nonzero_indices = np.where(doc_vector > 0)[0]
        # Sort by TF-IDF score, highest first
        top = sorted(
            nonzero_indices, key=lambda i: doc_vector[i], reverse=True
        )[:5]
        print("  Top TF-IDF features:")
        for idx in top:
            print(f"    '{feature_names[idx]}': {doc_vector[idx]:.3f}")


if __name__ == "__main__":
    comprehensive_bigram_example()
    vocabulary_explosion_analysis()
    tfidf_with_bigrams()
```

The most significant challenge with bigrams is vocabulary explosion: the combinatorial growth of the feature space as we consider word pairs.
Theoretical vs. Practical Growth:
For a unigram vocabulary of size V, the theoretical bigram space is V², while the observed vocabulary grows roughly as V^1.4 for natural language (per the empirical formula above); both far outpace unigram growth.
This explosion creates several challenges: memory consumption for the vocabulary and feature matrices, slower training and inference over much wider feature vectors, and a long tail of rare bigrams that invites overfitting.
Mitigation Strategies:
1. Aggressive Frequency Filtering: apply higher min_df thresholds to bigrams than to unigrams, discarding pairs that appear in only a handful of documents.
2. Maximum Vocabulary Size: cap the total feature count, keeping only the most frequent terms.
3. Chi-Square Feature Selection: keep the bigrams most strongly associated with the class labels.
4. Feature Hashing: hash features into a fixed-size index space, so memory is bounded regardless of vocabulary growth.
5. Hierarchical Filtering: filter unigrams first, then admit only bigrams whose component words survived.
6. Mutual Information Filtering: keep bigrams whose components co-occur more often than chance predicts.
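The feature-hashing strategy can be sketched from scratch in a few lines (the helper name `hashed_counts` and the bucket count are ours, for illustration only):

```python
import hashlib

# From-scratch sketch of the hashing trick: each feature string is hashed
# into one of n_buckets slots, so memory stays fixed no matter how many
# distinct bigrams the corpus generates (at the cost of rare collisions).
def hashed_counts(tokens, n_buckets=1024):
    vec = [0] * n_buckets
    features = tokens + [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    for feat in features:
        # md5 gives a hash that is stable across runs, unlike Python's hash()
        bucket = int(hashlib.md5(feat.encode()).hexdigest(), 16) % n_buckets
        vec[bucket] += 1
    return vec

v = hashed_counts("not good at all".split())
print(sum(v))  # 7 = 4 unigrams + 3 bigrams
```

In production, scikit-learn's HashingVectorizer implements the same idea at scale and accepts an ngram_range parameter, trading the ability to inspect feature names for bounded memory.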
For most text classification tasks, limit your total vocabulary (unigrams + bigrams) to 50,000-100,000 features. Use min_df=5 or higher for bigrams. Monitor the unigram/bigram ratio—if bigrams dominate (>80%), you may be overfitting to rare patterns. A typical healthy ratio is 60-70% unigrams, 30-40% bigrams.
Not all bigrams are equally useful. Some represent genuine linguistic collocations ("hot dog", "New York"), while others are accidental adjacencies ("the the", "is a"). Identifying quality bigrams improves feature engineering.
Pointwise Mutual Information (PMI):
PMI measures how much more likely two words are to co-occur than expected by chance:
PMI(w₁, w₂) = log₂[P(w₁, w₂) / (P(w₁) × P(w₂))]
Limitations of PMI:
PMI favors rare bigrams: if two rare words co-occur even once, their PMI can be artificially high. Solutions include enforcing a minimum count threshold before scoring, normalizing PMI (NPMI) so scores are bounded, and weighting scores by frequency so that well-attested collocations outrank one-off coincidences.
```python
import math
from collections import Counter
from typing import List, Dict, Tuple


class BigramQualityAnalyzer:
    """Analyze bigram quality using statistical measures."""

    def __init__(self, min_count: int = 5):
        self.min_count = min_count
        self.unigram_counts: Counter = Counter()
        self.bigram_counts: Counter = Counter()
        self.total_tokens: int = 0
        self.total_bigrams: int = 0

    def fit(self, tokenized_documents: List[List[str]]) -> 'BigramQualityAnalyzer':
        """Count unigrams and bigrams over the corpus."""
        for tokens in tokenized_documents:
            self.unigram_counts.update(tokens)
            self.total_tokens += len(tokens)
            for i in range(len(tokens) - 1):
                self.bigram_counts[(tokens[i], tokens[i + 1])] += 1
                self.total_bigrams += 1
        return self

    def compute_pmi(self, w1: str, w2: str) -> float:
        """Pointwise Mutual Information:

        PMI(w1, w2) = log2[P(w1, w2) / (P(w1) * P(w2))]
        """
        bigram = (w1, w2)
        if self.bigram_counts[bigram] < self.min_count:
            return float('-inf')
        p_w1 = self.unigram_counts[w1] / self.total_tokens
        p_w2 = self.unigram_counts[w2] / self.total_tokens
        p_bigram = self.bigram_counts[bigram] / self.total_bigrams
        if p_w1 == 0 or p_w2 == 0 or p_bigram == 0:
            return float('-inf')
        return math.log2(p_bigram / (p_w1 * p_w2))

    def compute_npmi(self, w1: str, w2: str) -> float:
        """Normalized PMI, bounded to [-1, 1]:

        NPMI = PMI / -log2(P(w1, w2))
        """
        bigram = (w1, w2)
        if self.bigram_counts[bigram] < self.min_count:
            return float('-inf')
        pmi = self.compute_pmi(w1, w2)
        if pmi == float('-inf'):
            return float('-inf')
        p_bigram = self.bigram_counts[bigram] / self.total_bigrams
        if p_bigram == 0:
            return float('-inf')
        return pmi / (-math.log2(p_bigram))

    def compute_ppmi(self, w1: str, w2: str) -> float:
        """Positive PMI: negative values are clipped to 0."""
        pmi = self.compute_pmi(w1, w2)
        return max(pmi, 0) if pmi != float('-inf') else 0

    def get_top_bigrams_by_pmi(
        self,
        n: int = 20,
        metric: str = 'npmi'
    ) -> List[Tuple[Tuple[str, str], float]]:
        """Get the top bigrams ranked by the specified metric."""
        results = []
        for bigram, count in self.bigram_counts.items():
            if count < self.min_count:
                continue
            w1, w2 = bigram
            if metric == 'pmi':
                score = self.compute_pmi(w1, w2)
            elif metric == 'npmi':
                score = self.compute_npmi(w1, w2)
            elif metric == 'ppmi':
                score = self.compute_ppmi(w1, w2)
            elif metric == 'frequency':
                score = count
            else:
                raise ValueError(f"Unknown metric: {metric}")
            if score != float('-inf'):
                results.append((bigram, score))
        # Sort by score, descending
        results.sort(key=lambda x: x[1], reverse=True)
        return results[:n]

    def compare_metrics(self, bigram: Tuple[str, str]) -> Dict[str, float]:
        """Compare all metrics for a single bigram."""
        w1, w2 = bigram
        return {
            'count': self.bigram_counts[bigram],
            'pmi': self.compute_pmi(w1, w2),
            'npmi': self.compute_npmi(w1, w2),
            'ppmi': self.compute_ppmi(w1, w2),
        }


def demonstrate_bigram_quality():
    """Demonstrate bigram quality assessment."""
    # Simulated corpus with clear collocation patterns
    corpus = [
        "machine learning is transforming the technology industry today",
        "deep learning and neural networks are powerful machine learning techniques",
        "natural language processing uses machine learning for text analysis",
        "new york city is a major technology hub for machine learning",
        "los angeles has many artificial intelligence startups",
        "machine learning applications include computer vision and nlp",
        "the technology industry uses deep learning for many applications",
        "neural networks are the foundation of deep learning systems",
        "artificial intelligence and machine learning are related fields",
        "computer vision uses convolutional neural networks effectively",
    ] * 10  # Repeat for better statistics

    # Tokenize and analyze
    tokenized = [doc.lower().split() for doc in corpus]
    analyzer = BigramQualityAnalyzer(min_count=3)
    analyzer.fit(tokenized)

    print("=" * 60)
    print("BIGRAM QUALITY ANALYSIS")
    print("=" * 60)

    # Compare metrics for specific bigrams
    test_bigrams = [
        ("machine", "learning"),
        ("deep", "learning"),
        ("new", "york"),
        ("the", "technology"),
        ("is", "a"),
        ("neural", "networks"),
    ]

    print("Metric Comparison for Selected Bigrams:")
    print("-" * 50)
    print(f"{'Bigram':<25} {'Count':<8} {'PMI':<8} {'NPMI':<8}")
    print("-" * 50)
    for bigram in test_bigrams:
        if bigram in analyzer.bigram_counts:
            metrics = analyzer.compare_metrics(bigram)
            print(f"{str(bigram):<25} {metrics['count']:<8} "
                  f"{metrics['pmi']:<8.2f} {metrics['npmi']:<8.3f}")

    # Top bigrams by different metrics
    print("\n" + "=" * 60)
    print("TOP BIGRAMS BY METRIC")
    print("=" * 60)
    for metric in ['frequency', 'npmi']:
        print(f"Top 10 by {metric.upper()}:")
        top_bigrams = analyzer.get_top_bigrams_by_pmi(n=10, metric=metric)
        for i, (bigram, score) in enumerate(top_bigrams, 1):
            print(f"  {i}. {bigram}: {score:.3f}")


if __name__ == "__main__":
    demonstrate_bigram_quality()
```

Bigrams represent a fundamental leap from bag-of-words to sequence-aware text representation.
We've explored their theory, implementation, and practical considerations: the Markov view of local context, efficient extraction and vectorization, strategies for containing vocabulary explosion, and PMI-based measures of bigram quality.
Looking Ahead: General N-grams
Bigrams are just the beginning of sequence-aware feature engineering. In the next page, we'll generalize to n-grams of arbitrary length, exploring when trigrams, 4-grams, and higher orders add value—and when they don't. We'll also examine the mathematical framework that unifies all n-gram representations.
The journey from unigrams through bigrams to general n-grams represents increasing capture of sequential context, each step trading off expressiveness against computational cost.
You now understand bigrams as ordered word pairs that capture local context. You can implement bigram extraction, manage vocabulary explosion, assess bigram quality using PMI/NPMI, and make informed decisions about when bigrams add value to your NLP pipeline. Next, we'll generalize to n-grams of arbitrary length.