Consider two movie reviews:
"This film is not good."
"This film is good."
With unigram features alone, both documents contain nearly identical word sets—differing only by the presence of "not". A classifier must learn that "not" dramatically changes the meaning, but without context, it has no way to know that "not" modifies "good" rather than some other word.
Bigrams solve this problem by capturing consecutive word pairs. The first review generates the bigram "not good", which directly encodes the negation. The second generates "is good"—a distinctly positive signal. This simple extension from single tokens to pairs of tokens unlocks a new dimension of textual understanding.
Bigrams represent the first step beyond the bag-of-words assumption, introducing local context while maintaining the efficiency and interpretability that made unigrams powerful.
By the end of this page, you will be able to extract bigrams, explain the mathematical foundation of pairwise token representations, implement efficient bigram computation, analyze the vocabulary explosion problem, and apply strategies for managing bigram feature spaces in production systems.
A bigram is an ordered pair of two consecutive tokens in a sequence. Where unigrams treat each token independently, bigrams preserve the sequential relationship between adjacent tokens, capturing local syntactic and semantic patterns.
Formal Definition:
Given a text sequence T = [t₁, t₂, ..., tₖ], the set of bigrams B(T) is:
B(T) = {(t₁, t₂), (t₂, t₃), ..., (tₖ₋₁, tₖ)}
For a sequence of k tokens, there are exactly k-1 bigrams.
Example:
Consider the sentence: "The quick brown fox"
| Position | Token 1 | Token 2 | Bigram |
|---|---|---|---|
| 1-2 | the | quick | "the quick" |
| 2-3 | quick | brown | "quick brown" |
| 3-4 | brown | fox | "brown fox" |
Note that bigrams are ordered—("quick", "brown") is different from ("brown", "quick"). This ordering is precisely what enables bigrams to capture context that unigrams miss.
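The definition above can be sketched in a few lines. This is a minimal illustration (the helper name `bigrams` is ours, not from a library):

```python
# Minimal sketch: extract ordered bigrams from a token list.
def bigrams(tokens):
    # Pair each token with its successor; a k-token sequence yields k-1 pairs.
    return list(zip(tokens, tokens[1:]))

print(bigrams(["the", "quick", "brown", "fox"]))
# → [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```

Because `zip` stops at the shorter sequence, a single-token input correctly yields no bigrams.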
Bigrams can be understood through the lens of Markov chains. A bigram model makes the Markov assumption that the probability of each word depends only on the immediately preceding word: P(wₙ | w₁, w₂, ..., wₙ₋₁) ≈ P(wₙ | wₙ₋₁). This first-order Markov assumption is the foundation of bigram language models and informs why bigrams capture "local" but not "global" context.
Bigram Representation Formats:
Bigrams can be represented in several ways, each with tradeoffs:
- ("the", "quick") - an explicit tuple; easy to understand
- "the_quick" or "the quick" - a joined string; can reuse the same infrastructure as unigrams
- (0, 1) - integer IDs; compact, but requires a vocabulary mapping
- {"the": {"quick": count}} - a nested dictionary; efficient for sparse storage

The joined string format is most common in practice because it allows reusing unigram vectorization infrastructure—treating each bigram as a single "word" in a larger vocabulary.
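The four formats can be produced from the same token list, as in this small sketch (variable names are ours, for illustration):

```python
from collections import defaultdict

tokens = ["the", "quick", "brown"]
pairs = list(zip(tokens, tokens[1:]))            # tuple format: explicit
joined = [f"{a}_{b}" for a, b in pairs]          # joined-string format
vocab = {t: i for i, t in enumerate(dict.fromkeys(tokens))}
ids = [(vocab[a], vocab[b]) for a, b in pairs]   # integer-ID format
nested = defaultdict(dict)                       # nested-dict sparse storage
for a, b in pairs:
    nested[a][b] = nested[a].get(b, 0) + 1

print(joined)  # ['the_quick', 'quick_brown']
print(ids)     # [(0, 1), (1, 2)]
```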
Bigrams address several fundamental limitations of unigram representations. Understanding these improvements clarifies when bigrams add value and when the added complexity isn't justified.
1. Negation Handling
Unigrams treat "good" the same whether it appears as "very good" or "not good". Bigrams distinguish these cases:
| Text | Unigrams | Bigrams |
|---|---|---|
| "not good" | {not, good} | {not_good} |
| "very good" | {very, good} | {very_good} |
| "not bad" | {not, bad} | {not_bad} |
The bigram "not_good" becomes a distinct feature that the model can learn to associate with negative sentiment.
2. Phrase Detection
Many concepts are expressed as multi-word phrases where individual words have different meanings:
| Phrase | Individual Words | Phrase Meaning |
|---|---|---|
| "machine learning" | machine, learning | ML field |
| "hot dog" | hot, dog | food item |
| "New York" | new, york | city name |
| "kick the bucket" | kick, the, bucket | idiom for dying |
Bigrams like "machine_learning" or "hot_dog" capture these compound concepts as single features.
Bigrams often represent the optimal tradeoff between capturing context and maintaining manageable feature spaces. They capture most common phrases and modifier patterns while avoiding the vocabulary explosion of higher-order n-grams. Research consistently shows that unigrams+bigrams outperform either alone, with diminishing returns for trigrams and beyond in many tasks.
Understanding the mathematics of bigram representations enables rigorous reasoning about feature spaces, vocabulary sizes, and computational complexity.
Bigram Vocabulary Size:
For a unigram vocabulary V with |V| = m unique tokens, the theoretical maximum bigram vocabulary is:
|V_bigram| ≤ m²
However, the actual observed bigram vocabulary is much smaller, because syntax rules out most word pairs, word frequencies follow a Zipfian distribution (so most rare-rare combinations never co-occur), and any finite corpus samples only a fraction of the possible pairs.
In practice, observed bigram vocabulary size follows:
|V_bigram_observed| ≈ k × m^α
where α ≈ 1.3-1.5 for natural language corpora.
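To get a feel for the gap between the theoretical bound and the empirical estimate, consider a quick calculation (the values of k and α here are illustrative picks within the range quoted above, not measured constants):

```python
# Rough illustration: theoretical vs. empirically observed bigram vocabulary.
# k and alpha are illustrative values, assumed within the range quoted in the text.
m = 50_000           # unigram vocabulary size
alpha, k = 1.4, 1.0  # corpus-dependent constants (assumed)

theoretical = m ** 2
observed = k * m ** alpha
print(f"theoretical max: {theoretical:,.0f}")  # 2,500,000,000
print(f"observed (est.): {observed:,.0f}")     # a few million
```

The observed vocabulary is roughly three orders of magnitude smaller than the theoretical maximum, which is why bigram features remain practical at all.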
Probability Estimation:
Bigram probabilities are estimated using maximum likelihood:
P(wₙ | wₙ₋₁) = count(wₙ₋₁, wₙ) / count(wₙ₋₁)
This leads to the data sparsity problem: most valid bigrams are never observed in the training corpus, giving them probability 0 under MLE.
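The MLE estimate and its sparsity problem fit in a few lines. This toy sketch uses a six-token corpus (the helper name `p_mle` is ours):

```python
from collections import Counter

tokens = "the cat sat on the mat".split()
unigram = Counter(tokens)
bigram = Counter(zip(tokens, tokens[1:]))

def p_mle(w_prev, w):
    # P(w | w_prev) = count(w_prev, w) / count(w_prev)
    return bigram[(w_prev, w)] / unigram[w_prev]

print(p_mle("the", "cat"))  # 0.5  ("the" occurs twice, once followed by "cat")
print(p_mle("the", "sat"))  # 0.0  (perfectly valid English, but unseen → zero under MLE)
```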
| Aspect | Unigrams | Bigrams | Implication |
|---|---|---|---|
| Vocabulary Size | O(m) | O(m²) theoretical, O(m^1.4) observed | Memory usage increases significantly |
| Sparsity | High | Very High | Even sparser matrices require efficient storage |
| Feature Count per Doc | n | n − 1 | Nearly identical per-document feature counts; the difference is vocabulary size |
| Extraction Time | O(n) | O(n) | Both linear in document length |
| Interpretability | High | High | Both human-readable |
Smoothing for Probability Estimation:
To handle zero-probability bigrams, several smoothing techniques exist:
Laplace (Add-1) Smoothing:
P(wₙ | wₙ₋₁) = (count(wₙ₋₁, wₙ) + 1) / (count(wₙ₋₁) + |V|)
Add-k Smoothing:
P(wₙ | wₙ₋₁) = (count(wₙ₋₁, wₙ) + k) / (count(wₙ₋₁) + k×|V|)
Interpolation (Jelinek-Mercer):
P(wₙ | wₙ₋₁) = λ × P_bigram(wₙ | wₙ₋₁) + (1-λ) × P_unigram(wₙ)
Kneser-Ney Smoothing: More sophisticated approach using continuation probability—the probability that a word appears as a novel continuation.
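Two of the formulas above can be sketched directly against toy counts (helper names `p_laplace` and `p_interp` are ours; the corpus is a six-token example, not real data):

```python
from collections import Counter

tokens = "the cat sat on the mat".split()
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
V = len(uni)     # vocabulary size (5 unique tokens here)
N = len(tokens)  # total tokens

def p_laplace(w_prev, w):
    # Add-1 smoothing: every possible bigram gets a pseudo-count of 1.
    return (bi[(w_prev, w)] + 1) / (uni[w_prev] + V)

def p_interp(w_prev, w, lam=0.7):
    # Jelinek-Mercer: mix the bigram MLE with the unigram probability.
    p_bi = bi[(w_prev, w)] / uni[w_prev] if uni[w_prev] else 0.0
    return lam * p_bi + (1 - lam) * uni[w] / N

print(p_laplace("the", "sat"))  # nonzero (1/7) even though the bigram is unseen
print(p_interp("the", "sat"))   # nonzero via the unigram back-off term
```

Both estimators assign nonzero probability to the unseen bigram "the sat", which is exactly the behavior MLE lacks.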
For feature engineering (rather than language modeling), smoothing is less critical since we typically use occurrence counts rather than probabilities.
Efficient bigram extraction requires careful attention to memory and computation. Let's explore both from-scratch implementations and production-ready library usage.
```python
from typing import List, Dict, Tuple
from collections import Counter

from scipy.sparse import csr_matrix


class BigramExtractor:
    """Comprehensive bigram extraction with multiple strategies."""

    @staticmethod
    def extract_bigrams(tokens: List[str], separator: str = "_") -> List[str]:
        """Extract bigrams as joined strings.

        Time Complexity: O(n) where n = number of tokens
        Space Complexity: O(n) for the output list

        Args:
            tokens: List of tokens
            separator: Character(s) used to join bigram components

        Returns:
            List of bigram strings
        """
        if len(tokens) < 2:
            return []
        return [
            f"{tokens[i]}{separator}{tokens[i + 1]}"
            for i in range(len(tokens) - 1)
        ]

    @staticmethod
    def extract_bigrams_with_position(
        tokens: List[str]
    ) -> List[Tuple[str, str, int]]:
        """Extract bigrams with position information.

        Useful for analysis and debugging.

        Returns:
            List of (token1, token2, position) tuples
        """
        return [(tokens[i], tokens[i + 1], i) for i in range(len(tokens) - 1)]

    @staticmethod
    def extract_skip_bigrams(
        tokens: List[str],
        max_skip: int = 2,
        separator: str = "_"
    ) -> List[str]:
        """Extract bigrams with skips (non-consecutive pairs).

        Skip-grams can capture patterns like "not ... good" where the
        two words are separated by intervening tokens.

        Args:
            tokens: List of tokens
            max_skip: Maximum number of tokens to skip between the pair
            separator: Join character

        Returns:
            List of skip-bigram strings
        """
        bigrams = []
        n = len(tokens)
        for i in range(n):
            # Regular bigram (skip=0) plus skip-bigrams up to max_skip
            for skip in range(max_skip + 1):
                j = i + 1 + skip
                if j < n:
                    bigrams.append(f"{tokens[i]}{separator}{tokens[j]}")
        return bigrams


class BigramVectorizer:
    """Complete bigram vectorizer with vocabulary management."""

    def __init__(
        self,
        max_vocab_size: int = None,
        min_df: int = 1,
        include_unigrams: bool = True,
        separator: str = "_"
    ):
        self.max_vocab_size = max_vocab_size
        self.min_df = min_df
        self.include_unigrams = include_unigrams
        self.separator = separator
        self.vocabulary_: Dict[str, int] = {}
        self.doc_freq_: Dict[str, int] = {}

    def _tokenize(self, text: str) -> List[str]:
        """Simple whitespace tokenization."""
        return text.lower().split()

    def _extract_features(self, tokens: List[str]) -> List[str]:
        """Extract all features (unigrams and/or bigrams)."""
        features = []
        if self.include_unigrams:
            features.extend(tokens)
        # Add bigrams
        features.extend(
            BigramExtractor.extract_bigrams(tokens, self.separator)
        )
        return features

    def fit(self, documents: List[str]) -> 'BigramVectorizer':
        """Build the vocabulary from a corpus."""
        doc_freq = Counter()
        term_freq = Counter()

        for doc in documents:
            tokens = self._tokenize(doc)
            features = self._extract_features(tokens)
            # Document frequencies count each feature once per document
            doc_freq.update(set(features))
            # Term frequencies count every occurrence
            term_freq.update(features)

        # Filter by minimum document frequency
        candidates = [
            term for term, freq in doc_freq.items() if freq >= self.min_df
        ]

        # Sort by corpus frequency, most frequent first
        sorted_terms = sorted(
            candidates, key=lambda t: term_freq[t], reverse=True
        )

        # Limit vocabulary size
        if self.max_vocab_size:
            sorted_terms = sorted_terms[:self.max_vocab_size]

        # Build vocabulary mapping
        self.vocabulary_ = {term: idx for idx, term in enumerate(sorted_terms)}
        self.doc_freq_ = {term: doc_freq[term] for term in sorted_terms}
        return self

    def transform(self, documents: List[str]) -> csr_matrix:
        """Transform documents to sparse count vectors."""
        rows, cols, data = [], [], []
        for doc_idx, doc in enumerate(documents):
            tokens = self._tokenize(doc)
            features = self._extract_features(tokens)
            # Count features in this document
            for feature, count in Counter(features).items():
                if feature in self.vocabulary_:
                    rows.append(doc_idx)
                    cols.append(self.vocabulary_[feature])
                    data.append(count)
        return csr_matrix(
            (data, (rows, cols)),
            shape=(len(documents), len(self.vocabulary_))
        )

    def analyze_vocabulary_composition(self) -> Dict:
        """Analyze the unigram/bigram composition of the vocabulary."""
        unigram_count = 0
        bigram_count = 0
        for term in self.vocabulary_:
            if self.separator in term:
                bigram_count += 1
            else:
                unigram_count += 1
        total = len(self.vocabulary_)
        return {
            'total_vocabulary': total,
            'unigram_count': unigram_count,
            'bigram_count': bigram_count,
            'unigram_ratio': unigram_count / total if total else 0,
            'bigram_ratio': bigram_count / total if total else 0,
        }


def demonstrate_bigram_extraction():
    """Demonstrate bigram extraction and analysis."""
    # Sample corpus with sentiment patterns
    corpus = [
        "This movie was not good at all",
        "I really loved this amazing film",
        "The acting was terrible and boring",
        "What a wonderful and beautiful story",
        "Not impressed by this disappointing movie",
        "Absolutely fantastic performance by the cast",
        "The worst film I have ever seen",
        "A truly remarkable cinematic experience",
    ]
    labels = [0, 1, 0, 1, 0, 1, 0, 1]  # 0 = negative, 1 = positive

    # Compare unigram-only vs. unigram+bigram vocabularies
    print("=" * 60)
    print("VOCABULARY ANALYSIS")
    print("=" * 60)

    # Unigram only: override feature extraction to skip bigrams
    unigram_vec = BigramVectorizer(include_unigrams=True, min_df=1)
    unigram_vec._extract_features = lambda tokens: tokens
    unigram_vec.fit(corpus)
    print(f"Unigram-only vocabulary size: {len(unigram_vec.vocabulary_)}")

    # Unigram + bigram
    bigram_vec = BigramVectorizer(include_unigrams=True, min_df=1)
    bigram_vec.fit(corpus)
    composition = bigram_vec.analyze_vocabulary_composition()
    print("Unigram+Bigram vocabulary:")
    for key, value in composition.items():
        if isinstance(value, float):
            print(f"  {key}: {value:.2%}")
        else:
            print(f"  {key}: {value}")

    # Show sentiment-relevant bigrams
    print("\n" + "=" * 60)
    print("SENTIMENT-RELEVANT BIGRAMS")
    print("=" * 60)
    for doc, label in zip(corpus, labels):
        tokens = doc.lower().split()
        bigrams = BigramExtractor.extract_bigrams(tokens)
        sentiment = "Positive" if label == 1 else "Negative"
        print(f"[{sentiment}] '{doc}'")
        print(f"  Bigrams: {bigrams}")

    # Skip-bigram demonstration
    print("\n" + "=" * 60)
    print("SKIP-BIGRAM DEMONSTRATION")
    print("=" * 60)
    text = "not at all good"
    tokens = text.split()
    print(f"Text: '{text}'")
    print(f"Regular bigrams: {BigramExtractor.extract_bigrams(tokens)}")
    print(f"Skip-bigrams (k=2): {BigramExtractor.extract_skip_bigrams(tokens, max_skip=2)}")
    # Note: "not_good" appears in the skip-bigrams, capturing negation despite the gap


if __name__ == "__main__":
    demonstrate_bigram_extraction()
```

scikit-learn's CountVectorizer natively supports n-gram extraction through the ngram_range parameter. Understanding its configuration is essential for production deployments.
```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
import numpy as np


def comprehensive_bigram_example():
    """Complete bigram implementation with scikit-learn."""
    # Sentiment analysis dataset
    texts = [
        "This movie was not good at all",
        "I really loved this amazing film incredible",
        "The acting was terrible and boring throughout",
        "What a wonderful and beautiful story telling",
        "Not impressed by this disappointing movie waste",
        "Absolutely fantastic performance by the cast brilliant",
        "The worst film I have ever seen awful",
        "A truly remarkable cinematic experience outstanding",
        "This was not bad actually quite enjoyable",
        "The plot was confusing and the pacing slow",
    ]
    labels = [0, 1, 0, 1, 0, 1, 0, 1, 1, 0]

    # Configuration options for n-gram ranges
    configurations = [
        ('Unigrams only', (1, 1)),
        ('Bigrams only', (2, 2)),
        ('Unigrams + Bigrams', (1, 2)),
    ]

    print("=" * 60)
    print("N-GRAM RANGE COMPARISON")
    print("=" * 60)

    for name, ngram_range in configurations:
        vectorizer = CountVectorizer(
            ngram_range=ngram_range,
            min_df=1,
            lowercase=True,
        )
        X = vectorizer.fit_transform(texts)
        vocab_size = len(vectorizer.vocabulary_)

        # Show sample features
        feature_names = vectorizer.get_feature_names_out()
        print(f"{name}:")
        print(f"  Vocabulary size: {vocab_size}")
        print(f"  Matrix shape: {X.shape}")
        print(f"  Sample features: {list(feature_names[:10])}")

        # If bigrams are included, show some bigram features
        if ngram_range[1] >= 2:
            bigram_features = [f for f in feature_names if ' ' in f]
            print(f"  Sample bigrams: {bigram_features[:10]}")

    # Build classification pipelines
    print("\n" + "=" * 60)
    print("CLASSIFICATION COMPARISON")
    print("=" * 60)

    for name, ngram_range in configurations:
        pipeline = Pipeline([
            ('vectorizer', CountVectorizer(
                ngram_range=ngram_range,
                min_df=1,
                lowercase=True,
                stop_words='english',
            )),
            ('classifier', LogisticRegression(random_state=42, max_iter=1000)),
        ])

        # Training-set accuracy only: the dataset is too small for
        # meaningful cross-validation
        pipeline.fit(texts, labels)
        train_accuracy = pipeline.score(texts, labels)
        print(f"{name}:")
        print(f"  Training accuracy: {train_accuracy:.2%}")

        # Analyze feature importance for unigram+bigram
        if ngram_range == (1, 2):
            feature_names = pipeline.named_steps['vectorizer'].get_feature_names_out()
            coefficients = pipeline.named_steps['classifier'].coef_[0]
            # Top positive and negative features
            indices = np.argsort(coefficients)
            print("  Top negative indicators:")
            for idx in indices[:5]:
                print(f"    '{feature_names[idx]}': {coefficients[idx]:.3f}")
            print("  Top positive indicators:")
            for idx in indices[-5:]:
                print(f"    '{feature_names[idx]}': {coefficients[idx]:.3f}")


def vocabulary_explosion_analysis():
    """Demonstrate vocabulary growth with n-gram order."""
    print("\n" + "=" * 60)
    print("VOCABULARY EXPLOSION ANALYSIS")
    print("=" * 60)

    # Generate a larger corpus for more realistic statistics
    base_texts = [
        "machine learning is transforming how we build software",
        "deep learning neural networks are powerful models",
        "natural language processing enables text understanding",
        "computer vision systems can analyze images effectively",
        "reinforcement learning agents learn from rewards",
    ] * 20

    results = []
    for n in range(1, 5):
        vectorizer = CountVectorizer(ngram_range=(1, n), min_df=1)
        vectorizer.fit(base_texts)
        vocab_size = len(vectorizer.vocabulary_)

        # Count n-grams by order (order = number of spaces + 1)
        feature_names = vectorizer.get_feature_names_out()
        ngram_counts = {i: 0 for i in range(1, n + 1)}
        for feature in feature_names:
            order = feature.count(' ') + 1
            if order <= n:
                ngram_counts[order] += 1

        results.append({
            'max_n': n,
            'total_vocab': vocab_size,
            'breakdown': ngram_counts,
        })
        print(f"Up to {n}-grams:")
        print(f"  Total vocabulary: {vocab_size}")
        print(f"  Breakdown: {ngram_counts}")

    # Show growth rate
    print("Vocabulary Growth:")
    for i in range(1, len(results)):
        prev = results[i - 1]['total_vocab']
        curr = results[i]['total_vocab']
        growth = (curr - prev) / prev * 100
        print(f"  {results[i-1]['max_n']}-gram → {results[i]['max_n']}-gram: +{growth:.1f}%")


def tfidf_with_bigrams():
    """Demonstrate TF-IDF weighted bigrams."""
    print("\n" + "=" * 60)
    print("TF-IDF WITH BIGRAMS")
    print("=" * 60)

    documents = [
        "machine learning is a subset of artificial intelligence",
        "deep learning uses neural networks with many layers",
        "natural language processing handles text and speech",
        "machine learning algorithms learn patterns from data",
        "artificial intelligence includes machine learning and more",
    ]

    # TF-IDF with unigrams and bigrams
    tfidf = TfidfVectorizer(
        ngram_range=(1, 2),
        min_df=1,
        use_idf=True,
        smooth_idf=True,
        sublinear_tf=True,  # Use 1 + log(tf)
    )
    X = tfidf.fit_transform(documents)
    feature_names = tfidf.get_feature_names_out()

    # Show the top-weighted terms for each document
    for doc_idx, doc in enumerate(documents):
        print(f"Document {doc_idx + 1}: '{doc[:50]}...'")
        # Get non-zero features for this document
        doc_vector = X[doc_idx].toarray()[0]
        nonzero_indices = np.where(doc_vector > 0)[0]
        # Sort by TF-IDF score, highest first
        top = sorted(
            nonzero_indices, key=lambda i: doc_vector[i], reverse=True
        )[:5]
        print("  Top TF-IDF features:")
        for idx in top:
            print(f"    '{feature_names[idx]}': {doc_vector[idx]:.3f}")


if __name__ == "__main__":
    comprehensive_bigram_example()
    vocabulary_explosion_analysis()
    tfidf_with_bigrams()
```

The most significant challenge with bigrams is vocabulary explosion: the combinatorial growth of the feature space as we consider word pairs.
Theoretical vs. Practical Growth:
For a unigram vocabulary of size V, the theoretical bigram space is V², while the observed vocabulary grows roughly as V^1.4 for natural language (per the empirical formula above); both far outpace unigram growth.
This explosion creates several challenges: memory consumption for the vocabulary and feature matrices, slower training and inference over much wider feature vectors, and a long tail of rare bigrams that invites overfitting.
Mitigation Strategies:
1. Aggressive Frequency Filtering: apply higher min_df thresholds to bigrams than to unigrams, discarding pairs that appear in only a handful of documents.
2. Maximum Vocabulary Size: cap the total feature count, keeping only the most frequent terms.
3. Chi-Square Feature Selection: keep the bigrams most strongly associated with the class labels.
4. Feature Hashing: hash features into a fixed-size index space, so memory is bounded regardless of vocabulary growth.
5. Hierarchical Filtering: filter unigrams first, then admit only bigrams whose component words survived.
6. Mutual Information Filtering: keep bigrams whose components co-occur more often than chance predicts.
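The feature-hashing strategy can be sketched from scratch in a few lines (the helper name `hashed_counts` and the bucket count are ours, for illustration only):

```python
import hashlib

# From-scratch sketch of the hashing trick: each feature string is hashed
# into one of n_buckets slots, so memory stays fixed no matter how many
# distinct bigrams the corpus generates (at the cost of rare collisions).
def hashed_counts(tokens, n_buckets=1024):
    vec = [0] * n_buckets
    features = tokens + [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    for feat in features:
        # md5 gives a hash that is stable across runs, unlike Python's hash()
        bucket = int(hashlib.md5(feat.encode()).hexdigest(), 16) % n_buckets
        vec[bucket] += 1
    return vec

v = hashed_counts("not good at all".split())
print(sum(v))  # 7 = 4 unigrams + 3 bigrams
```

In production, scikit-learn's HashingVectorizer implements the same idea at scale and accepts an ngram_range parameter, trading the ability to inspect feature names for bounded memory.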
For most text classification tasks, limit your total vocabulary (unigrams + bigrams) to 50,000-100,000 features. Use min_df=5 or higher for bigrams. Monitor the unigram/bigram ratio—if bigrams dominate (>80%), you may be overfitting to rare patterns. A typical healthy ratio is 60-70% unigrams, 30-40% bigrams.
Not all bigrams are equally useful. Some represent genuine linguistic collocations ("hot dog", "New York"), while others are accidental adjacencies ("the the", "is a"). Identifying quality bigrams improves feature engineering.
Pointwise Mutual Information (PMI):
PMI measures how much more likely two words are to co-occur than expected by chance:
PMI(w₁, w₂) = log₂[P(w₁, w₂) / (P(w₁) × P(w₂))]
Limitations of PMI:
PMI favors rare bigrams: if two rare words co-occur even once, their PMI can be artificially high. Solutions include enforcing a minimum count threshold before scoring, normalizing PMI (NPMI) so scores are bounded, and weighting scores by frequency so that well-attested collocations outrank one-off coincidences.
```python
import math
from collections import Counter
from typing import List, Dict, Tuple


class BigramQualityAnalyzer:
    """Analyze bigram quality using statistical measures."""

    def __init__(self, min_count: int = 5):
        self.min_count = min_count
        self.unigram_counts: Counter = Counter()
        self.bigram_counts: Counter = Counter()
        self.total_tokens: int = 0
        self.total_bigrams: int = 0

    def fit(self, tokenized_documents: List[List[str]]) -> 'BigramQualityAnalyzer':
        """Count unigrams and bigrams over the corpus."""
        for tokens in tokenized_documents:
            self.unigram_counts.update(tokens)
            self.total_tokens += len(tokens)
            for i in range(len(tokens) - 1):
                self.bigram_counts[(tokens[i], tokens[i + 1])] += 1
                self.total_bigrams += 1
        return self

    def compute_pmi(self, w1: str, w2: str) -> float:
        """Pointwise Mutual Information:

        PMI(w1, w2) = log2[P(w1, w2) / (P(w1) * P(w2))]
        """
        bigram = (w1, w2)
        if self.bigram_counts[bigram] < self.min_count:
            return float('-inf')
        p_w1 = self.unigram_counts[w1] / self.total_tokens
        p_w2 = self.unigram_counts[w2] / self.total_tokens
        p_bigram = self.bigram_counts[bigram] / self.total_bigrams
        if p_w1 == 0 or p_w2 == 0 or p_bigram == 0:
            return float('-inf')
        return math.log2(p_bigram / (p_w1 * p_w2))

    def compute_npmi(self, w1: str, w2: str) -> float:
        """Normalized PMI, bounded to [-1, 1]:

        NPMI = PMI / -log2(P(w1, w2))
        """
        bigram = (w1, w2)
        if self.bigram_counts[bigram] < self.min_count:
            return float('-inf')
        pmi = self.compute_pmi(w1, w2)
        if pmi == float('-inf'):
            return float('-inf')
        p_bigram = self.bigram_counts[bigram] / self.total_bigrams
        if p_bigram == 0:
            return float('-inf')
        return pmi / (-math.log2(p_bigram))

    def compute_ppmi(self, w1: str, w2: str) -> float:
        """Positive PMI: negative values are clipped to 0."""
        pmi = self.compute_pmi(w1, w2)
        return max(pmi, 0) if pmi != float('-inf') else 0

    def get_top_bigrams_by_pmi(
        self,
        n: int = 20,
        metric: str = 'npmi'
    ) -> List[Tuple[Tuple[str, str], float]]:
        """Get the top bigrams ranked by the specified metric."""
        results = []
        for bigram, count in self.bigram_counts.items():
            if count < self.min_count:
                continue
            w1, w2 = bigram
            if metric == 'pmi':
                score = self.compute_pmi(w1, w2)
            elif metric == 'npmi':
                score = self.compute_npmi(w1, w2)
            elif metric == 'ppmi':
                score = self.compute_ppmi(w1, w2)
            elif metric == 'frequency':
                score = count
            else:
                raise ValueError(f"Unknown metric: {metric}")
            if score != float('-inf'):
                results.append((bigram, score))
        # Sort by score, descending
        results.sort(key=lambda x: x[1], reverse=True)
        return results[:n]

    def compare_metrics(self, bigram: Tuple[str, str]) -> Dict[str, float]:
        """Compare all metrics for a single bigram."""
        w1, w2 = bigram
        return {
            'count': self.bigram_counts[bigram],
            'pmi': self.compute_pmi(w1, w2),
            'npmi': self.compute_npmi(w1, w2),
            'ppmi': self.compute_ppmi(w1, w2),
        }


def demonstrate_bigram_quality():
    """Demonstrate bigram quality assessment."""
    # Simulated corpus with clear collocation patterns
    corpus = [
        "machine learning is transforming the technology industry today",
        "deep learning and neural networks are powerful machine learning techniques",
        "natural language processing uses machine learning for text analysis",
        "new york city is a major technology hub for machine learning",
        "los angeles has many artificial intelligence startups",
        "machine learning applications include computer vision and nlp",
        "the technology industry uses deep learning for many applications",
        "neural networks are the foundation of deep learning systems",
        "artificial intelligence and machine learning are related fields",
        "computer vision uses convolutional neural networks effectively",
    ] * 10  # Repeat for better statistics

    # Tokenize and analyze
    tokenized = [doc.lower().split() for doc in corpus]
    analyzer = BigramQualityAnalyzer(min_count=3)
    analyzer.fit(tokenized)

    print("=" * 60)
    print("BIGRAM QUALITY ANALYSIS")
    print("=" * 60)

    # Compare metrics for specific bigrams
    test_bigrams = [
        ("machine", "learning"),
        ("deep", "learning"),
        ("new", "york"),
        ("the", "technology"),
        ("is", "a"),
        ("neural", "networks"),
    ]

    print("Metric Comparison for Selected Bigrams:")
    print("-" * 50)
    print(f"{'Bigram':<25} {'Count':<8} {'PMI':<8} {'NPMI':<8}")
    print("-" * 50)
    for bigram in test_bigrams:
        if bigram in analyzer.bigram_counts:
            metrics = analyzer.compare_metrics(bigram)
            print(f"{str(bigram):<25} {metrics['count']:<8} "
                  f"{metrics['pmi']:<8.2f} {metrics['npmi']:<8.3f}")

    # Top bigrams by different metrics
    print("\n" + "=" * 60)
    print("TOP BIGRAMS BY METRIC")
    print("=" * 60)
    for metric in ['frequency', 'npmi']:
        print(f"Top 10 by {metric.upper()}:")
        top_bigrams = analyzer.get_top_bigrams_by_pmi(n=10, metric=metric)
        for i, (bigram, score) in enumerate(top_bigrams, 1):
            print(f"  {i}. {bigram}: {score:.3f}")


if __name__ == "__main__":
    demonstrate_bigram_quality()
```

Bigrams represent a fundamental leap from bag-of-words to sequence-aware text representation.
We've explored their theory, implementation, and practical considerations: the Markov view of local context, efficient extraction and vectorization, strategies for containing vocabulary explosion, and PMI-based measures of bigram quality.
Looking Ahead: General N-grams
Bigrams are just the beginning of sequence-aware feature engineering. In the next page, we'll generalize to n-grams of arbitrary length, exploring when trigrams, 4-grams, and higher orders add value—and when they don't. We'll also examine the mathematical framework that unifies all n-gram representations.
The journey from unigrams through bigrams to general n-grams represents increasing capture of sequential context, each step trading off expressiveness against computational cost.
You now understand bigrams as ordered word pairs that capture local context. You can implement bigram extraction, manage vocabulary explosion, assess bigram quality using PMI/NPMI, and make informed decisions about when bigrams add value to your NLP pipeline. Next, we'll generalize to n-grams of arbitrary length.