Having explored unigrams (n=1) and bigrams (n=2), we now confront the general question: How much sequential context should we capture?
Should we use trigrams to distinguish "not very good" from "very not good"? Do we need 4-grams to capture "New York Stock Exchange"? At what point do we hit diminishing returns?
The answer isn't purely theoretical—it depends on your task, data volume, computational budget, and the linguistic phenomena you need to capture. This page provides the framework to make these decisions rigorously, examining n-grams as a unified mathematical framework and developing practical intuition for n-gram order selection.
By the end of this page, you will understand the mathematical generalization of n-grams, analyze the exponential vocabulary growth with n, develop intuition for optimal n-gram order selection, implement efficient multi-order n-gram extraction, and understand how n-gram order relates to the bias-variance tradeoff in language modeling.
An n-gram is a contiguous sequence of n tokens from a given text. The term generalizes unigrams (n=1), bigrams (n=2), trigrams (n=3), and extends to arbitrary sequence lengths.
Formal Definition:
Given a token sequence T = [t₁, t₂, ..., tₖ] of length k, the set of n-grams Gₙ(T) is:
Gₙ(T) = {(tᵢ, tᵢ₊₁, ..., tᵢ₊ₙ₋₁) | 1 ≤ i ≤ k - n + 1}
Key properties:

- A sequence of k tokens yields exactly k - n + 1 n-grams (and none when k < n).
- Consecutive n-grams overlap in n - 1 tokens, so each token participates in up to n different n-grams.
- Token order is preserved: "quick brown" and "brown quick" are distinct bigrams.
Naming Convention:
| n | Name | Example ("the quick brown fox") |
|---|---|---|
| 1 | Unigram | "the", "quick", "brown", "fox" |
| 2 | Bigram | "the quick", "quick brown", "brown fox" |
| 3 | Trigram | "the quick brown", "quick brown fox" |
| 4 | 4-gram (Quadgram) | "the quick brown fox" |
| 5 | 5-gram (Pentagram) | — (sequence too short) |
Beyond trigrams, numeric naming (4-gram, 5-gram) is standard.
N-grams of order n correspond to (n-1)-order Markov models in language modeling. An n-gram language model estimates P(wᵢ | wᵢ₋ₙ₊₁, ..., wᵢ₋₁), assuming that the probability of the next word depends only on the previous n-1 words. Trigrams → 2nd-order Markov, bigrams → 1st-order Markov, unigrams → 0th-order (words treated as independent).
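For example, under a trigram (2nd-order Markov) model, P(fox | the, quick, brown) is approximated by P(fox | quick, brown): only the two most recent words are kept as context.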
Multi-Order N-gram Ranges:
In practice, we often extract n-grams across a range of orders. The notation (min_n, max_n) indicates that we extract all n-grams from min_n to max_n inclusive.
The combined feature set is the union of all n-gram sets:
Features(T) = Gₘᵢₙ(T) ∪ Gₘᵢₙ₊₁(T) ∪ ... ∪ Gₘₐₓ(T)
Scikit-learn's ngram_range parameter uses exactly this notation.
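A minimal sketch (using a made-up one-sentence corpus) of how ngram_range=(1, 2) yields the union of the unigram and bigram feature sets:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the quick brown fox"]

# ngram_range=(1, 2) extracts G1 ∪ G2: all unigrams plus all bigrams
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
# ['brown' 'brown fox' 'fox' 'quick' 'quick brown' 'the' 'the quick']
```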
Understanding how vocabulary size scales with n-gram order is essential for system design. The growth is bounded by combinatorics but constrained by linguistic patterns.
Theoretical Upper Bounds:
For a unigram vocabulary of size V:
| N-gram Order | Maximum Possible | With V = 10,000 |
|---|---|---|
| 1 (unigrams) | V | 10,000 |
| 2 (bigrams) | V² | 100,000,000 |
| 3 (trigrams) | V³ | 1,000,000,000,000 |
| 4 (4-grams) | V⁴ | 10¹⁶ |
These astronomical numbers suggest n-grams are impractical. Fortunately, natural language is highly constrained.
Empirical Reality:
In real corpora, the observed n-gram vocabulary grows far more slowly than the combinatorial bound. A rough empirical rule of thumb is:

|Vₙ| ≈ k × |V₁|^α

where k is a corpus-dependent constant and α ≈ 1.3-1.5 (superlinear in the unigram vocabulary size, but far below the exponential worst case). Syntax, semantics, and collocational conventions sharply restrict which word combinations actually occur:
| N-gram Order | Typical Vocabulary Size | % of Theoretical Max |
|---|---|---|
| 1 (unigrams) | 50,000 - 200,000 | 100% |
| 2 (bigrams) | 500,000 - 2,000,000 | 0.05% - 0.2% |
| 3 (trigrams) | 2,000,000 - 10,000,000 | < 0.001% |
| 4 (4-grams) | 5,000,000 - 50,000,000 | < 0.0000001% |
| 5 (5-grams) | 10,000,000 - 100,000,000 | < 10⁻¹²% |
Hapax Legomena (Singletons):

As n increases, the proportion of n-grams that appear exactly once in the corpus (hapax legomena) grows dramatically. This has critical implications:
As n increases, the probability that a specific n-gram in your test data was never seen during training approaches 100%. This is why language models use sophisticated smoothing and why very high-order n-grams are rarely useful for feature engineering—they simply don't generalize.
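A minimal sketch of measuring this effect, using a toy three-sentence corpus chosen purely for illustration:

```python
from collections import Counter

corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown dog sleeps under the old tree",
    "a lazy brown fox naps near the quiet river",
]

for n in range(1, 5):
    counts = Counter()
    for doc in corpus:
        tokens = doc.split()
        # Count every n-gram of order n across the corpus
        counts.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    # Hapax legomena: n-grams observed exactly once
    hapax = sum(1 for c in counts.values() if c == 1)
    print(f"n={n}: {len(counts)} distinct n-grams, {hapax / len(counts):.0%} hapax")
```

Even on this tiny corpus, the hapax ratio climbs quickly with n; on real corpora the same trend holds at scale.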
Efficient n-gram extraction requires attention to memory usage and computation patterns, especially for higher orders.
```python
from typing import List, Dict, Iterator, Tuple
from collections import Counter

import numpy as np
from scipy.sparse import csr_matrix


class NgramExtractor:
    """
    Comprehensive n-gram extraction with multiple strategies.
    """

    @staticmethod
    def extract_ngrams(
        tokens: List[str],
        n: int,
        separator: str = "_"
    ) -> List[str]:
        """
        Extract all n-grams of a specific order.

        Time Complexity: O(k * n) where k = len(tokens)
        Space Complexity: O(k - n + 1) for output
        """
        if len(tokens) < n:
            return []
        return [
            separator.join(tokens[i:i+n])
            for i in range(len(tokens) - n + 1)
        ]

    @staticmethod
    def extract_ngram_range(
        tokens: List[str],
        min_n: int,
        max_n: int,
        separator: str = "_"
    ) -> List[str]:
        """
        Extract n-grams for a range of orders [min_n, max_n].

        Time Complexity: O(k * (max_n - min_n + 1) * max_n)
        """
        ngrams = []
        for n in range(min_n, max_n + 1):
            ngrams.extend(
                NgramExtractor.extract_ngrams(tokens, n, separator)
            )
        return ngrams

    @staticmethod
    def extract_ngrams_generator(
        tokens: List[str],
        n: int,
        separator: str = "_"
    ) -> Iterator[str]:
        """
        Memory-efficient generator version for large documents.
        Yields n-grams one at a time without building full list.
        """
        k = len(tokens)
        if k < n:
            return
        for i in range(k - n + 1):
            yield separator.join(tokens[i:i+n])

    @staticmethod
    def extract_with_positions(
        tokens: List[str],
        n: int
    ) -> List[Tuple[Tuple[str, ...], int]]:
        """
        Extract n-grams with their starting positions.
        Useful for analysis and document annotation.
        """
        return [
            (tuple(tokens[i:i+n]), i)
            for i in range(len(tokens) - n + 1)
        ]


class NgramVectorizer:
    """
    Full n-gram vectorizer with vocabulary management.
    """

    def __init__(
        self,
        min_n: int = 1,
        max_n: int = 2,
        max_vocab_size: int = None,
        min_df: int = 1,
        max_df_ratio: float = 1.0,
        separator: str = "_"
    ):
        self.min_n = min_n
        self.max_n = max_n
        self.max_vocab_size = max_vocab_size
        self.min_df = min_df
        self.max_df_ratio = max_df_ratio
        self.separator = separator

        self.vocabulary_: Dict[str, int] = {}
        self.term_freq_: Counter = Counter()
        self.doc_freq_: Counter = Counter()
        self.ngram_order_: Dict[str, int] = {}

    def _tokenize(self, text: str) -> List[str]:
        """Simple whitespace tokenization."""
        return text.lower().split()

    def _extract_features(self, tokens: List[str]) -> List[str]:
        """Extract all n-grams in configured range."""
        return NgramExtractor.extract_ngram_range(
            tokens, self.min_n, self.max_n, self.separator
        )

    def fit(self, documents: List[str]) -> 'NgramVectorizer':
        """
        Build vocabulary from corpus.
        """
        num_docs = len(documents)

        for doc in documents:
            tokens = self._tokenize(doc)
            features = self._extract_features(tokens)

            # Count frequencies
            self.term_freq_.update(features)
            self.doc_freq_.update(set(features))

        # Determine n-gram order for each feature
        for feature in self.term_freq_:
            order = feature.count(self.separator) + 1
            self.ngram_order_[feature] = order

        # Filter vocabulary
        max_df = int(num_docs * self.max_df_ratio)
        candidates = [
            term for term, freq in self.doc_freq_.items()
            if freq >= self.min_df and freq <= max_df
        ]

        # Sort by frequency
        sorted_terms = sorted(
            candidates,
            key=lambda t: self.term_freq_[t],
            reverse=True
        )

        # Limit size
        if self.max_vocab_size:
            sorted_terms = sorted_terms[:self.max_vocab_size]

        self.vocabulary_ = {
            term: idx for idx, term in enumerate(sorted_terms)
        }

        return self

    def transform(self, documents: List[str]) -> csr_matrix:
        """Transform documents to sparse feature matrix."""
        rows, cols, data = [], [], []

        for doc_idx, doc in enumerate(documents):
            tokens = self._tokenize(doc)
            features = self._extract_features(tokens)
            feature_counts = Counter(features)

            for feature, count in feature_counts.items():
                if feature in self.vocabulary_:
                    rows.append(doc_idx)
                    cols.append(self.vocabulary_[feature])
                    data.append(count)

        return csr_matrix(
            (data, (rows, cols)),
            shape=(len(documents), len(self.vocabulary_))
        )

    def analyze_vocabulary_composition(self) -> Dict:
        """
        Analyze vocabulary by n-gram order.
        """
        order_counts = Counter()
        order_total_freq = Counter()

        for term in self.vocabulary_:
            order = self.ngram_order_[term]
            order_counts[order] += 1
            order_total_freq[order] += self.term_freq_[term]

        total_vocab = len(self.vocabulary_)

        analysis = {
            'total_vocabulary': total_vocab,
            'by_order': {}
        }

        for order in sorted(order_counts.keys()):
            analysis['by_order'][order] = {
                'count': order_counts[order],
                'ratio': order_counts[order] / total_vocab,
                'total_freq': order_total_freq[order],
            }

        return analysis


def demonstrate_ngram_extraction():
    """
    Demonstrate n-gram extraction across different orders.
    """
    text = "the quick brown fox jumps over the lazy dog"
    tokens = text.split()

    print("="*60)
    print("N-GRAM EXTRACTION DEMONSTRATION")
    print("="*60)
    print(f"\nText: '{text}'")
    print(f"Tokens: {tokens}")
    print(f"Token count: {len(tokens)}")

    for n in range(1, 6):
        ngrams = NgramExtractor.extract_ngrams(tokens, n, " ")
        print(f"\n{n}-grams ({len(ngrams)} total):")
        for ng in ngrams:
            print(f"  {ng}")


def analyze_vocabulary_growth():
    """
    Analyze vocabulary growth with n-gram order.
    """
    print("\n" + "="*60)
    print("VOCABULARY GROWTH ANALYSIS")
    print("="*60)

    # Sample corpus
    corpus = [
        "machine learning is transforming how we build software applications",
        "deep learning uses neural networks with multiple hidden layers",
        "natural language processing enables text understanding and generation",
        "computer vision systems can analyze and interpret visual data",
        "reinforcement learning agents learn optimal policies from rewards",
        "supervised learning requires labeled training data examples",
        "unsupervised learning discovers patterns without labels",
        "transfer learning leverages knowledge from related tasks",
    ] * 50  # Larger corpus for better statistics

    results = []

    for max_n in range(1, 6):
        vectorizer = NgramVectorizer(
            min_n=1,
            max_n=max_n,
            min_df=2,  # Appear in at least 2 docs
        )
        vectorizer.fit(corpus)

        analysis = vectorizer.analyze_vocabulary_composition()
        results.append({
            'max_n': max_n,
            'total_vocab': analysis['total_vocabulary'],
            'by_order': analysis['by_order']
        })

        print(f"\nUp to {max_n}-grams:")
        print(f"  Total vocabulary: {analysis['total_vocabulary']}")
        for order, stats in analysis['by_order'].items():
            print(f"  {order}-grams: {stats['count']} ({stats['ratio']:.1%})")

    # Calculate growth rates
    print("\nVocabulary Growth Rates:")
    for i in range(1, len(results)):
        prev = results[i-1]['total_vocab']
        curr = results[i]['total_vocab']
        if prev > 0:
            growth = (curr - prev) / prev
            abs_growth = curr - prev
            print(f"  {i}-grams → {i+1}-grams: +{abs_growth} features (+{growth:.1%})")


if __name__ == "__main__":
    demonstrate_ngram_extraction()
    analyze_vocabulary_growth()
```

Selecting the optimal n-gram range is a classic bias-variance tradeoff. Higher orders capture more context (lower bias) but generalize worse (higher variance). The optimal choice depends on task characteristics, data volume, and computational constraints.
Task-Dependent Guidelines:
| Task | Recommended Range | Rationale |
|---|---|---|
| Topic Classification | (1, 1) or (1, 2) | Topics defined by word presence, not order |
| Sentiment Analysis | (1, 2) or (1, 3) | Negation and intensifiers require bigrams/trigrams |
| Authorship Attribution | (2, 5) or character n-grams | Writing style captured by phrase patterns |
| Spam Detection | (1, 2) | Spam keywords + common spam phrases |
| Language Identification | Character (3, 5) | Character patterns distinguish languages |
| Named Entity Recognition | (1, 3) | Multi-word entity names |
| Machine Translation | (1, 4) with smoothing | Phrase-level translation patterns |
Empirical Order Selection:
When task guidelines are unclear, select the n-gram range empirically with cross-validation (a sketch follows this list):

- Build otherwise-identical pipelines for candidate ranges such as (1, 1), (1, 2), (1, 3), and so on.
- Score each candidate with k-fold cross-validation on the training data.
- Prefer the smallest range whose score is within noise of the best; wider ranges cost memory and risk overfitting.
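The sketch below shows one way to run this comparison with scikit-learn; the tiny corpus, labels, and candidate ranges are illustrative assumptions, not a benchmark.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy sentiment-style data, repeated so cross-validation has enough samples
texts = ["not very good", "very good indeed", "not good at all", "good value"] * 25
labels = [0, 1, 0, 1] * 25

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Candidate n-gram ranges; keep the smallest one that scores competitively
param_grid = {"tfidf__ngram_range": [(1, 1), (1, 2), (1, 3)]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(texts, labels)

print("Best ngram_range:", search.best_params_["tfidf__ngram_range"])
for params, score in zip(search.cv_results_["params"],
                         search.cv_results_["mean_test_score"]):
    print(params["tfidf__ngram_range"], f"{score:.3f}")
```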
Common Findings:
In practice, (1, 2) n-grams capture roughly 80% of the benefit with 20% of the complexity compared to higher orders. Start with unigrams + bigrams, and only add trigrams if validation performance meaningfully improves. The computational and memory costs of trigrams rarely justify marginal accuracy gains.
Beyond feature engineering, n-grams form the foundation of n-gram language models—probabilistic models that predict the next word given previous context. Understanding this connection deepens appreciation for n-gram theory.
N-gram Language Model:
An n-gram language model estimates the probability of a text sequence using the chain rule with Markov approximation:
P(w₁, w₂, ..., wₖ) = ∏ᵢ P(wᵢ | w₁, ..., wᵢ₋₁)
With (n-1)-order Markov assumption:
P(wᵢ | w₁, ..., wᵢ₋₁) ≈ P(wᵢ | wᵢ₋ₙ₊₁, ..., wᵢ₋₁)
This approximation makes computation tractable while capturing local context.
Probability Estimation:
Maximum Likelihood Estimation (MLE):
P(wᵢ | wᵢ₋ₙ₊₁, ..., wᵢ₋₁) = count(wᵢ₋ₙ₊₁, ..., wᵢ) / count(wᵢ₋ₙ₊₁, ..., wᵢ₋₁)
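For example, with hypothetical counts, if the bigram "the quick" occurs 100 times in training and the trigram "the quick brown" occurs 30 times, a trigram model estimates P(brown | the, quick) = 30/100 = 0.3.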
Perplexity:
Language model quality is measured by perplexity—the inverse probability of the test set, normalized by sequence length:
PP(W) = P(w₁, w₂, ..., wₖ)^(-1/k), where k is the number of tokens in W
Lower perplexity indicates a better model; on the same test set, a well-smoothed trigram model typically achieves far lower perplexity than a unigram model.
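As a sanity check on the formula: a model that assigns probability 1/1000 to every token in the test set has perplexity exactly 1000, the same uncertainty as choosing uniformly among 1,000 equally likely words.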
```python
from collections import Counter, defaultdict
from typing import List, Dict, Tuple
import math
import random


class NgramLanguageModel:
    """
    N-gram language model with smoothing.

    Demonstrates the connection between n-gram features
    and probabilistic language modeling.
    """

    def __init__(self, n: int = 3, smoothing: str = 'laplace'):
        self.n = n
        self.smoothing = smoothing

        # Context → next word → count
        self.ngram_counts: Dict[Tuple, Counter] = defaultdict(Counter)
        # Context → total count
        self.context_counts: Counter = Counter()
        # Vocabulary
        self.vocabulary: set = set()

        # Special tokens
        self.START = '<s>'
        self.END = '</s>'

    def _pad_sequence(self, tokens: List[str]) -> List[str]:
        """Add start/end tokens for context at boundaries."""
        padding = [self.START] * (self.n - 1)
        return padding + tokens + [self.END]

    def _get_ngrams(self, tokens: List[str]) -> List[Tuple[Tuple, str]]:
        """Extract (context, next_word) pairs."""
        padded = self._pad_sequence(tokens)
        ngrams = []
        for i in range(len(padded) - self.n + 1):
            context = tuple(padded[i:i+self.n-1])
            next_word = padded[i+self.n-1]
            ngrams.append((context, next_word))
        return ngrams

    def fit(self, documents: List[List[str]]) -> 'NgramLanguageModel':
        """
        Train language model on tokenized documents.
        """
        for tokens in documents:
            self.vocabulary.update(tokens)
            for context, next_word in self._get_ngrams(tokens):
                self.ngram_counts[context][next_word] += 1
                self.context_counts[context] += 1

        self.vocabulary.add(self.END)
        return self

    def probability(self, word: str, context: Tuple[str, ...]) -> float:
        """
        Compute P(word | context) with smoothing.
        """
        V = len(self.vocabulary)

        if self.smoothing == 'laplace':
            # Add-1 (Laplace) smoothing
            count = self.ngram_counts[context][word]
            context_count = self.context_counts[context]
            return (count + 1) / (context_count + V)
        elif self.smoothing == 'mle':
            # Maximum likelihood (no smoothing)
            if self.context_counts[context] == 0:
                return 1 / V  # Fallback to uniform
            return self.ngram_counts[context][word] / self.context_counts[context]
        else:
            raise ValueError(f"Unknown smoothing: {self.smoothing}")

    def log_probability(self, tokens: List[str]) -> float:
        """
        Compute log probability of a token sequence.
        """
        log_prob = 0.0
        for context, next_word in self._get_ngrams(tokens):
            prob = self.probability(next_word, context)
            if prob <= 0:
                return float('-inf')
            log_prob += math.log2(prob)
        return log_prob

    def perplexity(self, test_documents: List[List[str]]) -> float:
        """
        Compute perplexity on test documents.

        PP = 2^(-1/N * sum(log2 P(w_i | context)))
        """
        total_log_prob = 0.0
        total_tokens = 0

        for tokens in test_documents:
            total_log_prob += self.log_probability(tokens)
            total_tokens += len(tokens) + 1  # +1 for </s>

        if total_tokens == 0:
            return float('inf')

        cross_entropy = -total_log_prob / total_tokens
        perplexity = 2 ** cross_entropy
        return perplexity

    def generate(self, max_length: int = 20, temperature: float = 1.0) -> List[str]:
        """
        Generate text using the language model.
        """
        context = tuple([self.START] * (self.n - 1))
        generated = []

        for _ in range(max_length):
            # Get distribution for next word
            next_word_counts = self.ngram_counts[context]
            if not next_word_counts:
                break

            # Sample from distribution
            words = list(next_word_counts.keys())
            counts = [next_word_counts[w] for w in words]

            # Apply temperature
            if temperature != 1.0:
                counts = [c ** (1/temperature) for c in counts]

            total = sum(counts)
            probs = [c / total for c in counts]

            next_word = random.choices(words, weights=probs)[0]

            if next_word == self.END:
                break

            generated.append(next_word)
            context = tuple(list(context)[1:] + [next_word])

        return generated


def demonstrate_language_model():
    """
    Demonstrate n-gram language model.
    """
    print("="*60)
    print("N-GRAM LANGUAGE MODEL DEMONSTRATION")
    print("="*60)

    # Training corpus (tokenized)
    training_data = [
        ["the", "quick", "brown", "fox", "jumps"],
        ["the", "quick", "brown", "dog", "runs"],
        ["the", "lazy", "brown", "fox", "sleeps"],
        ["a", "quick", "brown", "cat", "jumps"],
        ["the", "brown", "fox", "is", "quick"],
    ]

    # Test data
    test_data = [
        ["the", "quick", "brown", "fox"],
        ["the", "lazy", "brown", "cat"],
    ]

    # Compare different n values
    for n in [1, 2, 3]:
        print(f"\n{n}-gram Language Model:")
        model = NgramLanguageModel(n=n, smoothing='laplace')
        model.fit(training_data)

        pp = model.perplexity(test_data)
        print(f"  Perplexity: {pp:.2f}")

        # Generate sample text
        for _ in range(3):
            generated = model.generate(max_length=10)
            print(f"  Generated: {' '.join(generated)}")


if __name__ == "__main__":
    demonstrate_language_model()
```

Deploying n-gram features in production requires attention to several practical factors beyond pure algorithmic considerations.
```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
import numpy as np


def production_ngram_pipeline():
    """
    Production-ready n-gram pipeline with best practices.
    """
    # Sample data
    texts = [
        "This product is absolutely wonderful and amazing",
        "I really loved this excellent high quality item",
        "Terrible waste of money complete garbage",
        "The worst purchase I have ever made awful",
        "Outstanding quality highly recommend to everyone",
        "Disappointing experience would not buy again",
    ] * 10
    labels = [1, 1, 0, 0, 1, 0] * 10

    # Production pipeline with feature selection
    pipeline = Pipeline([
        ('vectorizer', TfidfVectorizer(
            ngram_range=(1, 2),
            min_df=2,
            max_df=0.95,
            max_features=10000,  # Hard limit
            lowercase=True,
            strip_accents='unicode',
            sublinear_tf=True,  # Use 1 + log(tf)
            stop_words='english',
        )),
        # Feature selection to reduce dimensionality
        ('feature_selection', SelectKBest(chi2, k=1000)),
        ('classifier', LogisticRegression(
            max_iter=1000,
            random_state=42,
            C=1.0,
        )),
    ])

    pipeline.fit(texts, labels)

    # Analyze selected features
    vectorizer = pipeline.named_steps['vectorizer']
    selector = pipeline.named_steps['feature_selection']
    classifier = pipeline.named_steps['classifier']

    feature_names = vectorizer.get_feature_names_out()
    selected_mask = selector.get_support()
    selected_features = feature_names[selected_mask]

    print("Pipeline Summary:")
    print(f"  Original features: {len(feature_names)}")
    print(f"  Selected features: {len(selected_features)}")

    # Count n-gram orders in selected features
    unigrams = sum(1 for f in selected_features if ' ' not in f)
    bigrams = sum(1 for f in selected_features if ' ' in f)
    print(f"  Unigrams selected: {unigrams}")
    print(f"  Bigrams selected: {bigrams}")

    # Top features by chi-square score
    scores = selector.scores_[selected_mask]
    top_indices = np.argsort(scores)[-20:]

    print("\nTop 20 Features by Chi-Square Score:")
    for idx in reversed(top_indices):
        print(f"  {selected_features[idx]}: {scores[idx]:.2f}")

    return pipeline


def memory_estimation():
    """
    Estimate memory requirements for n-gram vocabularies.
    """
""" print("\n" + "="*60) print("MEMORY ESTIMATION") print("="*60) # Typical vocabulary sizes and estimated memory configurations = [ {'name': 'Unigrams only', 'vocab_size': 50000}, {'name': 'Uni+Bigrams', 'vocab_size': 200000}, {'name': 'Uni+Bi+Trigrams', 'vocab_size': 500000}, {'name': 'Up to 4-grams', 'vocab_size': 1000000}, ] for config in configurations: vocab_size = config['vocab_size'] # Estimate: average feature is ~15 characters avg_feature_len = 15 # String storage string_bytes = vocab_size * avg_feature_len # Vocabulary dict (str → int mapping) # Python dict overhead ~50 bytes per entry dict_bytes = vocab_size * 50 # Sparse matrix for 10K documents # Assume 0.1% density num_docs = 10000 density = 0.001 nnz = int(num_docs * vocab_size * density) sparse_bytes = nnz * 12 # 4 bytes data + 4 bytes col + 4 bytes row total_mb = (string_bytes + dict_bytes + sparse_bytes) / (1024**2) print(f"\n{config['name']} ({vocab_size:,} features):") print(f" Vocabulary storage: {string_bytes/1024**2:.1f} MB") print(f" Dict overhead: {dict_bytes/1024**2:.1f} MB") print(f" Sparse matrix (10K docs): {sparse_bytes/1024**2:.1f} MB") print(f" Total estimate: {total_mb:.1f} MB") if __name__ == "__main__": production_ngram_pipeline() memory_estimation()We've explored the complete n-gram framework, from mathematical foundations to production deployment. The key insights:
Looking Ahead: Character N-grams
Word-level n-grams assume that the word is the right unit of representation. But what about misspellings, morphological variation, or languages with complex word formation? Character n-grams operate below the word level, capturing sub-word patterns that word-level n-grams miss.
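As a quick preview, a minimal sketch of character 3-grams for a single made-up word:

```python
# Character trigrams of one word (sub-word features, covered on the next page)
word = "language"
char_trigrams = [word[i:i + 3] for i in range(len(word) - 2)]
print(char_trigrams)  # ['lan', 'ang', 'ngu', 'gua', 'uag', 'age']
```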
In the next page, we'll explore character n-grams as a complementary approach—one that handles typos gracefully, captures morphological patterns, and excels for language identification and authorship attribution.
You now understand n-grams as a unified framework for sequential text features. You can analyze vocabulary growth, select optimal n-gram orders for specific tasks, implement efficient multi-order extraction, and deploy n-gram features in production systems. Next, we'll explore character-level n-grams for sub-word representations.