Machine LearningText Feature Engineering - N-grams

Text Feature Engineering: N-grams

LevelIntermediate

Duration75 mins

TopicText Feature Engineering - N-grams

3 / 5

N-grams: The General Framework for Sequential Text Features

Generalizing Sequential Context

Having explored unigrams (n=1) and bigrams (n=2), we now confront the general question: How much sequential context should we capture?

Should we use trigrams to distinguish "not very good" from "very not good"? Should 4-grams help capture "New York Stock Exchange"? At what point do we hit diminishing returns?

The answer isn't purely theoretical—it depends on your task, data volume, computational budget, and the linguistic phenomena you need to capture. This page provides the framework to make these decisions rigorously, examining n-grams as a unified mathematical framework and developing practical intuition for n-gram order selection.

What You Will Learn

By the end of this page, you will understand the mathematical generalization of n-grams, analyze the exponential vocabulary growth with n, develop intuition for optimal n-gram order selection, implement efficient multi-order n-gram extraction, and understand how n-gram order relates to the bias-variance tradeoff in language modeling.

Formal Definition of N-grams

An n-gram is a contiguous sequence of n tokens from a given text. The term generalizes unigrams (n=1), bigrams (n=2), trigrams (n=3), and extends to arbitrary sequence lengths.

Formal Definition:

Given a token sequence T = [t₁, t₂, ..., tₖ] of length k, the set of n-grams Gₙ(T) is:

Gₙ(T) = {(tᵢ, tᵢ₊₁, ..., tᵢ₊ₙ₋₁) | 1 ≤ i ≤ k - n + 1}

Key properties:

A sequence of k tokens yields exactly k - n + 1 n-grams
When n > k, the sequence yields 0 n-grams
N-grams are ordered tuples—position matters

Naming Convention:

n	Name	Example ("the quick brown fox")
1	Unigram	"the", "quick", "brown", "fox"
2	Bigram	"the quick", "quick brown", "brown fox"
3	Trigram	"the quick brown", "quick brown fox"
4	4-gram (Quadgram)	"the quick brown fox"
5	5-gram (Pentagram)	— (sequence too short)

Beyond trigrams, numeric naming (4-gram, 5-gram) is standard.

The Markov Perspective

N-grams of order n correspond to (n-1)-order Markov models in language modeling. An n-gram language model estimates P(wₙ | w₁, ..., wₙ₋₁), assuming that the probability of the next word depends only on the previous n-1 words. Trigrams → 2nd-order Markov, bigrams → 1st-order Markov, unigrams → 0th-order (independent).

Multi-Order N-gram Ranges:

In practice, we often extract n-grams across a range of orders. The notation (min_n, max_n) indicates that we extract all n-grams from min_n to max_n inclusive.

(1, 1): Unigrams only
(2, 2): Bigrams only
(1, 2): Unigrams + Bigrams
(1, 3): Unigrams + Bigrams + Trigrams
(2, 4): Bigrams + Trigrams + 4-grams (no unigrams)

The combined feature set is the union of all n-gram sets:

Features = G₁(T) ∪ G₂(T) ∪ ... ∪ Gₘₐₓ(T)

Scikit-learn's ngram_range parameter uses exactly this notation.

Vocabulary Growth Analysis

Understanding how vocabulary size scales with n-gram order is essential for system design. The growth is bounded by combinatorics but constrained by linguistic patterns.

Theoretical Upper Bounds:

For a unigram vocabulary of size V:

N-gram Order	Maximum Possible	With V = 10,000
1 (unigrams)	V	10,000
2 (bigrams)	V²	100,000,000
3 (trigrams)	V³	1,000,000,000,000
4 (4-grams)	V⁴	10¹⁶

These astronomical numbers suggest n-grams are impractical. Fortunately, natural language is highly constrained.

Empirical Reality:

In real corpora, observed n-gram counts follow approximately:

|Observed Vₙ| ≈ k × |V₁|^α

Where α ≈ 1.3-1.5 (subexponential but superlinear). The constraints include:

Syntactic rules: Not all word sequences are grammatical
Semantic constraints: Meaningful combinations are limited
Corpus size: Rare n-grams never observed
Zipf's Law: Most n-grams appear only once

Observed N-gram Vocabulary (Typical English Corpus)
N-gram Order	Typical Vocabulary Size	% of Theoretical Max
1 (unigrams)	50,000 - 200,000	100%
2 (bigrams)	500,000 - 2,000,000	0.05% - 0.2%
3 (trigrams)	2,000,000 - 10,000,000	< 0.001%
4 (4-grams)	5,000,000 - 50,000,000	< 0.0000001%
5 (5-grams)	10,000,000 - 100,000,000	< 10⁻¹²%

Hapax Legomena (Singletons):

As n increases, the proportion of n-grams appearing exactly once grows dramatically:

Unigrams: ~40-50% are hapax
Bigrams: ~60-70% are hapax
Trigrams: ~80-90% are hapax
4-grams: ~90-95% are hapax

This has critical implications:

Most n-grams provide no statistical power (appear once)
Frequency-based filtering becomes essential
Smoothing techniques cannot save most n-grams
Generalization to unseen text becomes increasingly difficult

The Curse of Sparsity

As n increases, the probability that a specific n-gram in your test data was never seen during training approaches 100%. This is why language models use sophisticated smoothing and why very high-order n-grams are rarely useful for feature engineering—they simply don't generalize.

Efficient N-gram Implementation

Efficient n-gram extraction requires attention to memory usage and computation patterns, especially for higher orders.

ngram_implementation.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
from typing import List, Dict, Iterator, Tuple
from collections import Counter
import numpy as np
from scipy.sparse import csr_matrix
 
class NgramExtractor:
    """
    Comprehensive n-gram extraction with multiple strategies.
    """
    
    @staticmethod
    def extract_ngrams(
        tokens: List[str],
        n: int,
        separator: str = "_"
    ) -> List[str]:
        """
        Extract all n-grams of a specific order.
        
        Time Complexity: O(k * n) where k = len(tokens)
        Space Complexity: O(k - n + 1) for output
        """
        if len(tokens) < n:
            return []
        
        return [
            separator.join(tokens[i:i+n])
            for i in range(len(tokens) - n + 1)
        ]
    
    @staticmethod
    def extract_ngram_range(
        tokens: List[str],
        min_n: int,
        max_n: int,
        separator: str = "_"
    ) -> List[str]:
        """
        Extract n-grams for a range of orders [min_n, max_n].
        
        Time Complexity: O(k * (max_n - min_n + 1) * max_n)
        """
        ngrams = []
        
        for n in range(min_n, max_n + 1):
            ngrams.extend(
                NgramExtractor.extract_ngrams(tokens, n, separator)
            )
        
        return ngrams
    
    @staticmethod
    def extract_ngrams_generator(
        tokens: List[str],
        n: int,
        separator: str = "_"
    ) -> Iterator[str]:
        """
        Memory-efficient generator version for large documents.
        
        Yields n-grams one at a time without building full list.
        """
        k = len(tokens)
        
        if k < n:
            return
        
        for i in range(k - n + 1):
            yield separator.join(tokens[i:i+n])
    
    @staticmethod
    def extract_with_positions(
        tokens: List[str],
        n: int
    ) -> List[Tuple[Tuple[str, ...], int]]:
        """
        Extract n-grams with their starting positions.
        
        Useful for analysis and document annotation.
        """
        return [
            (tuple(tokens[i:i+n]), i)
            for i in range(len(tokens) - n + 1)
        ]
 
 
class NgramVectorizer:
    """
    Full n-gram vectorizer with vocabulary management.
    """
    
    def __init__(
        self,
        min_n: int = 1,
        max_n: int = 2,
        max_vocab_size: int = None,
        min_df: int = 1,
        max_df_ratio: float = 1.0,
        separator: str = "_"
    ):
        self.min_n = min_n
        self.max_n = max_n
        self.max_vocab_size = max_vocab_size
        self.min_df = min_df
        self.max_df_ratio = max_df_ratio
        self.separator = separator
        
        self.vocabulary_: Dict[str, int] = {}
        self.term_freq_: Counter = Counter()
        self.doc_freq_: Counter = Counter()
        self.ngram_order_: Dict[str, int] = {}
        
    def _tokenize(self, text: str) -> List[str]:
        """Simple whitespace tokenization."""
        return text.lower().split()
    
    def _extract_features(self, tokens: List[str]) -> List[str]:
        """Extract all n-grams in configured range."""
        return NgramExtractor.extract_ngram_range(
            tokens, self.min_n, self.max_n, self.separator
        )
    
    def fit(self, documents: List[str]) -> 'NgramVectorizer':
        """
        Build vocabulary from corpus.
        """
        num_docs = len(documents)
        
        for doc in documents:
            tokens = self._tokenize(doc)
            features = self._extract_features(tokens)
            
            # Count frequencies
            self.term_freq_.update(features)
            self.doc_freq_.update(set(features))
        
        # Determine n-gram order for each feature
        for feature in self.term_freq_:
            order = feature.count(self.separator) + 1
            self.ngram_order_[feature] = order
        
        # Filter vocabulary
        max_df = int(num_docs * self.max_df_ratio)
        
        candidates = [
            term for term, freq in self.doc_freq_.items()
            if freq >= self.min_df and freq <= max_df
        ]
        
        # Sort by frequency
        sorted_terms = sorted(
            candidates,
            key=lambda t: self.term_freq_[t],
            reverse=True
        )
        
        # Limit size
        if self.max_vocab_size:
            sorted_terms = sorted_terms[:self.max_vocab_size]
        
        self.vocabulary_ = {
            term: idx for idx, term in enumerate(sorted_terms)
        }
        
        return self
    
    def transform(self, documents: List[str]) -> csr_matrix:
        """Transform documents to sparse feature matrix."""
        rows, cols, data = [], [], []
        
        for doc_idx, doc in enumerate(documents):
            tokens = self._tokenize(doc)
            features = self._extract_features(tokens)
            
            feature_counts = Counter(features)
            
            for feature, count in feature_counts.items():
                if feature in self.vocabulary_:
                    rows.append(doc_idx)
                    cols.append(self.vocabulary_[feature])
                    data.append(count)
        
        return csr_matrix(
            (data, (rows, cols)),
            shape=(len(documents), len(self.vocabulary_))
        )
    
    def analyze_vocabulary_composition(self) -> Dict:
        """
        Analyze vocabulary by n-gram order.
        """
        order_counts = Counter()
        order_total_freq = Counter()
        
        for term in self.vocabulary_:
            order = self.ngram_order_[term]
            order_counts[order] += 1
            order_total_freq[order] += self.term_freq_[term]
        
        total_vocab = len(self.vocabulary_)
        
        analysis = {
            'total_vocabulary': total_vocab,
            'by_order': {}
        }
        
        for order in sorted(order_counts.keys()):
            analysis['by_order'][order] = {
                'count': order_counts[order],
                'ratio': order_counts[order] / total_vocab,
                'total_freq': order_total_freq[order],
            }
        
        return analysis
 
 
def demonstrate_ngram_extraction():
    """
    Demonstrate n-gram extraction across different orders.
    """
    text = "the quick brown fox jumps over the lazy dog"
    tokens = text.split()
    
    print("="*60)
    print("N-GRAM EXTRACTION DEMONSTRATION")
    print("="*60)
    print(f"
Text: '{text}'")
    print(f"Tokens: {tokens}")
    print(f"Token count: {len(tokens)}")
    
    for n in range(1, 6):
        ngrams = NgramExtractor.extract_ngrams(tokens, n, " ")
        print(f"
{n}-grams ({len(ngrams)} total):")
        for ng in ngrams:
            print(f"  {ng}")
 
 
def analyze_vocabulary_growth():
    """
    Analyze vocabulary growth with n-gram order.
    """
    print("
" + "="*60)
    print("VOCABULARY GROWTH ANALYSIS")
    print("="*60)
    
    # Sample corpus
    corpus = [
        "machine learning is transforming how we build software applications",
        "deep learning uses neural networks with multiple hidden layers",
        "natural language processing enables text understanding and generation",
        "computer vision systems can analyze and interpret visual data",
        "reinforcement learning agents learn optimal policies from rewards",
        "supervised learning requires labeled training data examples",
        "unsupervised learning discovers patterns without labels",
        "transfer learning leverages knowledge from related tasks",
    ] * 50  # Larger corpus for better statistics
    
    results = []
    
    for max_n in range(1, 6):
        vectorizer = NgramVectorizer(
            min_n=1,
            max_n=max_n,
            min_df=2,  # Appear in at least 2 docs
        )
        vectorizer.fit(corpus)
        
        analysis = vectorizer.analyze_vocabulary_composition()
        
        results.append({
            'max_n': max_n,
            'total_vocab': analysis['total_vocabulary'],
            'by_order': analysis['by_order']
        })
        
        print(f"
Up to {max_n}-grams:")
        print(f"  Total vocabulary: {analysis['total_vocabulary']}")
        
        for order, stats in analysis['by_order'].items():
            print(f"  {order}-grams: {stats['count']} ({stats['ratio']:.1%})")
    
    # Calculate growth rates
    print("
Vocabulary Growth Rates:")
    for i in range(1, len(results)):
        prev = results[i-1]['total_vocab']
        curr = results[i]['total_vocab']
        
        if prev > 0:
            growth = (curr - prev) / prev
            abs_growth = curr - prev
            print(f"  {i}-grams → {i+1}-grams: +{abs_growth} features (+{growth:.1%})")
 
 
if __name__ == "__main__":
    demonstrate_ngram_extraction()
    analyze_vocabulary_growth()

Optimal N-gram Order Selection

Selecting the optimal n-gram range is a classic bias-variance tradeoff. Higher orders capture more context (lower bias) but generalize worse (higher variance). The optimal choice depends on task characteristics, data volume, and computational constraints.

Task-Dependent Guidelines:

Recommended N-gram Ranges by Task
Task	Recommended Range	Rationale
Topic Classification	(1, 1) or (1, 2)	Topics defined by word presence, not order
Sentiment Analysis	(1, 2) or (1, 3)	Negation and intensifiers require bigrams/trigrams
Authorship Attribution	(2, 5) or character n-grams	Writing style captured by phrase patterns
Spam Detection	(1, 2)	Spam keywords + common spam phrases
Language Identification	Character (3, 5)	Character patterns distinguish languages
Named Entity Recognition	(1, 3)	Multi-word entity names
Machine Translation	(1, 4) with smoothing	Phrase-level translation patterns

Empirical Order Selection:

When task guidelines are unclear, use cross-validation to select n-gram order:

Train models with different n-gram ranges: (1,1), (1,2), (1,3), (2,2), (2,3)
Evaluate on held-out validation set
Plot performance vs. vocabulary size
Select simplest model within acceptable performance range

Common Findings:

(1, 2) outperforms (1, 1) for most classification tasks
(1, 3) rarely outperforms (1, 2) significantly
(1, 4) and beyond almost never help for classification
Trigrams help most when phrases are discriminative (e.g., "not very good")
Character n-grams often outperform word n-grams for short texts

When Higher N Helps

•Large training data: Enough examples for rare n-grams
•Distinctive phrases: Task-specific collocations exist
•Negation patterns: "not at all good"
•Named entities: Multi-word names common
•Authorship/style: Phrase-level stylistic markers
•Specific domains: Technical/legal terminology

When Higher N Hurts

•Limited training data: Rare n-grams overfit
•Diverse vocabulary: Too many unique phrases
•Noisy text: Typos break n-gram continuity
•Short documents: Few n-grams per document
•Memory constraints: Vocabulary too large
•Latency requirements: Feature extraction too slow

The 80/20 Rule of N-grams

In practice, (1, 2) n-grams capture roughly 80% of the benefit with 20% of the complexity compared to higher orders. Start with unigrams + bigrams, and only add trigrams if validation performance meaningfully improves. The computational and memory costs of trigrams rarely justify marginal accuracy gains.

N-grams in Language Modeling

Beyond feature engineering, n-grams form the foundation of n-gram language models—probabilistic models that predict the next word given previous context. Understanding this connection deepens appreciation for n-gram theory.

N-gram Language Model:

An n-gram language model estimates the probability of a text sequence using the chain rule with Markov approximation:

P(w₁, w₂, ..., wₖ) = ∏ᵢ P(wᵢ | w₁, ..., wᵢ₋₁)

With (n-1)-order Markov assumption:

P(wᵢ | w₁, ..., wᵢ₋₁) ≈ P(wᵢ | wᵢ₋ₙ₊₁, ..., wᵢ₋₁)

This approximation makes computation tractable while capturing local context.

Probability Estimation:

Maximum Likelihood Estimation (MLE):

P(wₙ | wₙ₋ₙ₊₁, ..., wₙ₋₁) = count(wₙ₋ₙ₊₁, ..., wₙ) / count(wₙ₋ₙ₊₁, ..., wₙ₋₁)

Perplexity:

Language model quality is measured by perplexity—the inverse probability of the test set, normalized by sequence length:

PP(W) = P(w₁, w₂, ..., wₙ)^(-1/N)

Lower perplexity = better model. Typical values:

Trigram models: PP ≈ 70-150 on general English
Bigram models: PP ≈ 150-300
Unigram models: PP ≈ 500-1000+

ngram_language_model.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
from collections import Counter, defaultdict
from typing import List, Dict, Tuple
import math
import random
 
class NgramLanguageModel:
    """
    N-gram language model with smoothing.
    
    Demonstrates the connection between n-gram features
    and probabilistic language modeling.
    """
    
    def __init__(self, n: int = 3, smoothing: str = 'laplace'):
        self.n = n
        self.smoothing = smoothing
        
        # Context → next word → count
        self.ngram_counts: Dict[Tuple, Counter] = defaultdict(Counter)
        # Context → total count
        self.context_counts: Counter = Counter()
        # Vocabulary
        self.vocabulary: set = set()
        
        # Special tokens
        self.START = '<s>'
        self.END = '</s>'
        
    def _pad_sequence(self, tokens: List[str]) -> List[str]:
        """Add start/end tokens for context at boundaries."""
        padding = [self.START] * (self.n - 1)
        return padding + tokens + [self.END]
    
    def _get_ngrams(self, tokens: List[str]) -> List[Tuple[Tuple, str]]:
        """Extract (context, next_word) pairs."""
        padded = self._pad_sequence(tokens)
        ngrams = []
        
        for i in range(len(padded) - self.n + 1):
            context = tuple(padded[i:i+self.n-1])
            next_word = padded[i+self.n-1]
            ngrams.append((context, next_word))
        
        return ngrams
    
    def fit(self, documents: List[List[str]]) -> 'NgramLanguageModel':
        """
        Train language model on tokenized documents.
        """
        for tokens in documents:
            self.vocabulary.update(tokens)
            
            for context, next_word in self._get_ngrams(tokens):
                self.ngram_counts[context][next_word] += 1
                self.context_counts[context] += 1
        
        self.vocabulary.add(self.END)
        
        return self
    
    def probability(self, word: str, context: Tuple[str, ...]) -> float:
        """
        Compute P(word | context) with smoothing.
        """
        V = len(self.vocabulary)
        
        if self.smoothing == 'laplace':
            # Add-1 (Laplace) smoothing
            count = self.ngram_counts[context][word]
            context_count = self.context_counts[context]
            
            return (count + 1) / (context_count + V)
        
        elif self.smoothing == 'mle':
            # Maximum likelihood (no smoothing)
            if self.context_counts[context] == 0:
                return 1 / V  # Fallback to uniform
            
            return self.ngram_counts[context][word] / self.context_counts[context]
        
        else:
            raise ValueError(f"Unknown smoothing: {self.smoothing}")
    
    def log_probability(self, tokens: List[str]) -> float:
        """
        Compute log probability of a token sequence.
        """
        log_prob = 0.0
        
        for context, next_word in self._get_ngrams(tokens):
            prob = self.probability(next_word, context)
            
            if prob <= 0:
                return float('-inf')
            
            log_prob += math.log2(prob)
        
        return log_prob
    
    def perplexity(self, test_documents: List[List[str]]) -> float:
        """
        Compute perplexity on test documents.
        
        PP = 2^(-1/N * sum(log2 P(w_i | context)))
        """
        total_log_prob = 0.0
        total_tokens = 0
        
        for tokens in test_documents:
            total_log_prob += self.log_probability(tokens)
            total_tokens += len(tokens) + 1  # +1 for </s>
        
        if total_tokens == 0:
            return float('inf')
        
        cross_entropy = -total_log_prob / total_tokens
        perplexity = 2 ** cross_entropy
        
        return perplexity
    
    def generate(self, max_length: int = 20, temperature: float = 1.0) -> List[str]:
        """
        Generate text using the language model.
        """
        context = tuple([self.START] * (self.n - 1))
        generated = []
        
        for _ in range(max_length):
            # Get distribution for next word
            next_word_counts = self.ngram_counts[context]
            
            if not next_word_counts:
                break
            
            # Sample from distribution
            words = list(next_word_counts.keys())
            counts = [next_word_counts[w] for w in words]
            
            # Apply temperature
            if temperature != 1.0:
                counts = [c ** (1/temperature) for c in counts]
            
            total = sum(counts)
            probs = [c / total for c in counts]
            
            next_word = random.choices(words, weights=probs)[0]
            
            if next_word == self.END:
                break
            
            generated.append(next_word)
            context = tuple(list(context)[1:] + [next_word])
        
        return generated
 
 
def demonstrate_language_model():
    """
    Demonstrate n-gram language model.
    """
    print("="*60)
    print("N-GRAM LANGUAGE MODEL DEMONSTRATION")
    print("="*60)
    
    # Training corpus (tokenized)
    training_data = [
        ["the", "quick", "brown", "fox", "jumps"],
        ["the", "quick", "brown", "dog", "runs"],
        ["the", "lazy", "brown", "fox", "sleeps"],
        ["a", "quick", "brown", "cat", "jumps"],
        ["the", "brown", "fox", "is", "quick"],
    ]
    
    # Test data
    test_data = [
        ["the", "quick", "brown", "fox"],
        ["the", "lazy", "brown", "cat"],
    ]
    
    # Compare different n values
    for n in [1, 2, 3]:
        print(f"
{n}-gram Language Model:")
        
        model = NgramLanguageModel(n=n, smoothing='laplace')
        model.fit(training_data)
        
        pp = model.perplexity(test_data)
        print(f"  Perplexity: {pp:.2f}")
        
        # Generate sample text
        for _ in range(3):
            generated = model.generate(max_length=10)
            print(f"  Generated: {' '.join(generated)}")
 
 
if __name__ == "__main__":
    demonstrate_language_model()

Practical Considerations

Deploying n-gram features in production requires attention to several practical factors beyond pure algorithmic considerations.

Production Deployment Checklist

•Memory Budget: Higher n = exponentially larger vocabulary. Plan storage accordingly. A 500K vocabulary requires ~4MB just for string storage, plus indices.
•Vocabulary Versioning: N-gram vocabularies must match exactly between training and inference. Version and hash your vocabulary files.
•Fallback Strategy: Decide how to handle n-grams seen at inference but not training. Options: ignore, use lower-order backoff, or feature hashing.
•Tokenization Consistency: The same tokenizer must be used everywhere. Different tokenizers produce incompatible n-grams.
•Sentence Boundaries: Decide whether to extract n-grams across sentence boundaries. Usually no—add sentence tokenization first.
•Case Sensitivity: Most applications use lowercase. If case matters (e.g., named entities), preserve it.
•Punctuation Handling: Include punctuation in n-grams? "don't" vs "don ' t"? Depends on task.

production_ngrams.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
import numpy as np
 
def production_ngram_pipeline():
    """
    Production-ready n-gram pipeline with best practices.
    """
    
    # Sample data
    texts = [
        "This product is absolutely wonderful and amazing",
        "I really loved this excellent high quality item",
        "Terrible waste of money complete garbage",
        "The worst purchase I have ever made awful",
        "Outstanding quality highly recommend to everyone",
        "Disappointing experience would not buy again",
    ] * 10
    
    labels = [1, 1, 0, 0, 1, 0] * 10
    
    # Production pipeline with feature selection
    pipeline = Pipeline([
        ('vectorizer', TfidfVectorizer(
            ngram_range=(1, 2),
            min_df=2,
            max_df=0.95,
            max_features=10000,  # Hard limit
            lowercase=True,
            strip_accents='unicode',
            sublinear_tf=True,  # Use 1 + log(tf)
            stop_words='english',
        )),
        # Feature selection to reduce dimensionality
        ('feature_selection', SelectKBest(chi2, k=1000)),
        ('classifier', LogisticRegression(
            max_iter=1000,
            random_state=42,
            C=1.0,
        )),
    ])
    
    pipeline.fit(texts, labels)
    
    # Analyze selected features
    vectorizer = pipeline.named_steps['vectorizer']
    selector = pipeline.named_steps['feature_selection']
    classifier = pipeline.named_steps['classifier']
    
    feature_names = vectorizer.get_feature_names_out()
    selected_mask = selector.get_support()
    selected_features = feature_names[selected_mask]
    
    print("Pipeline Summary:")
    print(f"  Original features: {len(feature_names)}")
    print(f"  Selected features: {len(selected_features)}")
    
    # Count n-gram orders in selected features
    unigrams = sum(1 for f in selected_features if ' ' not in f)
    bigrams = sum(1 for f in selected_features if ' ' in f)
    
    print(f"  Unigrams selected: {unigrams}")
    print(f"  Bigrams selected: {bigrams}")
    
    # Top features by chi-square score
    scores = selector.scores_[selected_mask]
    top_indices = np.argsort(scores)[-20:]
    
    print("
Top 20 Features by Chi-Square Score:")
    for idx in reversed(top_indices):
        print(f"  {selected_features[idx]}: {scores[idx]:.2f}")
    
    return pipeline
 
 
def memory_estimation():
    """
    Estimate memory requirements for n-gram vocabularies.
    """
    print("
" + "="*60)
    print("MEMORY ESTIMATION")
    print("="*60)
    
    # Typical vocabulary sizes and estimated memory
    configurations = [
        {'name': 'Unigrams only', 'vocab_size': 50000},
        {'name': 'Uni+Bigrams', 'vocab_size': 200000},
        {'name': 'Uni+Bi+Trigrams', 'vocab_size': 500000},
        {'name': 'Up to 4-grams', 'vocab_size': 1000000},
    ]
    
    for config in configurations:
        vocab_size = config['vocab_size']
        
        # Estimate: average feature is ~15 characters
        avg_feature_len = 15
        
        # String storage
        string_bytes = vocab_size * avg_feature_len
        
        # Vocabulary dict (str → int mapping)
        # Python dict overhead ~50 bytes per entry
        dict_bytes = vocab_size * 50
        
        # Sparse matrix for 10K documents
        # Assume 0.1% density
        num_docs = 10000
        density = 0.001
        nnz = int(num_docs * vocab_size * density)
        sparse_bytes = nnz * 12  # 4 bytes data + 4 bytes col + 4 bytes row
        
        total_mb = (string_bytes + dict_bytes + sparse_bytes) / (1024**2)
        
        print(f"
{config['name']} ({vocab_size:,} features):")
        print(f"  Vocabulary storage: {string_bytes/1024**2:.1f} MB")
        print(f"  Dict overhead: {dict_bytes/1024**2:.1f} MB")
        print(f"  Sparse matrix (10K docs): {sparse_bytes/1024**2:.1f} MB")
        print(f"  Total estimate: {total_mb:.1f} MB")
 
 
if __name__ == "__main__":
    production_ngram_pipeline()
    memory_estimation()

Summary: The N-gram Framework

We've explored the complete n-gram framework, from mathematical foundations to production deployment. The key insights:

Key Takeaways

•N-grams generalize sequential context: Unigrams, bigrams, and trigrams are special cases of the n-gram framework, each capturing n consecutive tokens.
•Vocabulary grows sub-exponentially: While theoretical bounds are V^n, actual growth is closer to V^1.4 due to linguistic constraints.
•Diminishing returns after bigrams: For most classification tasks, (1,2) n-grams capture nearly all the benefit with manageable complexity.
•Language models connect to features: N-gram features derive from the same principles as n-gram language models—the Markov assumption.
•Production requires discipline: Memory budgets, vocabulary versioning, and consistent tokenization are essential for reliable systems.
•Task determines optimal n: Sentiment needs bigrams for negation; topic classification may not. Always validate empirically.

Looking Ahead: Character N-grams

Word-level n-grams assume useful text representations at the word level. But what about misspellings, morphological variations, or languages with complex word formation? Character n-grams operate below the word level, capturing sub-word patterns that word n-grams miss.

In the next page, we'll explore character n-grams as a complementary approach—one that handles typos gracefully, captures morphological patterns, and excels for language identification and authorship attribution.

Page Complete

You now understand n-grams as a unified framework for sequential text features. You can analyze vocabulary growth, select optimal n-gram orders for specific tasks, implement efficient multi-order extraction, and deploy n-gram features in production systems. Next, we'll explore character-level n-grams for sub-word representations.

3 / 5

Loading learning content...

Machine LearningText Feature Engineering - N-grams

Text Feature Engineering: N-grams

LevelIntermediate

Duration75 mins

TopicText Feature Engineering - N-grams

3 / 5

N-grams: The General Framework for Sequential Text Features

Generalizing Sequential Context

Having explored unigrams (n=1) and bigrams (n=2), we now confront the general question: How much sequential context should we capture?

Should we use trigrams to distinguish "not very good" from "very not good"? Should 4-grams help capture "New York Stock Exchange"? At what point do we hit diminishing returns?

What You Will Learn

Formal Definition of N-grams

An n-gram is a contiguous sequence of n tokens from a given text. The term generalizes unigrams (n=1), bigrams (n=2), trigrams (n=3), and extends to arbitrary sequence lengths.

Formal Definition:

Given a token sequence T = [t₁, t₂, ..., tₖ] of length k, the set of n-grams Gₙ(T) is:

Gₙ(T) = {(tᵢ, tᵢ₊₁, ..., tᵢ₊ₙ₋₁) | 1 ≤ i ≤ k - n + 1}

Key properties:

A sequence of k tokens yields exactly k - n + 1 n-grams
When n > k, the sequence yields 0 n-grams
N-grams are ordered tuples—position matters

Naming Convention:

n	Name	Example ("the quick brown fox")
1	Unigram	"the", "quick", "brown", "fox"
2	Bigram	"the quick", "quick brown", "brown fox"
3	Trigram	"the quick brown", "quick brown fox"
4	4-gram (Quadgram)	"the quick brown fox"
5	5-gram (Pentagram)	— (sequence too short)

Beyond trigrams, numeric naming (4-gram, 5-gram) is standard.

The Markov Perspective

Multi-Order N-gram Ranges:

In practice, we often extract n-grams across a range of orders. The notation (min_n, max_n) indicates that we extract all n-grams from min_n to max_n inclusive.

(1, 1): Unigrams only
(2, 2): Bigrams only
(1, 2): Unigrams + Bigrams
(1, 3): Unigrams + Bigrams + Trigrams
(2, 4): Bigrams + Trigrams + 4-grams (no unigrams)

The combined feature set is the union of all n-gram sets:

Features = G₁(T) ∪ G₂(T) ∪ ... ∪ Gₘₐₓ(T)

Scikit-learn's ngram_range parameter uses exactly this notation.

Vocabulary Growth Analysis

Understanding how vocabulary size scales with n-gram order is essential for system design. The growth is bounded by combinatorics but constrained by linguistic patterns.

Theoretical Upper Bounds:

For a unigram vocabulary of size V:

N-gram Order	Maximum Possible	With V = 10,000
1 (unigrams)	V	10,000
2 (bigrams)	V²	100,000,000
3 (trigrams)	V³	1,000,000,000,000
4 (4-grams)	V⁴	10¹⁶

These astronomical numbers suggest n-grams are impractical. Fortunately, natural language is highly constrained.

Empirical Reality:

In real corpora, observed n-gram counts follow approximately:

|Observed Vₙ| ≈ k × |V₁|^α

Where α ≈ 1.3-1.5 (subexponential but superlinear). The constraints include:

Syntactic rules: Not all word sequences are grammatical
Semantic constraints: Meaningful combinations are limited
Corpus size: Rare n-grams never observed
Zipf's Law: Most n-grams appear only once

Observed N-gram Vocabulary (Typical English Corpus)
N-gram Order	Typical Vocabulary Size	% of Theoretical Max
1 (unigrams)	50,000 - 200,000	100%
2 (bigrams)	500,000 - 2,000,000	0.05% - 0.2%
3 (trigrams)	2,000,000 - 10,000,000	< 0.001%
4 (4-grams)	5,000,000 - 50,000,000	< 0.0000001%
5 (5-grams)	10,000,000 - 100,000,000	< 10⁻¹²%

Hapax Legomena (Singletons):

As n increases, the proportion of n-grams appearing exactly once grows dramatically:

Unigrams: ~40-50% are hapax
Bigrams: ~60-70% are hapax
Trigrams: ~80-90% are hapax
4-grams: ~90-95% are hapax

This has critical implications:

Most n-grams provide no statistical power (appear once)
Frequency-based filtering becomes essential
Smoothing techniques cannot save most n-grams
Generalization to unseen text becomes increasingly difficult

The Curse of Sparsity

Efficient N-gram Implementation

Efficient n-gram extraction requires attention to memory usage and computation patterns, especially for higher orders.

ngram_implementation.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
from typing import List, Dict, Iterator, Tuple
from collections import Counter
import numpy as np
from scipy.sparse import csr_matrix
 
class NgramExtractor:
    """
    Comprehensive n-gram extraction with multiple strategies.
    """
    
    @staticmethod
    def extract_ngrams(
        tokens: List[str],
        n: int,
        separator: str = "_"
    ) -> List[str]:
        """
        Extract all n-grams of a specific order.
        
        Time Complexity: O(k * n) where k = len(tokens)
        Space Complexity: O(k - n + 1) for output
        """
        if len(tokens) < n:
            return []
        
        return [
            separator.join(tokens[i:i+n])
            for i in range(len(tokens) - n + 1)
        ]
    
    @staticmethod
    def extract_ngram_range(
        tokens: List[str],
        min_n: int,
        max_n: int,
        separator: str = "_"
    ) -> List[str]:
        """
        Extract n-grams for a range of orders [min_n, max_n].
        
        Time Complexity: O(k * (max_n - min_n + 1) * max_n)
        """
        ngrams = []
        
        for n in range(min_n, max_n + 1):
            ngrams.extend(
                NgramExtractor.extract_ngrams(tokens, n, separator)
            )
        
        return ngrams
    
    @staticmethod
    def extract_ngrams_generator(
        tokens: List[str],
        n: int,
        separator: str = "_"
    ) -> Iterator[str]:
        """
        Memory-efficient generator version for large documents.
        
        Yields n-grams one at a time without building full list.
        """
        k = len(tokens)
        
        if k < n:
            return
        
        for i in range(k - n + 1):
            yield separator.join(tokens[i:i+n])
    
    @staticmethod
    def extract_with_positions(
        tokens: List[str],
        n: int
    ) -> List[Tuple[Tuple[str, ...], int]]:
        """
        Extract n-grams with their starting positions.
        
        Useful for analysis and document annotation.
        """
        return [
            (tuple(tokens[i:i+n]), i)
            for i in range(len(tokens) - n + 1)
        ]
 
 
class NgramVectorizer:
    """
    Full n-gram vectorizer with vocabulary management.
    """
    
    def __init__(
        self,
        min_n: int = 1,
        max_n: int = 2,
        max_vocab_size: int = None,
        min_df: int = 1,
        max_df_ratio: float = 1.0,
        separator: str = "_"
    ):
        self.min_n = min_n
        self.max_n = max_n
        self.max_vocab_size = max_vocab_size
        self.min_df = min_df
        self.max_df_ratio = max_df_ratio
        self.separator = separator
        
        self.vocabulary_: Dict[str, int] = {}
        self.term_freq_: Counter = Counter()
        self.doc_freq_: Counter = Counter()
        self.ngram_order_: Dict[str, int] = {}
        
    def _tokenize(self, text: str) -> List[str]:
        """Simple whitespace tokenization."""
        return text.lower().split()
    
    def _extract_features(self, tokens: List[str]) -> List[str]:
        """Extract all n-grams in configured range."""
        return NgramExtractor.extract_ngram_range(
            tokens, self.min_n, self.max_n, self.separator
        )
    
    def fit(self, documents: List[str]) -> 'NgramVectorizer':
        """
        Build vocabulary from corpus.
        """
        num_docs = len(documents)
        
        for doc in documents:
            tokens = self._tokenize(doc)
            features = self._extract_features(tokens)
            
            # Count frequencies
            self.term_freq_.update(features)
            self.doc_freq_.update(set(features))
        
        # Determine n-gram order for each feature
        for feature in self.term_freq_:
            order = feature.count(self.separator) + 1
            self.ngram_order_[feature] = order
        
        # Filter vocabulary
        max_df = int(num_docs * self.max_df_ratio)
        
        candidates = [
            term for term, freq in self.doc_freq_.items()
            if freq >= self.min_df and freq <= max_df
        ]
        
        # Sort by frequency
        sorted_terms = sorted(
            candidates,
            key=lambda t: self.term_freq_[t],
            reverse=True
        )
        
        # Limit size
        if self.max_vocab_size:
            sorted_terms = sorted_terms[:self.max_vocab_size]
        
        self.vocabulary_ = {
            term: idx for idx, term in enumerate(sorted_terms)
        }
        
        return self
    
    def transform(self, documents: List[str]) -> csr_matrix:
        """Transform documents to sparse feature matrix."""
        rows, cols, data = [], [], []
        
        for doc_idx, doc in enumerate(documents):
            tokens = self._tokenize(doc)
            features = self._extract_features(tokens)
            
            feature_counts = Counter(features)
            
            for feature, count in feature_counts.items():
                if feature in self.vocabulary_:
                    rows.append(doc_idx)
                    cols.append(self.vocabulary_[feature])
                    data.append(count)
        
        return csr_matrix(
            (data, (rows, cols)),
            shape=(len(documents), len(self.vocabulary_))
        )
    
    def analyze_vocabulary_composition(self) -> Dict:
        """
        Analyze vocabulary by n-gram order.
        """
        order_counts = Counter()
        order_total_freq = Counter()
        
        for term in self.vocabulary_:
            order = self.ngram_order_[term]
            order_counts[order] += 1
            order_total_freq[order] += self.term_freq_[term]
        
        total_vocab = len(self.vocabulary_)
        
        analysis = {
            'total_vocabulary': total_vocab,
            'by_order': {}
        }
        
        for order in sorted(order_counts.keys()):
            analysis['by_order'][order] = {
                'count': order_counts[order],
                'ratio': order_counts[order] / total_vocab,
                'total_freq': order_total_freq[order],
            }
        
        return analysis
 
 
def demonstrate_ngram_extraction():
    """
    Demonstrate n-gram extraction across different orders.
    """
    text = "the quick brown fox jumps over the lazy dog"
    tokens = text.split()
    
    print("="*60)
    print("N-GRAM EXTRACTION DEMONSTRATION")
    print("="*60)
    print(f"
Text: '{text}'")
    print(f"Tokens: {tokens}")
    print(f"Token count: {len(tokens)}")
    
    for n in range(1, 6):
        ngrams = NgramExtractor.extract_ngrams(tokens, n, " ")
        print(f"
{n}-grams ({len(ngrams)} total):")
        for ng in ngrams:
            print(f"  {ng}")
 
 
def analyze_vocabulary_growth():
    """
    Analyze vocabulary growth with n-gram order.
    """
    print("
" + "="*60)
    print("VOCABULARY GROWTH ANALYSIS")
    print("="*60)
    
    # Sample corpus
    corpus = [
        "machine learning is transforming how we build software applications",
        "deep learning uses neural networks with multiple hidden layers",
        "natural language processing enables text understanding and generation",
        "computer vision systems can analyze and interpret visual data",
        "reinforcement learning agents learn optimal policies from rewards",
        "supervised learning requires labeled training data examples",
        "unsupervised learning discovers patterns without labels",
        "transfer learning leverages knowledge from related tasks",
    ] * 50  # Larger corpus for better statistics
    
    results = []
    
    for max_n in range(1, 6):
        vectorizer = NgramVectorizer(
            min_n=1,
            max_n=max_n,
            min_df=2,  # Appear in at least 2 docs
        )
        vectorizer.fit(corpus)
        
        analysis = vectorizer.analyze_vocabulary_composition()
        
        results.append({
            'max_n': max_n,
            'total_vocab': analysis['total_vocabulary'],
            'by_order': analysis['by_order']
        })
        
        print(f"
Up to {max_n}-grams:")
        print(f"  Total vocabulary: {analysis['total_vocabulary']}")
        
        for order, stats in analysis['by_order'].items():
            print(f"  {order}-grams: {stats['count']} ({stats['ratio']:.1%})")
    
    # Calculate growth rates
    print("
Vocabulary Growth Rates:")
    for i in range(1, len(results)):
        prev = results[i-1]['total_vocab']
        curr = results[i]['total_vocab']
        
        if prev > 0:
            growth = (curr - prev) / prev
            abs_growth = curr - prev
            print(f"  {i}-grams → {i+1}-grams: +{abs_growth} features (+{growth:.1%})")
 
 
if __name__ == "__main__":
    demonstrate_ngram_extraction()
    analyze_vocabulary_growth()

Optimal N-gram Order Selection

Task-Dependent Guidelines:

Recommended N-gram Ranges by Task
Task	Recommended Range	Rationale
Topic Classification	(1, 1) or (1, 2)	Topics defined by word presence, not order
Sentiment Analysis	(1, 2) or (1, 3)	Negation and intensifiers require bigrams/trigrams
Authorship Attribution	(2, 5) or character n-grams	Writing style captured by phrase patterns
Spam Detection	(1, 2)	Spam keywords + common spam phrases
Language Identification	Character (3, 5)	Character patterns distinguish languages
Named Entity Recognition	(1, 3)	Multi-word entity names
Machine Translation	(1, 4) with smoothing	Phrase-level translation patterns

Empirical Order Selection:

When task guidelines are unclear, use cross-validation to select n-gram order:

Train models with different n-gram ranges: (1,1), (1,2), (1,3), (2,2), (2,3)
Evaluate on held-out validation set
Plot performance vs. vocabulary size
Select simplest model within acceptable performance range

Common Findings:

(1, 2) outperforms (1, 1) for most classification tasks
(1, 3) rarely outperforms (1, 2) significantly
(1, 4) and beyond almost never help for classification
Trigrams help most when phrases are discriminative (e.g., "not very good")
Character n-grams often outperform word n-grams for short texts

When Higher N Helps

•Large training data: Enough examples for rare n-grams
•Distinctive phrases: Task-specific collocations exist
•Negation patterns: "not at all good"
•Named entities: Multi-word names common
•Authorship/style: Phrase-level stylistic markers
•Specific domains: Technical/legal terminology

When Higher N Hurts

•Limited training data: Rare n-grams overfit
•Diverse vocabulary: Too many unique phrases
•Noisy text: Typos break n-gram continuity
•Short documents: Few n-grams per document
•Memory constraints: Vocabulary too large
•Latency requirements: Feature extraction too slow

The 80/20 Rule of N-grams

N-grams in Language Modeling

N-gram Language Model:

An n-gram language model estimates the probability of a text sequence using the chain rule with Markov approximation:

P(w₁, w₂, ..., wₖ) = ∏ᵢ P(wᵢ | w₁, ..., wᵢ₋₁)

With (n-1)-order Markov assumption:

P(wᵢ | w₁, ..., wᵢ₋₁) ≈ P(wᵢ | wᵢ₋ₙ₊₁, ..., wᵢ₋₁)

This approximation makes computation tractable while capturing local context.

Probability Estimation:

Maximum Likelihood Estimation (MLE):

P(wₙ | wₙ₋ₙ₊₁, ..., wₙ₋₁) = count(wₙ₋ₙ₊₁, ..., wₙ) / count(wₙ₋ₙ₊₁, ..., wₙ₋₁)

Perplexity:

Language model quality is measured by perplexity—the inverse probability of the test set, normalized by sequence length:

PP(W) = P(w₁, w₂, ..., wₙ)^(-1/N)

Lower perplexity = better model. Typical values:

Trigram models: PP ≈ 70-150 on general English
Bigram models: PP ≈ 150-300
Unigram models: PP ≈ 500-1000+

ngram_language_model.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
from collections import Counter, defaultdict
from typing import List, Dict, Tuple
import math
import random
 
class NgramLanguageModel:
    """
    N-gram language model with smoothing.
    
    Demonstrates the connection between n-gram features
    and probabilistic language modeling.
    """
    
    def __init__(self, n: int = 3, smoothing: str = 'laplace'):
        self.n = n
        self.smoothing = smoothing
        
        # Context → next word → count
        self.ngram_counts: Dict[Tuple, Counter] = defaultdict(Counter)
        # Context → total count
        self.context_counts: Counter = Counter()
        # Vocabulary
        self.vocabulary: set = set()
        
        # Special tokens
        self.START = '<s>'
        self.END = '</s>'
        
    def _pad_sequence(self, tokens: List[str]) -> List[str]:
        """Add start/end tokens for context at boundaries."""
        padding = [self.START] * (self.n - 1)
        return padding + tokens + [self.END]
    
    def _get_ngrams(self, tokens: List[str]) -> List[Tuple[Tuple, str]]:
        """Extract (context, next_word) pairs."""
        padded = self._pad_sequence(tokens)
        ngrams = []
        
        for i in range(len(padded) - self.n + 1):
            context = tuple(padded[i:i+self.n-1])
            next_word = padded[i+self.n-1]
            ngrams.append((context, next_word))
        
        return ngrams
    
    def fit(self, documents: List[List[str]]) -> 'NgramLanguageModel':
        """
        Train language model on tokenized documents.
        """
        for tokens in documents:
            self.vocabulary.update(tokens)
            
            for context, next_word in self._get_ngrams(tokens):
                self.ngram_counts[context][next_word] += 1
                self.context_counts[context] += 1
        
        self.vocabulary.add(self.END)
        
        return self
    
    def probability(self, word: str, context: Tuple[str, ...]) -> float:
        """
        Compute P(word | context) with smoothing.
        """
        V = len(self.vocabulary)
        
        if self.smoothing == 'laplace':
            # Add-1 (Laplace) smoothing
            count = self.ngram_counts[context][word]
            context_count = self.context_counts[context]
            
            return (count + 1) / (context_count + V)
        
        elif self.smoothing == 'mle':
            # Maximum likelihood (no smoothing)
            if self.context_counts[context] == 0:
                return 1 / V  # Fallback to uniform
            
            return self.ngram_counts[context][word] / self.context_counts[context]
        
        else:
            raise ValueError(f"Unknown smoothing: {self.smoothing}")
    
    def log_probability(self, tokens: List[str]) -> float:
        """
        Compute log probability of a token sequence.
        """
        log_prob = 0.0
        
        for context, next_word in self._get_ngrams(tokens):
            prob = self.probability(next_word, context)
            
            if prob <= 0:
                return float('-inf')
            
            log_prob += math.log2(prob)
        
        return log_prob
    
    def perplexity(self, test_documents: List[List[str]]) -> float:
        """
        Compute perplexity on test documents.
        
        PP = 2^(-1/N * sum(log2 P(w_i | context)))
        """
        total_log_prob = 0.0
        total_tokens = 0
        
        for tokens in test_documents:
            total_log_prob += self.log_probability(tokens)
            total_tokens += len(tokens) + 1  # +1 for </s>
        
        if total_tokens == 0:
            return float('inf')
        
        cross_entropy = -total_log_prob / total_tokens
        perplexity = 2 ** cross_entropy
        
        return perplexity
    
    def generate(self, max_length: int = 20, temperature: float = 1.0) -> List[str]:
        """
        Generate text using the language model.
        """
        context = tuple([self.START] * (self.n - 1))
        generated = []
        
        for _ in range(max_length):
            # Get distribution for next word
            next_word_counts = self.ngram_counts[context]
            
            if not next_word_counts:
                break
            
            # Sample from distribution
            words = list(next_word_counts.keys())
            counts = [next_word_counts[w] for w in words]
            
            # Apply temperature
            if temperature != 1.0:
                counts = [c ** (1/temperature) for c in counts]
            
            total = sum(counts)
            probs = [c / total for c in counts]
            
            next_word = random.choices(words, weights=probs)[0]
            
            if next_word == self.END:
                break
            
            generated.append(next_word)
            context = tuple(list(context)[1:] + [next_word])
        
        return generated
 
 
def demonstrate_language_model():
    """
    Demonstrate n-gram language model.
    """
    print("="*60)
    print("N-GRAM LANGUAGE MODEL DEMONSTRATION")
    print("="*60)
    
    # Training corpus (tokenized)
    training_data = [
        ["the", "quick", "brown", "fox", "jumps"],
        ["the", "quick", "brown", "dog", "runs"],
        ["the", "lazy", "brown", "fox", "sleeps"],
        ["a", "quick", "brown", "cat", "jumps"],
        ["the", "brown", "fox", "is", "quick"],
    ]
    
    # Test data
    test_data = [
        ["the", "quick", "brown", "fox"],
        ["the", "lazy", "brown", "cat"],
    ]
    
    # Compare different n values
    for n in [1, 2, 3]:
        print(f"
{n}-gram Language Model:")
        
        model = NgramLanguageModel(n=n, smoothing='laplace')
        model.fit(training_data)
        
        pp = model.perplexity(test_data)
        print(f"  Perplexity: {pp:.2f}")
        
        # Generate sample text
        for _ in range(3):
            generated = model.generate(max_length=10)
            print(f"  Generated: {' '.join(generated)}")
 
 
if __name__ == "__main__":
    demonstrate_language_model()

Practical Considerations

Deploying n-gram features in production requires attention to several practical factors beyond pure algorithmic considerations.

Production Deployment Checklist

•Memory Budget: Higher n = exponentially larger vocabulary. Plan storage accordingly. A 500K vocabulary requires ~4MB just for string storage, plus indices.
•Vocabulary Versioning: N-gram vocabularies must match exactly between training and inference. Version and hash your vocabulary files.
•Fallback Strategy: Decide how to handle n-grams seen at inference but not training. Options: ignore, use lower-order backoff, or feature hashing.
•Tokenization Consistency: The same tokenizer must be used everywhere. Different tokenizers produce incompatible n-grams.
•Sentence Boundaries: Decide whether to extract n-grams across sentence boundaries. Usually no—add sentence tokenization first.
•Case Sensitivity: Most applications use lowercase. If case matters (e.g., named entities), preserve it.
•Punctuation Handling: Include punctuation in n-grams? "don't" vs "don ' t"? Depends on task.

production_ngrams.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
import numpy as np
 
def production_ngram_pipeline():
    """
    Production-ready n-gram pipeline with best practices.
    """
    
    # Sample data
    texts = [
        "This product is absolutely wonderful and amazing",
        "I really loved this excellent high quality item",
        "Terrible waste of money complete garbage",
        "The worst purchase I have ever made awful",
        "Outstanding quality highly recommend to everyone",
        "Disappointing experience would not buy again",
    ] * 10
    
    labels = [1, 1, 0, 0, 1, 0] * 10
    
    # Production pipeline with feature selection
    pipeline = Pipeline([
        ('vectorizer', TfidfVectorizer(
            ngram_range=(1, 2),
            min_df=2,
            max_df=0.95,
            max_features=10000,  # Hard limit
            lowercase=True,
            strip_accents='unicode',
            sublinear_tf=True,  # Use 1 + log(tf)
            stop_words='english',
        )),
        # Feature selection to reduce dimensionality
        ('feature_selection', SelectKBest(chi2, k=1000)),
        ('classifier', LogisticRegression(
            max_iter=1000,
            random_state=42,
            C=1.0,
        )),
    ])
    
    pipeline.fit(texts, labels)
    
    # Analyze selected features
    vectorizer = pipeline.named_steps['vectorizer']
    selector = pipeline.named_steps['feature_selection']
    classifier = pipeline.named_steps['classifier']
    
    feature_names = vectorizer.get_feature_names_out()
    selected_mask = selector.get_support()
    selected_features = feature_names[selected_mask]
    
    print("Pipeline Summary:")
    print(f"  Original features: {len(feature_names)}")
    print(f"  Selected features: {len(selected_features)}")
    
    # Count n-gram orders in selected features
    unigrams = sum(1 for f in selected_features if ' ' not in f)
    bigrams = sum(1 for f in selected_features if ' ' in f)
    
    print(f"  Unigrams selected: {unigrams}")
    print(f"  Bigrams selected: {bigrams}")
    
    # Top features by chi-square score
    scores = selector.scores_[selected_mask]
    top_indices = np.argsort(scores)[-20:]
    
    print("
Top 20 Features by Chi-Square Score:")
    for idx in reversed(top_indices):
        print(f"  {selected_features[idx]}: {scores[idx]:.2f}")
    
    return pipeline
 
 
def memory_estimation():
    """
    Estimate memory requirements for n-gram vocabularies.
    """
    print("
" + "="*60)
    print("MEMORY ESTIMATION")
    print("="*60)
    
    # Typical vocabulary sizes and estimated memory
    configurations = [
        {'name': 'Unigrams only', 'vocab_size': 50000},
        {'name': 'Uni+Bigrams', 'vocab_size': 200000},
        {'name': 'Uni+Bi+Trigrams', 'vocab_size': 500000},
        {'name': 'Up to 4-grams', 'vocab_size': 1000000},
    ]
    
    for config in configurations:
        vocab_size = config['vocab_size']
        
        # Estimate: average feature is ~15 characters
        avg_feature_len = 15
        
        # String storage
        string_bytes = vocab_size * avg_feature_len
        
        # Vocabulary dict (str → int mapping)
        # Python dict overhead ~50 bytes per entry
        dict_bytes = vocab_size * 50
        
        # Sparse matrix for 10K documents
        # Assume 0.1% density
        num_docs = 10000
        density = 0.001
        nnz = int(num_docs * vocab_size * density)
        sparse_bytes = nnz * 12  # 4 bytes data + 4 bytes col + 4 bytes row
        
        total_mb = (string_bytes + dict_bytes + sparse_bytes) / (1024**2)
        
        print(f"
{config['name']} ({vocab_size:,} features):")
        print(f"  Vocabulary storage: {string_bytes/1024**2:.1f} MB")
        print(f"  Dict overhead: {dict_bytes/1024**2:.1f} MB")
        print(f"  Sparse matrix (10K docs): {sparse_bytes/1024**2:.1f} MB")
        print(f"  Total estimate: {total_mb:.1f} MB")
 
 
if __name__ == "__main__":
    production_ngram_pipeline()
    memory_estimation()

Summary: The N-gram Framework

We've explored the complete n-gram framework, from mathematical foundations to production deployment. The key insights:

Key Takeaways

•N-grams generalize sequential context: Unigrams, bigrams, and trigrams are special cases of the n-gram framework, each capturing n consecutive tokens.
•Vocabulary grows sub-exponentially: While theoretical bounds are V^n, actual growth is closer to V^1.4 due to linguistic constraints.
•Diminishing returns after bigrams: For most classification tasks, (1,2) n-grams capture nearly all the benefit with manageable complexity.
•Language models connect to features: N-gram features derive from the same principles as n-gram language models—the Markov assumption.
•Production requires discipline: Memory budgets, vocabulary versioning, and consistent tokenization are essential for reliable systems.
•Task determines optimal n: Sentiment needs bigrams for negation; topic classification may not. Always validate empirically.

Looking Ahead: Character N-grams

Page Complete

3 / 5