You now understand unigrams, bigrams, n-grams, and character n-grams. You know their mathematics, implementations, and applications. But when faced with a real NLP task, how do you choose?
These decisions involve fundamental tradeoffs that depend on your task, data characteristics, and constraints. This page provides a systematic framework for making these choices, synthesizing everything we've learned into actionable guidance for real-world text feature engineering.
By the end of this page, you will understand the core tradeoffs in n-gram feature engineering, have a decision framework for configuration selection, know how to validate your choices empirically, and be equipped to make principled decisions for any text classification or NLP task.
Every n-gram configuration decision involves balancing competing concerns. Understanding these tradeoffs enables informed choices rather than arbitrary defaults.
Tradeoff 1: Context Capture vs. Generalization
Higher-order n-grams capture more context but generalize worse.
The sweet spot depends on training data volume and task specificity.
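This tradeoff is easy to see empirically: as n grows, the fraction of n-grams observed only once rises, and one-off features cannot generalize. A minimal sketch on a hypothetical toy corpus (the sentences are illustrative only):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the model fits the data well",
    "the model fits the validation data poorly",
    "higher order ngrams capture more context",
]

singleton_fracs = []
for n in (1, 2, 3):
    vec = CountVectorizer(analyzer="word", ngram_range=(n, n))
    X = vec.fit_transform(corpus)
    counts = X.sum(axis=0).A1        # corpus frequency of each n-gram
    frac = (counts == 1).mean()      # share of n-grams seen exactly once
    singleton_fracs.append(frac)
    print(f"{n}-grams: {X.shape[1]} features, {frac:.0%} seen only once")
```

Even on three sentences, the singleton share climbs with n; on real corpora the effect is far more pronounced.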
Tradeoff 2: Vocabulary Size vs. Computational Resources
Larger vocabularies capture more information but cost more.
Tradeoff 3: Precision vs. Recall in Feature Selection
Strict filtering removes noise but may discard signal.
Tradeoff 4: Interpretability vs. Performance
Word n-grams are interpretable; character n-grams often perform better.
| Tradeoff | Low End | High End | Optimal Strategy |
|---|---|---|---|
| Context Capture | Unigrams (no context) | High-order n-grams (full context) | Match to task; (1,2) is often sufficient |
| Vocabulary Size | 100-1K features | 100K+ features | Use validation; diminishing returns >50K |
| Feature Filtering | min_df=1 (keep all) | min_df=10+ (strict) | min_df=2-5 for most tasks |
| Word vs. Character | Pure word n-grams | Pure character n-grams | Combine when robustness matters |
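The "combine when robustness matters" strategy from the last row can be sketched with sklearn's `FeatureUnion`, which concatenates the two feature spaces (the documents below are hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

# Word n-grams carry content; char n-grams survive misspellings/obfuscation
combined = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))),
])

docs = ["free offer click now", "frree offerr clickk noww", "meeting notes attached"]
X = combined.fit_transform(docs)
print("combined feature count:", X.shape[1])
```

The obfuscated second document shares almost no word n-grams with the first, but many character n-grams, which is exactly the robustness the table recommends.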
N-gram configuration can be understood through the lens of the bias-variance tradeoff, fundamental to all machine learning.
Low Bias (Complex Models): fit the training data well, but may overfit.
Low Variance (Simple Models): generalize well, but may underfit complex patterns.
Data Volume Determines Optimal Complexity:
| Training Size | Recommended N-gram Range | Max Vocabulary | Character N-grams? |
|---|---|---|---|
| < 1,000 docs | (1, 1) unigrams only | 1,000 - 5,000 | Only if essential (language ID) |
| 1,000 - 10,000 docs | (1, 2) uni + bigrams | 5,000 - 20,000 | Consider for noisy text |
| 10,000 - 100,000 docs | (1, 2) or (1, 3) | 20,000 - 50,000 | Yes, with feature selection |
| > 100,000 docs | (1, 3) or higher | 50,000 - 100,000+ | Yes, combined features optimal |
Plot training and validation accuracy as you increase training data. If validation lags significantly behind training, you're overfitting—reduce n-gram order or vocabulary size. If both are low and closely tracked, you're underfitting—consider higher-order n-grams or larger vocabulary.
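This diagnostic can be sketched with sklearn's `learning_curve` helper, which refits the pipeline on growing subsets and scores both splits (the corpus below is a hypothetical stand-in for your data):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import Pipeline

# Hypothetical toy data; substitute your own corpus and labels
texts = ["good product works great", "bad product broke fast",
         "excellent quality very happy", "poor quality very unhappy"] * 50
labels = [1, 0, 1, 0] * 50

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])

sizes, train_scores, val_scores = learning_curve(
    pipe, texts, labels, cv=5, train_sizes=np.linspace(0.2, 1.0, 4),
)
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    # A large train/validation gap signals overfitting; two low, close
    # curves signal underfitting
    print(f"n={n:>4}  train={tr:.3f}  val={va:.3f}  gap={tr - va:.3f}")
```

On real data, read the printed gap exactly as described above: shrink the feature space when the gap is wide, and enrich it when both scores plateau low.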
Different NLP tasks have different optimal configurations based on what information is most discriminative.
Topic Classification (e.g., News Categories, Subject Detection):
Sentiment Analysis:
Spam Detection:
Language Identification:
Authorship Attribution:
| Task | Word N-grams | Char N-grams | Stop Words | Special Considerations |
|---|---|---|---|---|
| Topic Classification | (1, 2) | Optional | Remove | TF-IDF important |
| Sentiment Analysis | (1, 2) or (1, 3) | Optional | Keep | Negation handling critical |
| Spam Detection | (1, 2) | (3, 5) | Keep | Obfuscation handling |
| Language ID | None | (2, 5) | N/A | Small training data sufficient |
| Authorship | (1, 2) | (3, 5) | Keep | Style over content |
| Named Entity | (1, 3) | (3, 4) | Keep | Boundary markers help |
| Document Retrieval | (1, 1) | Optional | Remove | TF-IDF ranking |
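One convenient way to operationalize the table is a small preset dictionary of vectorizer settings. A sketch under the assumption that `TASK_CONFIGS` and `make_vectorizer` are names we are inventing here (they are not part of any library):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Starting points mirroring the table above (hypothetical defaults; tune per dataset)
TASK_CONFIGS = {
    "topic":       dict(analyzer="word", ngram_range=(1, 2), stop_words="english"),
    "sentiment":   dict(analyzer="word", ngram_range=(1, 2)),   # keep stop words for negation
    "language_id": dict(analyzer="char_wb", ngram_range=(2, 5)),
    "authorship":  dict(analyzer="char_wb", ngram_range=(3, 5)),
}

def make_vectorizer(task: str, **overrides) -> TfidfVectorizer:
    """Build a vectorizer from a task preset, allowing per-dataset overrides."""
    return TfidfVectorizer(**{**TASK_CONFIGS[task], **overrides})

vec = make_vectorizer("sentiment", min_df=2)
print(vec.analyzer, vec.ngram_range, vec.min_df)
```

Treat each preset as a starting point for the validation loop described later, not as a final configuration.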
Finding the optimal vocabulary size requires balancing information capture against overfitting and computational cost.
Empirical Approach:
Typical Findings:
Filtering Strategies:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
import numpy as np
from typing import List, Dict


def optimize_vocabulary_size(
    texts: List[str],
    labels: List[int],
    vocab_sizes: List[int] = [100, 500, 1000, 5000, 10000, 25000, 50000],
    ngram_range: tuple = (1, 2),
    cv: int = 5,
) -> Dict:
    """Find the optimal vocabulary size using cross-validation."""
    results = {'vocab_size': [], 'mean_score': [], 'std_score': []}

    for vocab_size in vocab_sizes:
        pipeline = Pipeline([
            ('vectorizer', TfidfVectorizer(
                ngram_range=ngram_range,
                max_features=vocab_size,
                min_df=2,
                sublinear_tf=True,
            )),
            ('classifier', LogisticRegression(
                max_iter=1000,
                random_state=42,
                C=1.0,
            )),
        ])

        scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring='accuracy')
        results['vocab_size'].append(vocab_size)
        results['mean_score'].append(np.mean(scores))
        results['std_score'].append(np.std(scores))
        print(f"Vocab {vocab_size:>6}: {np.mean(scores):.4f} (+/- {np.std(scores):.4f})")

    # Find the optimum (highest mean CV score)
    best_idx = np.argmax(results['mean_score'])
    best_vocab = results['vocab_size'][best_idx]
    print(f"Optimal vocabulary size: {best_vocab}")
    return results


def compare_filtering_strategies(
    texts: List[str],
    labels: List[int],
    cv: int = 5,
) -> Dict:
    """Compare different vocabulary filtering strategies."""
    strategies = {
        'No Filtering':     {'min_df': 1, 'max_df': 1.0, 'max_features': None},
        'min_df=2':         {'min_df': 2, 'max_df': 1.0, 'max_features': None},
        'min_df=5':         {'min_df': 5, 'max_df': 1.0, 'max_features': None},
        'max_df=0.9':       {'min_df': 2, 'max_df': 0.9, 'max_features': None},
        'max_features=10K': {'min_df': 2, 'max_df': 1.0, 'max_features': 10000},
        'Chi2 + 5K':        {'min_df': 2, 'max_df': 1.0, 'max_features': None, 'select_k': 5000},
    }

    results = {}
    for name, params in strategies.items():
        select_k = params.pop('select_k', None)
        vectorizer = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True, **params)
        if select_k:
            pipeline = Pipeline([
                ('vectorizer', vectorizer),
                ('feature_selection', SelectKBest(chi2, k=select_k)),
                ('classifier', LogisticRegression(max_iter=1000, random_state=42)),
            ])
        else:
            pipeline = Pipeline([
                ('vectorizer', vectorizer),
                ('classifier', LogisticRegression(max_iter=1000, random_state=42)),
            ])

        scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring='accuracy')

        # Fit once more to report the actual vocabulary size after filtering
        pipeline.fit(texts, labels)
        if select_k:
            vocab_size = select_k
        else:
            vocab_size = len(pipeline.named_steps['vectorizer'].vocabulary_)

        results[name] = {
            'mean_score': np.mean(scores),
            'std_score': np.std(scores),
            'vocab_size': vocab_size,
        }
        print(f"{name:20}: {np.mean(scores):.4f} (+/- {np.std(scores):.4f}) | vocab={vocab_size}")

    return results


def demonstrate_optimization():
    """Demonstrate vocabulary optimization."""
    print("=" * 60)
    print("VOCABULARY SIZE OPTIMIZATION")
    print("=" * 60)

    # Sample data (in practice, use a real dataset)
    texts = [
        "The machine learning model achieved excellent results",
        "Deep neural networks require extensive training data",
        "Feature engineering improves model performance significantly",
        "Overfitting occurs when models are too complex",
        "Cross-validation helps estimate generalization error",
        "Regularization prevents overfitting in high dimensions",
        "Gradient descent optimizes neural network weights",
        "Batch normalization accelerates training convergence",
        "Dropout provides effective neural network regularization",
        "Transfer learning leverages pre-trained representations",
    ] * 100
    labels = [1, 1, 1, 0, 0, 0, 1, 1, 0, 1] * 100

    print("--- Vocabulary Size Sweep ---")
    optimize_vocabulary_size(
        texts, labels,
        vocab_sizes=[100, 500, 1000, 2000, 5000],
        ngram_range=(1, 2),
        cv=5,
    )

    print("--- Filtering Strategy Comparison ---")
    compare_filtering_strategies(texts, labels, cv=5)


if __name__ == "__main__":
    demonstrate_optimization()
```

Beyond accuracy, practical systems must consider memory, latency, and throughput constraints.
Memory Requirements:
| Component | Approximate Size | Scaling |
|---|---|---|
| Vocabulary dict | ~100 bytes/term | Linear with vocabulary size |
| Feature matrix | ~12 bytes/nonzero | O(n_docs × avg_features) |
| Model weights | ~8 bytes/feature | Linear with feature count |
| Vectorizer state | ~50 MB for 100K vocab | Linear with vocabulary size |
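The per-item costs above combine into a quick capacity estimate. A back-of-envelope sketch (the function name and the ~100/12/8-byte constants are the rough figures from the table, not exact measurements):

```python
def estimate_memory_mb(vocab_size: int, n_docs: int, avg_nnz_per_doc: int) -> float:
    """Back-of-envelope footprint from the approximate per-item costs above."""
    vocab_bytes = vocab_size * 100                # ~100 bytes per vocabulary entry
    matrix_bytes = n_docs * avg_nnz_per_doc * 12  # ~12 bytes per stored nonzero (CSR)
    model_bytes = vocab_size * 8                  # ~8 bytes per float64 weight
    return (vocab_bytes + matrix_bytes + model_bytes) / 1e6

# e.g. 100K vocabulary, 50K documents, ~200 nonzero features per document
print(f"~{estimate_memory_mb(100_000, 50_000, 200):.0f} MB")
```

Note that the sparse feature matrix, not the vocabulary, usually dominates: doubling `n_docs` doubles the estimate, while doubling the vocabulary barely moves it.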
Latency Analysis:
Production Optimizations:
A common production failure: vocabulary grows unbounded on streaming data. Always enforce max_features limits and use fitted vocabularies at inference time. Never call fit() on production data—only transform() with a frozen vocabulary.
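The fit-once, transform-only discipline can be sketched as follows (the document strings and variable names are illustrative):

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

# Training time: fit once, enforce a vocabulary cap, then freeze
train_docs = ["limited offer click now", "quarterly report attached",
              "meeting moved to friday", "win a prize click here"]
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=50_000)
vectorizer.fit(train_docs)
frozen = pickle.dumps(vectorizer)

# Inference time: load and transform only; never call fit() here
serving_vectorizer = pickle.loads(frozen)
X = serving_vectorizer.transform(["totally unseen vocabulary", "click now"])
print(X.shape, "feature space fixed at", len(serving_vectorizer.vocabulary_))
```

Terms absent from the frozen vocabulary simply map to zero columns, so the feature space and model input dimension stay fixed no matter what streams in.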
| Configuration | Vocabulary Size | Memory | Fit Time | Transform Time |
|---|---|---|---|---|
| Unigrams (1,1) | ~20K | ~10 MB | ~2s | ~0.5s |
| Uni+Bigrams (1,2) | ~100K | ~50 MB | ~5s | ~1s |
| Up to Trigrams (1,3) | ~300K | ~150 MB | ~15s | ~2s |
| Char (3,5) | ~50K | ~25 MB | ~10s | ~3s |
| Combined Word+Char | ~150K | ~75 MB | ~15s | ~4s |
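Benchmark numbers like those above are hardware- and corpus-dependent, so it is worth measuring on your own data. A minimal timing sketch (the repeated sentence is a stand-in corpus; absolute times will differ from the table):

```python
import time
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the quick brown fox jumps over the lazy dog again and again"] * 2000

def benchmark(name: str, **kwargs) -> int:
    """Time fit and transform for one vectorizer config; return vocab size."""
    vec = TfidfVectorizer(**kwargs)
    t0 = time.perf_counter(); vec.fit(docs);       t_fit = time.perf_counter() - t0
    t0 = time.perf_counter(); vec.transform(docs); t_tr = time.perf_counter() - t0
    print(f"{name:14} vocab={len(vec.vocabulary_):>5}  fit={t_fit:.3f}s  transform={t_tr:.3f}s")
    return len(vec.vocabulary_)

uni = benchmark("word (1,1)", ngram_range=(1, 1))
bi  = benchmark("word (1,2)", ngram_range=(1, 2))
ch  = benchmark("char (3,5)", analyzer="char_wb", ngram_range=(3, 5))
```

The relative ordering usually matches the table: vocabulary and fit time grow with n-gram order, and character analyzers pay extra transform cost per document.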
Use this systematic framework when configuring n-gram features for a new task:
Step 1: Identify Task Characteristics
Step 2: Select Initial Configuration
Based on task type, choose a starting point:
Step 3: Set Vocabulary Parameters
Step 4: Validate and Iterate
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import cross_val_score
from typing import Dict, List
import numpy as np


class NgramConfigurationSearch:
    """Systematic search for the optimal n-gram configuration."""

    def __init__(self, task_type: str = 'classification'):
        self.task_type = task_type
        self.best_config_: Dict = None
        self.best_score_: float = None
        self.results_: List[Dict] = []

    def _get_initial_configs(self, task_hint: str = None) -> List[Dict]:
        """Generate configuration candidates based on task type."""
        if task_hint == 'topic':
            # Topic classification: unigrams dominant
            configs = [
                {'word_ngram_range': (1, 1), 'char': False, 'max_vocab': 10000},
                {'word_ngram_range': (1, 2), 'char': False, 'max_vocab': 20000},
            ]
        elif task_hint == 'sentiment':
            # Sentiment: bigrams for negation
            configs = [
                {'word_ngram_range': (1, 2), 'char': False, 'max_vocab': 20000},
                {'word_ngram_range': (1, 3), 'char': False, 'max_vocab': 30000},
            ]
        elif task_hint == 'style':
            # Style/authorship: character n-grams
            configs = [
                {'word_ngram_range': None, 'char': True,
                 'char_ngram_range': (3, 5), 'max_vocab': 20000},
                {'word_ngram_range': (1, 2), 'char': True,
                 'char_ngram_range': (3, 5), 'max_vocab': 30000},
            ]
        elif task_hint == 'noisy':
            # Noisy text: character n-grams essential
            configs = [
                {'word_ngram_range': (1, 2), 'char': True,
                 'char_ngram_range': (3, 4), 'max_vocab': 30000},
                {'word_ngram_range': (1, 1), 'char': True,
                 'char_ngram_range': (3, 5), 'max_vocab': 25000},
            ]
        else:
            # Unknown task: comprehensive search
            configs = [
                {'word_ngram_range': (1, 1), 'char': False, 'max_vocab': 10000},
                {'word_ngram_range': (1, 2), 'char': False, 'max_vocab': 20000},
                {'word_ngram_range': (1, 3), 'char': False, 'max_vocab': 30000},
                {'word_ngram_range': (1, 2), 'char': True,
                 'char_ngram_range': (3, 5), 'max_vocab': 30000},
            ]
        return configs

    def _build_pipeline(self, config: Dict) -> Pipeline:
        """Build an sklearn pipeline from a configuration."""
        transformers = []

        if config.get('word_ngram_range'):
            word_vec = TfidfVectorizer(
                analyzer='word',
                ngram_range=config['word_ngram_range'],
                min_df=2,
                max_features=config.get('max_vocab', 10000) // (2 if config.get('char') else 1),
                sublinear_tf=True,
                stop_words='english' if config.get('remove_stop_words') else None,
            )
            transformers.append(('word', word_vec))

        if config.get('char'):
            char_vec = TfidfVectorizer(
                analyzer='char_wb',
                ngram_range=config.get('char_ngram_range', (3, 5)),
                min_df=2,
                max_features=config.get('max_vocab', 10000) // (2 if config.get('word_ngram_range') else 1),
                sublinear_tf=True,
            )
            transformers.append(('char', char_vec))

        if len(transformers) == 1:
            features = transformers[0][1]
        else:
            features = FeatureUnion(transformers)

        return Pipeline([
            ('features', features),
            ('classifier', LogisticRegression(max_iter=1000, random_state=42, C=1.0)),
        ])

    def search(
        self,
        texts: List[str],
        labels: List[int],
        task_hint: str = None,
        cv: int = 5,
    ) -> Dict:
        """Search for the best n-gram configuration."""
        configs = self._get_initial_configs(task_hint)
        print("N-gram Configuration Search")
        print("=" * 60)

        for config in configs:
            pipeline = self._build_pipeline(config)
            scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring='accuracy')
            mean_score = np.mean(scores)
            std_score = np.std(scores)
            self.results_.append({
                'config': config,
                'mean_score': mean_score,
                'std_score': std_score,
            })

            # Format the configuration for display
            config_str = f"word={config.get('word_ngram_range', 'None')}"
            if config.get('char'):
                config_str += f", char={config.get('char_ngram_range')}"
            config_str += f", vocab={config.get('max_vocab')}"
            print(config_str)
            print(f"  Score: {mean_score:.4f} (+/- {std_score:.4f})")

            if self.best_score_ is None or mean_score > self.best_score_:
                self.best_score_ = mean_score
                self.best_config_ = config

        print("\n" + "=" * 60)
        print(f"Best configuration: {self.best_config_}")
        print(f"Best score: {self.best_score_:.4f}")
        return self.best_config_


def demonstrate_decision_framework():
    """Demonstrate the n-gram decision framework."""
    # Sample data
    texts = [
        "This movie was absolutely fantastic and amazing",
        "I loved every moment of this wonderful film",
        "Terrible waste of time completely boring",
        "The worst film I have ever seen awful",
        "Great acting brilliant storyline highly recommend",
        "Disappointing slow and not worth watching",
    ] * 50
    labels = [1, 1, 0, 0, 1, 0] * 50

    # Run the search with a sentiment hint
    searcher = NgramConfigurationSearch()
    best_config = searcher.search(texts, labels, task_hint='sentiment', cv=5)

    print("Recommended configuration for sentiment analysis:")
    print(f"  {best_config}")


if __name__ == "__main__":
    demonstrate_decision_framework()
```

Years of practical experience reveal recurring mistakes and proven solutions in n-gram feature engineering.
This module has taken you from the fundamentals of unigrams to a comprehensive understanding of n-gram feature engineering. Let's consolidate the key insights across all five pages:
The Practitioner's Cheat Sheet:
| Scenario | Recommended Configuration |
|---|---|
| Quick baseline | (1,2) word n-grams, 10K vocab, min_df=2 |
| Sentiment analysis | (1,2) word, keep stop words, TF-IDF |
| Topic classification | (1,1) word, remove stop words, TF-IDF |
| Noisy user text | (1,2) word + (3,5) char combined |
| Language ID | (2,5) character only |
| Authorship | (3,5) character, optionally + (1,2) word |
| Large training data | (1,3) word, 50K+ vocab, feature selection |
| Small training data | (1,1) or (1,2) word, <10K vocab |
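The first row of the cheat sheet, the quick baseline, takes only a few lines to stand up. A sketch using hypothetical labeled snippets (swap in your own dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# The "quick baseline" row: (1,2) word n-grams, 10K vocab cap, min_df=2
baseline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), max_features=10_000, min_df=2)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical labeled snippets; replace with real training data
texts = ["great service fast shipping", "terrible service slow shipping"] * 20
labels = [1, 0] * 20
baseline.fit(texts, labels)
print(baseline.predict(["fast and great"]))
```

Establish this baseline first; the more elaborate configurations in the cheat sheet only earn their complexity if they beat it on validation data.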
Where to Go From Here:
N-gram features, despite their age, remain relevant even in the era of deep learning. They serve as fast baselines, interpretable features, and robust fallbacks when compute or training data is limited.
As you progress to word embeddings, transformers, and other deep learning approaches, remember that the fundamentals of text feature engineering—vocabulary, context, filtering, and tradeoffs—remain relevant throughout.
Congratulations! You have mastered n-gram feature engineering for text. You understand unigrams, bigrams, general n-grams, and character n-grams. You can configure n-gram extraction for any NLP task, optimize vocabulary size, balance tradeoffs, and deploy production-ready text features. This knowledge forms the foundation for all text-based machine learning.