You now understand unigrams, bigrams, n-grams, and character n-grams. You know their mathematics, implementations, and applications. But when faced with a real NLP task, how do you choose?
These decisions involve fundamental tradeoffs that depend on your task, data characteristics, and constraints. This page provides a systematic framework for making these choices, synthesizing everything we've learned into actionable guidance for real-world text feature engineering.
By the end of this page, you will understand the core tradeoffs in n-gram feature engineering, have a decision framework for configuration selection, know how to validate your choices empirically, and be equipped to make principled decisions for any text classification or NLP task.
Every n-gram configuration decision involves balancing competing concerns. Understanding these tradeoffs enables informed choices rather than arbitrary defaults.
Tradeoff 1: Context Capture vs. Generalization
Higher-order n-grams capture more context but generalize worse.
The sweet spot depends on training data volume and task specificity.
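This tradeoff is easy to see empirically: as n grows, the fraction of n-grams observed only once rises, and one-off features cannot generalize. A minimal sketch on a hypothetical toy corpus (the sentences are illustrative only):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the model fits the data well",
    "the model fits the validation data poorly",
    "higher order ngrams capture more context",
]

singleton_fracs = []
for n in (1, 2, 3):
    vec = CountVectorizer(analyzer="word", ngram_range=(n, n))
    X = vec.fit_transform(corpus)
    counts = X.sum(axis=0).A1        # corpus frequency of each n-gram
    frac = (counts == 1).mean()      # share of n-grams seen exactly once
    singleton_fracs.append(frac)
    print(f"{n}-grams: {X.shape[1]} features, {frac:.0%} seen only once")
```

Even on three sentences, the singleton share climbs with n; on real corpora the effect is far more pronounced.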
Tradeoff 2: Vocabulary Size vs. Computational Resources
Larger vocabularies capture more information but cost more.
Tradeoff 3: Precision vs. Recall in Feature Selection
Strict filtering removes noise but may discard signal.
Tradeoff 4: Interpretability vs. Performance
Word n-grams are interpretable; character n-grams often perform better.
| Tradeoff | Low End | High End | Optimal Strategy |
|---|---|---|---|
| Context Capture | Unigrams (no context) | High-order n-grams (full context) | Match to task; (1,2) is often sufficient |
| Vocabulary Size | 100-1K features | 100K+ features | Use validation; diminishing returns >50K |
| Feature Filtering | min_df=1 (keep all) | min_df=10+ (strict) | min_df=2-5 for most tasks |
| Word vs. Character | Pure word n-grams | Pure character n-grams | Combine when robustness matters |
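The "combine when robustness matters" strategy from the last row can be sketched with sklearn's `FeatureUnion`, which concatenates the two feature spaces (the documents below are hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

# Word n-grams carry content; char n-grams survive misspellings/obfuscation
combined = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))),
])

docs = ["free offer click now", "frree offerr clickk noww", "meeting notes attached"]
X = combined.fit_transform(docs)
print("combined feature count:", X.shape[1])
```

The obfuscated second document shares almost no word n-grams with the first, but many character n-grams, which is exactly the robustness the table recommends.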
N-gram configuration can be understood through the lens of the bias-variance tradeoff, fundamental to all machine learning.
Low Bias (Complex Models): fit the training data well, but may overfit.
Low Variance (Simple Models): generalize well, but may underfit complex patterns.
Data Volume Determines Optimal Complexity:
| Training Size | Recommended N-gram Range | Max Vocabulary | Character N-grams? |
|---|---|---|---|
| < 1,000 docs | (1, 1) unigrams only | 1,000 - 5,000 | Only if essential (language ID) |
| 1,000 - 10,000 docs | (1, 2) uni + bigrams | 5,000 - 20,000 | Consider for noisy text |
| 10,000 - 100,000 docs | (1, 2) or (1, 3) | 20,000 - 50,000 | Yes, with feature selection |
| > 100,000 docs | (1, 3) or higher | 50,000 - 100,000+ | Yes, combined features optimal |
Plot training and validation accuracy as you increase training data. If validation lags significantly behind training, you're overfitting—reduce n-gram order or vocabulary size. If both are low and closely tracked, you're underfitting—consider higher-order n-grams or larger vocabulary.
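This diagnostic can be sketched with sklearn's `learning_curve` helper, which refits the pipeline on growing subsets and scores both splits (the corpus below is a hypothetical stand-in for your data):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import Pipeline

# Hypothetical toy data; substitute your own corpus and labels
texts = ["good product works great", "bad product broke fast",
         "excellent quality very happy", "poor quality very unhappy"] * 50
labels = [1, 0, 1, 0] * 50

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])

sizes, train_scores, val_scores = learning_curve(
    pipe, texts, labels, cv=5, train_sizes=np.linspace(0.2, 1.0, 4),
)
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    # A large train/validation gap signals overfitting; two low, close
    # curves signal underfitting
    print(f"n={n:>4}  train={tr:.3f}  val={va:.3f}  gap={tr - va:.3f}")
```

On real data, read the printed gap exactly as described above: shrink the feature space when the gap is wide, and enrich it when both scores plateau low.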
Different NLP tasks have different optimal configurations based on what information is most discriminative.
Topic Classification (e.g., News Categories, Subject Detection):
Sentiment Analysis:
Spam Detection:
Language Identification:
Authorship Attribution:
| Task | Word N-grams | Char N-grams | Stop Words | Special Considerations |
|---|---|---|---|---|
| Topic Classification | (1, 2) | Optional | Remove | TF-IDF important |
| Sentiment Analysis | (1, 2) or (1, 3) | Optional | Keep | Negation handling critical |
| Spam Detection | (1, 2) | (3, 5) | Keep | Obfuscation handling |
| Language ID | None | (2, 5) | N/A | Small training data sufficient |
| Authorship | (1, 2) | (3, 5) | Keep | Style over content |
| Named Entity | (1, 3) | (3, 4) | Keep | Boundary markers help |
| Document Retrieval | (1, 1) | Optional | Remove | TF-IDF ranking |
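One convenient way to operationalize the table is a small preset dictionary of vectorizer settings. A sketch under the assumption that `TASK_CONFIGS` and `make_vectorizer` are names we are inventing here (they are not part of any library):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Starting points mirroring the table above (hypothetical defaults; tune per dataset)
TASK_CONFIGS = {
    "topic":       dict(analyzer="word", ngram_range=(1, 2), stop_words="english"),
    "sentiment":   dict(analyzer="word", ngram_range=(1, 2)),   # keep stop words for negation
    "language_id": dict(analyzer="char_wb", ngram_range=(2, 5)),
    "authorship":  dict(analyzer="char_wb", ngram_range=(3, 5)),
}

def make_vectorizer(task: str, **overrides) -> TfidfVectorizer:
    """Build a vectorizer from a task preset, allowing per-dataset overrides."""
    return TfidfVectorizer(**{**TASK_CONFIGS[task], **overrides})

vec = make_vectorizer("sentiment", min_df=2)
print(vec.analyzer, vec.ngram_range, vec.min_df)
```

Treat each preset as a starting point for the validation loop described later, not as a final configuration.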
Finding the optimal vocabulary size requires balancing information capture against overfitting and computational cost.
Empirical Approach:
Typical Findings:
Filtering Strategies:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
import numpy as np
from typing import List, Dict


def optimize_vocabulary_size(
    texts: List[str],
    labels: List[int],
    vocab_sizes: List[int] = [100, 500, 1000, 5000, 10000, 25000, 50000],
    ngram_range: tuple = (1, 2),
    cv: int = 5,
) -> Dict:
    """Find the optimal vocabulary size using cross-validation."""
    results = {'vocab_size': [], 'mean_score': [], 'std_score': []}

    for vocab_size in vocab_sizes:
        pipeline = Pipeline([
            ('vectorizer', TfidfVectorizer(
                ngram_range=ngram_range,
                max_features=vocab_size,
                min_df=2,
                sublinear_tf=True,
            )),
            ('classifier', LogisticRegression(
                max_iter=1000,
                random_state=42,
                C=1.0,
            )),
        ])

        scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring='accuracy')
        results['vocab_size'].append(vocab_size)
        results['mean_score'].append(np.mean(scores))
        results['std_score'].append(np.std(scores))
        print(f"Vocab {vocab_size:>6}: {np.mean(scores):.4f} (+/- {np.std(scores):.4f})")

    # Find the optimum (highest mean CV score)
    best_idx = np.argmax(results['mean_score'])
    best_vocab = results['vocab_size'][best_idx]
    print(f"Optimal vocabulary size: {best_vocab}")
    return results


def compare_filtering_strategies(
    texts: List[str],
    labels: List[int],
    cv: int = 5,
) -> Dict:
    """Compare different vocabulary filtering strategies."""
    strategies = {
        'No Filtering':     {'min_df': 1, 'max_df': 1.0, 'max_features': None},
        'min_df=2':         {'min_df': 2, 'max_df': 1.0, 'max_features': None},
        'min_df=5':         {'min_df': 5, 'max_df': 1.0, 'max_features': None},
        'max_df=0.9':       {'min_df': 2, 'max_df': 0.9, 'max_features': None},
        'max_features=10K': {'min_df': 2, 'max_df': 1.0, 'max_features': 10000},
        'Chi2 + 5K':        {'min_df': 2, 'max_df': 1.0, 'max_features': None, 'select_k': 5000},
    }

    results = {}
    for name, params in strategies.items():
        select_k = params.pop('select_k', None)
        vectorizer = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True, **params)
        if select_k:
            pipeline = Pipeline([
                ('vectorizer', vectorizer),
                ('feature_selection', SelectKBest(chi2, k=select_k)),
                ('classifier', LogisticRegression(max_iter=1000, random_state=42)),
            ])
        else:
            pipeline = Pipeline([
                ('vectorizer', vectorizer),
                ('classifier', LogisticRegression(max_iter=1000, random_state=42)),
            ])

        scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring='accuracy')

        # Fit once more to report the actual vocabulary size after filtering
        pipeline.fit(texts, labels)
        if select_k:
            vocab_size = select_k
        else:
            vocab_size = len(pipeline.named_steps['vectorizer'].vocabulary_)

        results[name] = {
            'mean_score': np.mean(scores),
            'std_score': np.std(scores),
            'vocab_size': vocab_size,
        }
        print(f"{name:20}: {np.mean(scores):.4f} (+/- {np.std(scores):.4f}) | vocab={vocab_size}")

    return results


def demonstrate_optimization():
    """Demonstrate vocabulary optimization."""
    print("=" * 60)
    print("VOCABULARY SIZE OPTIMIZATION")
    print("=" * 60)

    # Sample data (in practice, use a real dataset)
    texts = [
        "The machine learning model achieved excellent results",
        "Deep neural networks require extensive training data",
        "Feature engineering improves model performance significantly",
        "Overfitting occurs when models are too complex",
        "Cross-validation helps estimate generalization error",
        "Regularization prevents overfitting in high dimensions",
        "Gradient descent optimizes neural network weights",
        "Batch normalization accelerates training convergence",
        "Dropout provides effective neural network regularization",
        "Transfer learning leverages pre-trained representations",
    ] * 100
    labels = [1, 1, 1, 0, 0, 0, 1, 1, 0, 1] * 100

    print("--- Vocabulary Size Sweep ---")
    optimize_vocabulary_size(
        texts, labels,
        vocab_sizes=[100, 500, 1000, 2000, 5000],
        ngram_range=(1, 2),
        cv=5,
    )

    print("--- Filtering Strategy Comparison ---")
    compare_filtering_strategies(texts, labels, cv=5)


if __name__ == "__main__":
    demonstrate_optimization()
```

Beyond accuracy, practical systems must consider memory, latency, and throughput constraints.
Memory Requirements:
| Component | Approximate Size | Scaling |
|---|---|---|
| Vocabulary dict | ~100 bytes/term | Linear with vocabulary size |
| Feature matrix | ~12 bytes/nonzero | O(n_docs × avg_features) |
| Model weights | ~8 bytes/feature | Linear with feature count |
| Vectorizer state | ~50 MB for 100K vocab | Linear with vocabulary size |
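The per-item costs above combine into a quick capacity estimate. A back-of-envelope sketch (the function name and the ~100/12/8-byte constants are the rough figures from the table, not exact measurements):

```python
def estimate_memory_mb(vocab_size: int, n_docs: int, avg_nnz_per_doc: int) -> float:
    """Back-of-envelope footprint from the approximate per-item costs above."""
    vocab_bytes = vocab_size * 100                # ~100 bytes per vocabulary entry
    matrix_bytes = n_docs * avg_nnz_per_doc * 12  # ~12 bytes per stored nonzero (CSR)
    model_bytes = vocab_size * 8                  # ~8 bytes per float64 weight
    return (vocab_bytes + matrix_bytes + model_bytes) / 1e6

# e.g. 100K vocabulary, 50K documents, ~200 nonzero features per document
print(f"~{estimate_memory_mb(100_000, 50_000, 200):.0f} MB")
```

Note that the sparse feature matrix, not the vocabulary, usually dominates: doubling `n_docs` doubles the estimate, while doubling the vocabulary barely moves it.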
Latency Analysis:
Production Optimizations:
A common production failure: vocabulary grows unbounded on streaming data. Always enforce max_features limits and use fitted vocabularies at inference time. Never call fit() on production data—only transform() with a frozen vocabulary.
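The fit-once, transform-only discipline can be sketched as follows (the document strings and variable names are illustrative):

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

# Training time: fit once, enforce a vocabulary cap, then freeze
train_docs = ["limited offer click now", "quarterly report attached",
              "meeting moved to friday", "win a prize click here"]
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=50_000)
vectorizer.fit(train_docs)
frozen = pickle.dumps(vectorizer)

# Inference time: load and transform only; never call fit() here
serving_vectorizer = pickle.loads(frozen)
X = serving_vectorizer.transform(["totally unseen vocabulary", "click now"])
print(X.shape, "feature space fixed at", len(serving_vectorizer.vocabulary_))
```

Terms absent from the frozen vocabulary simply map to zero columns, so the feature space and model input dimension stay fixed no matter what streams in.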
| Configuration | Vocabulary Size | Memory | Fit Time | Transform Time |
|---|---|---|---|---|
| Unigrams (1,1) | ~20K | ~10 MB | ~2s | ~0.5s |
| Uni+Bigrams (1,2) | ~100K | ~50 MB | ~5s | ~1s |
| Up to Trigrams (1,3) | ~300K | ~150 MB | ~15s | ~2s |
| Char (3,5) | ~50K | ~25 MB | ~10s | ~3s |
| Combined Word+Char | ~150K | ~75 MB | ~15s | ~4s |
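Benchmark numbers like those above are hardware- and corpus-dependent, so it is worth measuring on your own data. A minimal timing sketch (the repeated sentence is a stand-in corpus; absolute times will differ from the table):

```python
import time
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the quick brown fox jumps over the lazy dog again and again"] * 2000

def benchmark(name: str, **kwargs) -> int:
    """Time fit and transform for one vectorizer config; return vocab size."""
    vec = TfidfVectorizer(**kwargs)
    t0 = time.perf_counter(); vec.fit(docs);       t_fit = time.perf_counter() - t0
    t0 = time.perf_counter(); vec.transform(docs); t_tr = time.perf_counter() - t0
    print(f"{name:14} vocab={len(vec.vocabulary_):>5}  fit={t_fit:.3f}s  transform={t_tr:.3f}s")
    return len(vec.vocabulary_)

uni = benchmark("word (1,1)", ngram_range=(1, 1))
bi  = benchmark("word (1,2)", ngram_range=(1, 2))
ch  = benchmark("char (3,5)", analyzer="char_wb", ngram_range=(3, 5))
```

The relative ordering usually matches the table: vocabulary and fit time grow with n-gram order, and character analyzers pay extra transform cost per document.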
Use this systematic framework when configuring n-gram features for a new task:
Step 1: Identify Task Characteristics
Step 2: Select Initial Configuration
Based on task type, choose a starting point:
Step 3: Set Vocabulary Parameters
Step 4: Validate and Iterate
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import cross_val_score
from typing import Dict, List
import numpy as np


class NgramConfigurationSearch:
    """Systematic search for the optimal n-gram configuration."""

    def __init__(self, task_type: str = 'classification'):
        self.task_type = task_type
        self.best_config_: Dict = None
        self.best_score_: float = None
        self.results_: List[Dict] = []

    def _get_initial_configs(self, task_hint: str = None) -> List[Dict]:
        """Generate configuration candidates based on task type."""
        if task_hint == 'topic':
            # Topic classification: unigrams dominant
            configs = [
                {'word_ngram_range': (1, 1), 'char': False, 'max_vocab': 10000},
                {'word_ngram_range': (1, 2), 'char': False, 'max_vocab': 20000},
            ]
        elif task_hint == 'sentiment':
            # Sentiment: bigrams for negation
            configs = [
                {'word_ngram_range': (1, 2), 'char': False, 'max_vocab': 20000},
                {'word_ngram_range': (1, 3), 'char': False, 'max_vocab': 30000},
            ]
        elif task_hint == 'style':
            # Style/authorship: character n-grams
            configs = [
                {'word_ngram_range': None, 'char': True,
                 'char_ngram_range': (3, 5), 'max_vocab': 20000},
                {'word_ngram_range': (1, 2), 'char': True,
                 'char_ngram_range': (3, 5), 'max_vocab': 30000},
            ]
        elif task_hint == 'noisy':
            # Noisy text: character n-grams essential
            configs = [
                {'word_ngram_range': (1, 2), 'char': True,
                 'char_ngram_range': (3, 4), 'max_vocab': 30000},
                {'word_ngram_range': (1, 1), 'char': True,
                 'char_ngram_range': (3, 5), 'max_vocab': 25000},
            ]
        else:
            # Unknown task: comprehensive search
            configs = [
                {'word_ngram_range': (1, 1), 'char': False, 'max_vocab': 10000},
                {'word_ngram_range': (1, 2), 'char': False, 'max_vocab': 20000},
                {'word_ngram_range': (1, 3), 'char': False, 'max_vocab': 30000},
                {'word_ngram_range': (1, 2), 'char': True,
                 'char_ngram_range': (3, 5), 'max_vocab': 30000},
            ]
        return configs

    def _build_pipeline(self, config: Dict) -> Pipeline:
        """Build an sklearn pipeline from a configuration."""
        transformers = []

        if config.get('word_ngram_range'):
            word_vec = TfidfVectorizer(
                analyzer='word',
                ngram_range=config['word_ngram_range'],
                min_df=2,
                max_features=config.get('max_vocab', 10000) // (2 if config.get('char') else 1),
                sublinear_tf=True,
                stop_words='english' if config.get('remove_stop_words') else None,
            )
            transformers.append(('word', word_vec))

        if config.get('char'):
            char_vec = TfidfVectorizer(
                analyzer='char_wb',
                ngram_range=config.get('char_ngram_range', (3, 5)),
                min_df=2,
                max_features=config.get('max_vocab', 10000) // (2 if config.get('word_ngram_range') else 1),
                sublinear_tf=True,
            )
            transformers.append(('char', char_vec))

        if len(transformers) == 1:
            features = transformers[0][1]
        else:
            features = FeatureUnion(transformers)

        return Pipeline([
            ('features', features),
            ('classifier', LogisticRegression(max_iter=1000, random_state=42, C=1.0)),
        ])

    def search(
        self,
        texts: List[str],
        labels: List[int],
        task_hint: str = None,
        cv: int = 5,
    ) -> Dict:
        """Search for the best n-gram configuration."""
        configs = self._get_initial_configs(task_hint)
        print("N-gram Configuration Search")
        print("=" * 60)

        for config in configs:
            pipeline = self._build_pipeline(config)
            scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring='accuracy')
            mean_score = np.mean(scores)
            std_score = np.std(scores)
            self.results_.append({
                'config': config,
                'mean_score': mean_score,
                'std_score': std_score,
            })

            # Format the configuration for display
            config_str = f"word={config.get('word_ngram_range', 'None')}"
            if config.get('char'):
                config_str += f", char={config.get('char_ngram_range')}"
            config_str += f", vocab={config.get('max_vocab')}"
            print(config_str)
            print(f"  Score: {mean_score:.4f} (+/- {std_score:.4f})")

            if self.best_score_ is None or mean_score > self.best_score_:
                self.best_score_ = mean_score
                self.best_config_ = config

        print("\n" + "=" * 60)
        print(f"Best configuration: {self.best_config_}")
        print(f"Best score: {self.best_score_:.4f}")
        return self.best_config_


def demonstrate_decision_framework():
    """Demonstrate the n-gram decision framework."""
    # Sample data
    texts = [
        "This movie was absolutely fantastic and amazing",
        "I loved every moment of this wonderful film",
        "Terrible waste of time completely boring",
        "The worst film I have ever seen awful",
        "Great acting brilliant storyline highly recommend",
        "Disappointing slow and not worth watching",
    ] * 50
    labels = [1, 1, 0, 0, 1, 0] * 50

    # Run the search with a sentiment hint
    searcher = NgramConfigurationSearch()
    best_config = searcher.search(texts, labels, task_hint='sentiment', cv=5)

    print("Recommended configuration for sentiment analysis:")
    print(f"  {best_config}")


if __name__ == "__main__":
    demonstrate_decision_framework()
```

Years of practical experience reveal recurring mistakes and proven solutions in n-gram feature engineering.
This module has taken you from the fundamentals of unigrams to a comprehensive understanding of n-gram feature engineering. Let's consolidate the key insights across all five pages:
The Practitioner's Cheat Sheet:
| Scenario | Recommended Configuration |
|---|---|
| Quick baseline | (1,2) word n-grams, 10K vocab, min_df=2 |
| Sentiment analysis | (1,2) word, keep stop words, TF-IDF |
| Topic classification | (1,1) word, remove stop words, TF-IDF |
| Noisy user text | (1,2) word + (3,5) char combined |
| Language ID | (2,5) character only |
| Authorship | (3,5) character, optionally + (1,2) word |
| Large training data | (1,3) word, 50K+ vocab, feature selection |
| Small training data | (1,1) or (1,2) word, <10K vocab |
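The first row of the cheat sheet, the quick baseline, takes only a few lines to stand up. A sketch using hypothetical labeled snippets (swap in your own dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# The "quick baseline" row: (1,2) word n-grams, 10K vocab cap, min_df=2
baseline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), max_features=10_000, min_df=2)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical labeled snippets; replace with real training data
texts = ["great service fast shipping", "terrible service slow shipping"] * 20
labels = [1, 0] * 20
baseline.fit(texts, labels)
print(baseline.predict(["fast and great"]))
```

Establish this baseline first; the more elaborate configurations in the cheat sheet only earn their complexity if they beat it on validation data.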
Where to Go From Here:
N-gram features, despite their age, remain relevant even in the era of deep learning. They serve as fast baselines, interpretable features, and robust fallbacks when compute or training data is limited.
As you progress to word embeddings, transformers, and other deep learning approaches, remember that the fundamentals of text feature engineering—vocabulary, context, filtering, and tradeoffs—remain relevant throughout.
Congratulations! You have mastered n-gram feature engineering for text. You understand unigrams, bigrams, general n-grams, and character n-grams. You can configure n-gram extraction for any NLP task, optimize vocabulary size, balance tradeoffs, and deploy production-ready text features. This knowledge forms the foundation for all text-based machine learning.