Every text machine learning model rests upon a vocabulary—the set of tokens recognized as features. This vocabulary defines the dimension of your feature space, determines what information can be captured, and profoundly influences model performance.
Yet vocabulary construction is often treated as an afterthought: split on whitespace, maybe lowercase, use whatever results. This casual approach leaves significant performance on the table and can introduce subtle bugs that are difficult to diagnose.
The reality: Vocabulary construction involves dozens of design decisions, each with measurable impact on downstream task performance. Should you include numbers? Punctuation? How do you handle hyphenated words? What about misspellings, slang, or domain-specific jargon? At what frequency threshold do you prune rare terms? How do you handle words never seen during training?
These decisions collectively determine the quality of your text representation—and thus the ceiling on your model's capabilities.
By the end of this page, you will understand: (1) The complete vocabulary construction pipeline from raw text to feature indices, (2) Tokenization strategies and their trade-offs, (3) Vocabulary pruning techniques (min/max document frequency), (4) Out-of-vocabulary (OOV) handling strategies, (5) Subword tokenization for open-vocabulary models, and (6) Production considerations for vocabulary management.
Vocabulary construction follows a multi-stage pipeline, each stage transforming text closer to the final token-to-index mapping.
```python
import re
import unicodedata
from collections import Counter
from typing import List, Dict, Set, Optional


class VocabularyBuilder:
    """
    Complete vocabulary construction pipeline with configurable stages.
    """

    # Common English stop words
    STOP_WORDS: Set[str] = {
        'a', 'an', 'the', 'is', 'are', 'was', 'were', 'be', 'been',
        'have', 'has', 'had', 'do', 'does', 'did', 'will', 'would',
        'could', 'should', 'to', 'of', 'in', 'for', 'on', 'with', 'at',
        'by', 'from', 'as', 'into', 'through', 'and', 'but', 'or', 'if',
        'this', 'that', 'these', 'those', 'i', 'you', 'he', 'she', 'it',
        'we', 'they', 'what', 'which', 'who'
    }

    def __init__(
        self,
        lowercase: bool = True,
        remove_accents: bool = True,
        remove_stop_words: bool = True,
        min_token_length: int = 2,
        max_token_length: int = 25,
        min_df: int = 1,
        max_df: float = 1.0,
        max_features: Optional[int] = None,
        token_pattern: str = r'\b[a-zA-Z]+\b'
    ):
        self.lowercase = lowercase
        self.remove_accents = remove_accents
        self.remove_stop_words = remove_stop_words
        self.min_token_length = min_token_length
        self.max_token_length = max_token_length
        self.min_df = min_df
        self.max_df = max_df
        self.max_features = max_features
        self.token_pattern = token_pattern
        self.vocabulary_: Optional[Dict[str, int]] = None
        self.document_frequency_: Optional[Dict[str, int]] = None

    def _normalize_unicode(self, text: str) -> str:
        """Normalize Unicode text (NFD normalization)."""
        return unicodedata.normalize('NFD', text)

    def _remove_accents(self, text: str) -> str:
        """Remove diacritical marks (accents)."""
        return ''.join(
            c for c in unicodedata.normalize('NFD', text)
            if unicodedata.category(c) != 'Mn'
        )

    def _normalize(self, text: str) -> str:
        """Stage 1: Text normalization."""
        text = self._normalize_unicode(text)
        if self.remove_accents:
            text = self._remove_accents(text)
        if self.lowercase:
            text = text.lower()
        return text

    def _tokenize(self, text: str) -> List[str]:
        """Stage 2: Tokenization."""
        return re.findall(self.token_pattern, text)

    def _filter_token(self, token: str) -> bool:
        """Stage 3: Token filtering - returns True if token should be kept."""
        # Length filter
        if len(token) < self.min_token_length:
            return False
        if len(token) > self.max_token_length:
            return False
        # Stop word filter
        if self.remove_stop_words and token.lower() in self.STOP_WORDS:
            return False
        return True

    def _process_document(self, document: str) -> List[str]:
        """Process a single document through normalize -> tokenize -> filter."""
        normalized = self._normalize(document)
        tokens = self._tokenize(normalized)
        return [t for t in tokens if self._filter_token(t)]

    def fit(self, corpus: List[str]) -> 'VocabularyBuilder':
        """
        Build vocabulary from corpus.
        Stages 4-6: Token transformation, selection, and index assignment.
        """
        n_docs = len(corpus)

        # Count document frequencies
        doc_freq: Counter = Counter()
        term_freq: Counter = Counter()
        for document in corpus:
            tokens = self._process_document(document)
            # Document frequency: count unique tokens per document
            doc_freq.update(set(tokens))
            # Term frequency: count all occurrences (for max_features ranking)
            term_freq.update(tokens)

        self.document_frequency_ = dict(doc_freq)

        # Stage 5: Vocabulary selection based on df thresholds
        # (an int threshold is an absolute count; a float is a proportion)
        min_count = self.min_df if isinstance(self.min_df, int) else int(self.min_df * n_docs)
        max_count = self.max_df if isinstance(self.max_df, int) else int(self.max_df * n_docs)

        # Filter by document frequency
        candidates = {
            term for term, df in doc_freq.items()
            if min_count <= df <= max_count
        }

        # Apply max_features limit (keep most frequent)
        if self.max_features is not None and len(candidates) > self.max_features:
            # Sort by term frequency, keep top N
            sorted_terms = sorted(candidates, key=lambda t: term_freq[t], reverse=True)
            candidates = set(sorted_terms[:self.max_features])

        # Stage 6: Index assignment (alphabetical for reproducibility)
        sorted_vocab = sorted(candidates)
        self.vocabulary_ = {term: idx for idx, term in enumerate(sorted_vocab)}
        return self

    def get_vocabulary(self) -> Dict[str, int]:
        """Return the vocabulary mapping."""
        if self.vocabulary_ is None:
            raise ValueError("Vocabulary not built. Call fit() first.")
        return self.vocabulary_.copy()

    def get_stats(self) -> Dict[str, int]:
        """Return vocabulary statistics."""
        if self.vocabulary_ is None:
            raise ValueError("Vocabulary not built. Call fit() first.")
        return {
            'vocabulary_size': len(self.vocabulary_),
            'total_unique_tokens_seen': len(self.document_frequency_),
            'tokens_pruned': len(self.document_frequency_) - len(self.vocabulary_),
        }


# Demonstration
corpus = [
    "Machine learning algorithms learn patterns from data.",
    "Deep learning neural networks require large amounts of data.",
    "Natural language processing applies machine learning to text.",
    "Computer vision uses deep learning for image recognition.",
    "Reinforcement learning agents learn through trial and error.",
    "Transfer learning leverages pre-trained models effectively.",
]

builder = VocabularyBuilder(
    lowercase=True,
    remove_stop_words=True,
    min_df=2,        # Term must appear in at least 2 documents
    max_df=0.9,      # Term must appear in at most 90% of documents
    min_token_length=3,
)
builder.fit(corpus)

print("Vocabulary Statistics:")
print(builder.get_stats())

print("\nVocabulary (term -> index):")
for term, idx in sorted(builder.get_vocabulary().items(), key=lambda x: x[1]):
    df = builder.document_frequency_.get(term, 0)
    print(f"  [{idx:2d}] {term:20s} (df={df})")
```

Tokenization—splitting text into atomic units—is the most consequential vocabulary decision. Different strategies produce drastically different vocabularies and downstream representations.
| Strategy | Units | Vocab Size | OOV Rate | Use Case |
|---|---|---|---|---|
| Word-level | Whitespace-delimited words | 50K-500K | 5-15% | Classical NLP, BoW, TF-IDF |
| Character-level | Individual characters | 50-200 | 0% | Spelling correction, low-resource languages |
| Subword (BPE) | Learned subword units | 10K-50K | ~0% | Neural LMs, transformers, multilingual |
| SentencePiece | Language-agnostic subwords | 8K-32K | ~0% | Multilingual, raw text processing |
| WordPiece | Likelihood-based subwords | 30K-50K | ~0% | BERT, LLMs, production NLP |
Word-Level Tokenization:
The classical approach splits on whitespace and punctuation. Simple, interpretable, and sufficient for many tasks.
Input: "Machine learning isn't rocket science!"
Tokens: ["Machine", "learning", "isn", "t", "rocket", "science"]
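A minimal sketch reproducing this behavior with a letters-only regex (the `word_tokenize` name here is illustrative):

```python
import re
from typing import List


def word_tokenize(text: str) -> List[str]:
    """Word-level tokenization with a letters-only pattern."""
    return re.findall(r"[a-zA-Z]+", text)


# The contraction "isn't" fragments into "isn" and "t"
print(word_tokenize("Machine learning isn't rocket science!"))
# ['Machine', 'learning', 'isn', 't', 'rocket', 'science']
```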
Challenges:
- Contractions fragment ("isn't" → "isn", "t"), as the example above shows.
- Hyphenated compounds ("state-of-the-art") split or merge inconsistently.
- Morphological variants ("learn", "learns", "learning") become unrelated features.
- Any word unseen during training is out-of-vocabulary at inference time.
Character-Level Tokenization:
Treats each character as a token. Zero OOV rate but loses word semantics.
Input: "Hello"
Tokens: ["H", "e", "l", "l", "o"]
Subword Tokenization (BPE, WordPiece, Unigram):
Modern approaches that learn a vocabulary of subword units balancing vocabulary size against sequence length. Common words remain whole; rare words decompose into frequent substrings.
Input: "unhappiness"
BPE Tokens: ["un", "happiness"] or ["un", "happ", "iness"]
This handles OOV elegantly: unseen words decompose into seen subwords. "Transformational" might become ["Trans", "form", "ational"] even if the full word was never in training.
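One common way to perform this decomposition at inference time is greedy longest-match segmentation, sketched here with a hypothetical toy vocabulary (real tokenizers learn their vocabularies from data, and WordPiece additionally marks continuation pieces with "##"):

```python
from typing import List, Set


def greedy_subword_split(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match subword segmentation (simplified sketch)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it matches a vocabulary entry
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:  # no piece matches -> unknown
            return [unk]
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical subword vocabulary for illustration
vocab = {"un", "happi", "ness", "trans", "form", "ational", "happiness"}
print(greedy_subword_split("unhappiness", vocab))       # ['un', 'happiness']
print(greedy_subword_split("transformational", vocab))  # ['trans', 'form', 'ational']
```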
```python
import re
from typing import List, Tuple


def word_tokenize(text: str) -> List[str]:
    """Simple word-level tokenization."""
    return re.findall(r"\b\w+\b", text.lower())


def char_tokenize(text: str) -> List[str]:
    """Character-level tokenization."""
    return list(text)


def ngram_tokenize(text: str, n: int = 3) -> List[str]:
    """Character n-gram tokenization."""
    text = text.lower().replace(" ", "_")
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class SimpleBPE:
    """
    Simplified Byte Pair Encoding (BPE) tokenizer.

    Real implementations (sentencepiece, tokenizers) are more sophisticated,
    but this illustrates the core algorithm.
    """

    def __init__(self, vocab_size: int = 1000):
        self.vocab_size = vocab_size
        self.merges: List[Tuple[str, str]] = []
        self.vocab: set = set()

    def _get_pairs(self, tokens: List[str]) -> dict:
        """Count adjacent token pairs."""
        pairs = {}
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            pairs[pair] = pairs.get(pair, 0) + 1
        return pairs

    def _merge_pair(self, tokens: List[str], pair: Tuple[str, str]) -> List[str]:
        """Merge all occurrences of a pair in token sequence."""
        merged = []
        i = 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        return merged

    def fit(self, corpus: List[str]) -> 'SimpleBPE':
        """Learn BPE merges from corpus."""
        # Initialize: split all words into characters
        word_freqs = {}
        for text in corpus:
            for word in text.lower().split():
                # Add end-of-word marker
                chars = tuple(list(word) + ['</w>'])
                word_freqs[chars] = word_freqs.get(chars, 0) + 1

        # Initial vocabulary: all characters
        self.vocab = set()
        for word in word_freqs:
            self.vocab.update(word)

        # Iteratively merge most frequent pairs
        while len(self.vocab) < self.vocab_size:
            # Count pairs across all words
            pair_counts = {}
            for word, freq in word_freqs.items():
                for i in range(len(word) - 1):
                    pair = (word[i], word[i + 1])
                    pair_counts[pair] = pair_counts.get(pair, 0) + freq

            if not pair_counts:
                break

            # Find most frequent pair
            best_pair = max(pair_counts, key=pair_counts.get)

            # Merge this pair in all words
            new_word_freqs = {}
            for word, freq in word_freqs.items():
                new_word = self._merge_pair(list(word), best_pair)
                new_word_freqs[tuple(new_word)] = freq
            word_freqs = new_word_freqs

            # Update vocabulary and merges
            merged_token = best_pair[0] + best_pair[1]
            self.vocab.add(merged_token)
            self.merges.append(best_pair)

        return self

    def tokenize(self, text: str) -> List[str]:
        """Tokenize text using learned BPE merges."""
        tokens = []
        for word in text.lower().split():
            word_tokens = list(word) + ['</w>']
            # Apply merges in order
            for pair in self.merges:
                word_tokens = self._merge_pair(word_tokens, pair)
            tokens.extend(word_tokens)
        return tokens


# Demonstration
text = "Machine learning and deep learning are transforming AI research"

print("Tokenization Strategy Comparison:")
print("=" * 60)
print(f"Input: '{text}'")

print("\n1. Word-level:")
word_tokens = word_tokenize(text)
print(f"   Tokens ({len(word_tokens)}): {word_tokens}")

print("\n2. Character-level:")
char_tokens = char_tokenize(text)
print(f"   Tokens ({len(char_tokens)}): {char_tokens[:20]}...")

print("\n3. Character trigrams:")
trigrams = ngram_tokenize(text, 3)
print(f"   Tokens ({len(trigrams)}): {trigrams[:10]}...")

# Simple BPE demo
print("\n4. BPE (simplified):")
corpus = [
    "machine learning is powerful",
    "deep learning uses neural networks",
    "learning algorithms learn patterns",
    "machine learning and deep learning",
]
bpe = SimpleBPE(vocab_size=50)
bpe.fit(corpus)
bpe_tokens = bpe.tokenize("machine learning")
print(f"   Vocabulary size: {len(bpe.vocab)}")
print(f"   Sample merges: {bpe.merges[:5]}")
print(f"   'machine learning' -> {bpe_tokens}")
```

Raw vocabularies are often too large for practical use. A typical English text corpus yields 100,000+ unique tokens, but many are noise: typos, rare domain jargon, numeric identifiers, or HTML artifacts.
Vocabulary pruning reduces dimensionality while preserving discriminative power.
Document Frequency Thresholds:
The most important pruning mechanism controls which tokens become features based on how many documents contain them.
| Threshold | Effect | Removes | Rationale |
|---|---|---|---|
| min_df = 5 | Minimum 5 documents | Rare terms, typos, noise | Rare terms don't generalize; often noise |
| min_df = 0.01 | Minimum 1% of documents | Very rare terms | Scales with corpus size |
| max_df = 0.95 | Maximum 95% of documents | Near-universal terms | Terms in 95% of docs aren't discriminative |
| max_df = 0.5 | Maximum 50% of documents | Common terms (aggressive) | Keeps only discriminative terms |
Minimum Document Frequency (min_df):
Terms appearing in very few documents are typically:
- Typos and misspellings
- Rare proper nouns and numeric identifiers
- Scraping artifacts (HTML fragments, encoding debris)
- Genuine but ultra-rare domain terms
Setting min_df = 2 or min_df = 5 eliminates most noise with minimal information loss. However, be careful with named entity recognition tasks—rare proper nouns might be exactly what you need.
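Note that scikit-learn's CountVectorizer interprets an integer min_df as an absolute document count and a float in [0, 1] as a proportion of documents—a quick check:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "machine learning from data",
    "deep learning from examples",
    "learning to rank documents",
]

# Integer min_df: absolute document count
absolute = CountVectorizer(min_df=2).fit(docs)
# Float min_df: proportion of documents
proportional = CountVectorizer(min_df=0.5).fit(docs)

print(sorted(absolute.vocabulary_))      # ['from', 'learning']
print(sorted(proportional.vocabulary_))  # ['from', 'learning']
```

Here only "learning" (3 documents) and "from" (2 documents) clear both thresholds; everything else appears in a single document and is pruned.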
Maximum Document Frequency (max_df):
Terms appearing in almost every document provide no discrimination between classes. These include:
- Stop words that survived earlier filtering
- Corpus-wide boilerplate and template text
- Domain-universal terms (e.g., "patient" in a clinical corpus)
Setting max_df = 0.9 or max_df = 0.95 removes these without aggressive filtering.
Maximum Features (max_features):
After df filtering, you can limit vocabulary to the N most frequent terms. This provides a hard cap on dimensionality, useful for memory-constrained environments or when you want to focus on the most informative features.
```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from typing import List, Tuple


def analyze_vocabulary_pruning(
    corpus: List[str],
    settings: List[dict]
) -> List[Tuple[str, int, List[str]]]:
    """
    Compare vocabulary sizes under different pruning settings.

    Args:
        corpus: List of documents
        settings: List of dicts with CountVectorizer params (plus a 'name' key)

    Returns:
        List of (setting_name, vocab_size, sample_terms)
    """
    results = []
    for setting in settings:
        name = setting.get('name', 'unnamed')
        params = {k: v for k, v in setting.items() if k != 'name'}
        vectorizer = CountVectorizer(**params)
        vectorizer.fit(corpus)
        vocab = vectorizer.get_feature_names_out()
        # Sample some terms
        sample = list(vocab[:10])
        results.append((name, len(vocab), sample))
    return results


# Synthetic corpus with various term frequency patterns
np.random.seed(42)

# Common terms (appear in 80%+ of docs)
common = ["the", "is", "of", "and", "to", "in", "that", "it"]
# Frequent terms (appear in 30-60% of docs)
frequent = ["machine", "learning", "data", "model", "algorithm", "neural"]
# Moderate terms (appear in 10-30% of docs)
moderate = ["classification", "training", "features", "prediction", "optimization"]
# Rare terms (appear in 1-5% of docs)
rare = ["hyperparameter", "backpropagation", "convolution", "regularization"]
# Very rare / typos (appear in 1 doc)
noise = ["leraning", "daat", "modle", "x7f2k", "http123"]


def generate_doc() -> str:
    """Generate a synthetic document."""
    doc = []
    doc.extend(np.random.choice(common, size=np.random.randint(5, 15)))
    doc.extend(np.random.choice(frequent, size=np.random.randint(2, 6)))
    if np.random.random() < 0.3:
        doc.extend(np.random.choice(moderate, size=np.random.randint(1, 3)))
    if np.random.random() < 0.1:
        doc.extend(np.random.choice(rare, size=1))
    if np.random.random() < 0.02:
        doc.append(np.random.choice(noise))
    return " ".join(doc)


corpus = [generate_doc() for _ in range(500)]

print(f"Corpus: {len(corpus)} documents")
print(f"Total unique tokens: {len(set(' '.join(corpus).lower().split()))}")

# Test different pruning strategies
settings = [
    {'name': 'No pruning', 'min_df': 1, 'max_df': 1.0},
    {'name': 'min_df=2', 'min_df': 2, 'max_df': 1.0},
    {'name': 'min_df=5', 'min_df': 5, 'max_df': 1.0},
    {'name': 'min_df=5, max_df=0.9', 'min_df': 5, 'max_df': 0.9},
    {'name': 'min_df=5, max_df=0.5', 'min_df': 5, 'max_df': 0.5},
    {'name': 'max_features=20', 'min_df': 1, 'max_df': 1.0, 'max_features': 20},
]

results = analyze_vocabulary_pruning(corpus, settings)

print("\n" + "=" * 70)
print("Vocabulary Pruning Comparison")
print("=" * 70)

for name, size, sample in results:
    print(f"\n{name}:")
    print(f"  Vocabulary size: {size}")
    print(f"  Sample terms: {sample}")
```

A robust starting point for most text classification tasks: min_df=5 (or 0.001 for large corpora), max_df=0.95, max_features=10000-50000. Tune from there based on task-specific validation performance. If you're removing stop words separately, max_df can be lower.
A critical production consideration: What happens when new documents contain words not in the vocabulary?
This is the Out-of-Vocabulary (OOV) problem, and it's unavoidable. Language evolves, new products launch, users make typos, and domains have long-tail terminology.
OOV Statistics:
Typical OOV rates for word-level tokenization run roughly 5-15% on general text (per the table above)—lower for in-domain news text, higher for social media, user-generated content, and shifting domains. These percentages mean substantial information loss if OOV tokens are simply ignored.
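As a concrete baseline for monitoring, OOV rate can be measured directly. A minimal sketch using naive whitespace tokenization (the `oov_rate` helper is illustrative, not a library function):

```python
from typing import List


def oov_rate(train_docs: List[str], new_docs: List[str]) -> float:
    """Fraction of tokens in new_docs absent from the training vocabulary."""
    tokenize = lambda doc: doc.lower().split()
    vocab = {tok for doc in train_docs for tok in tokenize(doc)}
    new_tokens = [tok for doc in new_docs for tok in tokenize(doc)]
    if not new_tokens:
        return 0.0
    return sum(1 for tok in new_tokens if tok not in vocab) / len(new_tokens)


train = ["machine learning processes data", "models learn patterns"]
new = ["transformer models process data"]
print(f"{oov_rate(train, new):.0%}")  # "transformer" and "process" are unseen: 50%
```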
| Strategy | Implementation | Pros | Cons |
|---|---|---|---|
| Ignore | Treat OOV as zero count | Simple, no changes needed | Loses all OOV information; can bias predictions |
| <UNK> Token | Map all OOV to single index | Preserves OOV count; single feature | All OOV terms become indistinguishable |
| Character n-grams | Use character-level features | Zero OOV rate | Longer sequences; less semantic |
| Subword Tokenization | BPE, WordPiece, SentencePiece | Near-zero OOV; compositional | Requires training; longer sequences |
| Hashing Trick | Hash terms to fixed indices | No OOV; fixed memory | Collisions possible; no inverse mapping |
| Vocabulary Expansion | Periodically retrain with new terms | Captures real vocabulary drift | Requires retraining; versioning complexity |
import numpy as npfrom sklearn.feature_extraction.text import CountVectorizer, HashingVectorizerfrom collections import Counterfrom typing import List, Tuple, Optionalimport re class OOVAwareVectorizer: """ Vectorizer with explicit OOV handling and tracking. """ def __init__( self, oov_strategy: str = "ignore", # "ignore", "unk", "hash" hash_size: int = 1000, **vectorizer_kwargs ): self.oov_strategy = oov_strategy self.hash_size = hash_size self.vectorizer_kwargs = vectorizer_kwargs self.vectorizer_: Optional[CountVectorizer] = None self.vocabulary_: Optional[set] = None self.oov_stats_: dict = {} def fit(self, corpus: List[str]) -> 'OOVAwareVectorizer': """Fit vocabulary on training corpus.""" self.vectorizer_ = CountVectorizer(**self.vectorizer_kwargs) self.vectorizer_.fit(corpus) self.vocabulary_ = set(self.vectorizer_.vocabulary_.keys()) return self def _get_tokens(self, text: str) -> List[str]: """Extract tokens using same rules as vectorizer.""" pattern = self.vectorizer_.token_pattern tokens = re.findall(pattern, text.lower()) return tokens def _count_oov(self, corpus: List[str]) -> Tuple[int, int, List[str]]: """Count OOV tokens in corpus.""" oov_count = 0 total_count = 0 oov_examples = [] for doc in corpus: tokens = self._get_tokens(doc) for token in tokens: total_count += 1 if token not in self.vocabulary_: oov_count += 1 if len(oov_examples) < 20: oov_examples.append(token) return oov_count, total_count, oov_examples def transform(self, corpus: List[str]) -> np.ndarray: """Transform corpus, applying OOV strategy.""" if self.vectorizer_ is None: raise ValueError("Vectorizer not fitted. 
Call fit() first.") # Track OOV statistics oov_count, total_count, oov_examples = self._count_oov(corpus) self.oov_stats_ = { 'oov_tokens': oov_count, 'total_tokens': total_count, 'oov_rate': oov_count / total_count if total_count > 0 else 0, 'oov_examples': oov_examples } if self.oov_strategy == "ignore": # Default behavior: OOV terms become 0 return self.vectorizer_.transform(corpus).toarray() elif self.oov_strategy == "unk": # Add <UNK> feature for OOV count base_matrix = self.vectorizer_.transform(corpus).toarray() # Count OOV per document unk_counts = [] for doc in corpus: tokens = self._get_tokens(doc) oov = sum(1 for t in tokens if t not in self.vocabulary_) unk_counts.append(oov) # Append UNK column unk_col = np.array(unk_counts).reshape(-1, 1) return np.hstack([base_matrix, unk_col]) elif self.oov_strategy == "hash": # Hash OOV terms to additional features base_matrix = self.vectorizer_.transform(corpus).toarray() # Create hash features for OOV hash_features = np.zeros((len(corpus), self.hash_size)) for doc_idx, doc in enumerate(corpus): tokens = self._get_tokens(doc) for token in tokens: if token not in self.vocabulary_: hash_idx = hash(token) % self.hash_size hash_features[doc_idx, hash_idx] += 1 return np.hstack([base_matrix, hash_features]) else: raise ValueError(f"Unknown OOV strategy: {self.oov_strategy}") # Demonstrationtrain_corpus = [ "machine learning algorithms process data", "neural networks learn patterns from examples", "deep learning models require large datasets", "supervised learning uses labeled training data",] # Test corpus with OOV termstest_corpus = [ "transformer architectures revolutionize NLP", # transformer, architectures, revolutionize, nlp = OOV "machine learning uses gradient descent optimization", # gradient, descent, optimization = OOV "convolutional networks detect image features", # convolutional, detect, image = OOV] print("OOV Handling Strategies Comparison")print("=" * 60) for strategy in ["ignore", "unk", "hash"]: 
vectorizer = OOVAwareVectorizer( oov_strategy=strategy, hash_size=100, min_df=1 ) vectorizer.fit(train_corpus) X = vectorizer.transform(test_corpus) stats = vectorizer.oov_stats_ print(f"[{strategy.upper()}] Strategy:") print(f" Output shape: {X.shape}") print(f" OOV rate: {stats['oov_rate']:.1%}") print(f" OOV examples: {stats['oov_examples'][:5]}")High OOV rates in production can silently degrade model accuracy. Monitor OOV rates as a key metric. If OOV exceeds 10%, consider vocabulary expansion, subword tokenization, or the hashing trick. Alert if OOV spikes—it often indicates domain drift or data quality issues.
Subword tokenization has become the dominant approach in modern NLP, powering models from BERT to GPT. Understanding it is essential professional knowledge.
Core Insight:
Subword tokenization learns a vocabulary of subword units that optimally balance:
- Vocabulary size (memory, embedding table, and softmax cost)
- Sequence length (more splits mean longer token sequences)
- Coverage (rare and unseen words must remain representable)
The learning process finds frequently co-occurring character sequences and treats them as atomic units.
The Three Major Algorithms:
1. Byte Pair Encoding (BPE): Starts from individual characters and greedily merges the most frequent adjacent pair until the target vocabulary size is reached. Used by the GPT family.
2. WordPiece: A similar merging procedure, but it selects the merge that most increases training-data likelihood rather than raw pair frequency; continuation pieces carry a "##" prefix. Used by BERT.
3. Unigram Language Model (SentencePiece): Starts from a large candidate vocabulary and iteratively prunes the units whose removal least hurts a unigram language model's likelihood. SentencePiece implements this (and BPE) directly on raw text, treating whitespace as an ordinary symbol.
```python
# Using the Hugging Face tokenizers library
# pip install tokenizers

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace


def train_bpe_tokenizer(corpus: list, vocab_size: int = 1000) -> Tokenizer:
    """Train a BPE tokenizer on a corpus."""
    # Initialize BPE model
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

    # Pre-tokenization: split on whitespace first
    tokenizer.pre_tokenizer = Whitespace()

    # Trainer configuration
    trainer = BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]"]
    )

    # Train on corpus
    tokenizer.train_from_iterator(corpus, trainer)
    return tokenizer


# Example corpus for training
training_corpus = [
    "machine learning is transforming artificial intelligence research",
    "deep learning neural networks process complex data patterns",
    "natural language processing enables machines to understand text",
    "computer vision algorithms detect objects in images",
    "reinforcement learning agents learn through trial and error",
    "transformer architectures have revolutionized language models",
    "pre-trained models enable transfer learning across domains",
    "unsupervised learning discovers patterns without labeled data",
] * 100  # Repeat for meaningful statistics

# Train tokenizer
print("Training BPE tokenizer...")
tokenizer = train_bpe_tokenizer(training_corpus, vocab_size=500)

print(f"Vocabulary size: {tokenizer.get_vocab_size()}")

# Test tokenization
test_sentences = [
    "machine learning",    # Common terms
    "transformer models",  # May be split
    "superintelligence",   # Rare word
    "xyzabc123",           # Total OOV
]

print("\nTokenization examples:")
for sentence in test_sentences:
    output = tokenizer.encode(sentence)
    print(f"  '{sentence}'")
    print(f"    -> {output.tokens}")

# Show how OOV is handled
print("\nOOV Handling Demonstration:")
print("  'superintelligence' is unseen, but decomposes into:")
output = tokenizer.encode("superintelligence")
print(f"  {output.tokens}")
print("  Each subword was seen during training, so zero OOV!")
```

Managing vocabulary in production systems requires careful attention to versioning, consistency, and monitoring.
```python
import json
import hashlib
import re
from datetime import datetime
from pathlib import Path
from typing import Dict, Optional

import joblib
from sklearn.feature_extraction.text import CountVectorizer


class ProductionVocabulary:
    """
    Production-ready vocabulary manager with versioning and monitoring.
    """

    def __init__(
        self,
        vectorizer: CountVectorizer,
        metadata: Optional[Dict] = None
    ):
        self.vectorizer = vectorizer
        self.vocabulary = set(vectorizer.vocabulary_.keys())
        self.metadata = metadata or {}

        # Generate version hash from the sorted vocabulary
        vocab_str = ''.join(sorted(self.vocabulary))
        self.version_hash = hashlib.sha256(vocab_str.encode()).hexdigest()[:12]

        self.metadata.update({
            'version_hash': self.version_hash,
            'vocabulary_size': len(self.vocabulary),
            'created_at': datetime.now().isoformat(),
        })

    def save(self, directory: str) -> None:
        """Save vocabulary and vectorizer to directory."""
        path = Path(directory)
        path.mkdir(parents=True, exist_ok=True)

        # Save vectorizer
        joblib.dump(self.vectorizer, path / 'vectorizer.joblib')

        # Save vocabulary as text (for inspection)
        with open(path / 'vocabulary.txt', 'w') as f:
            for term in sorted(self.vocabulary):
                f.write(f"{term}\n")

        # Save metadata
        with open(path / 'metadata.json', 'w') as f:
            json.dump(self.metadata, f, indent=2)

        print(f"Saved vocabulary version {self.version_hash} to {directory}")

    @classmethod
    def load(cls, directory: str) -> 'ProductionVocabulary':
        """Load vocabulary from directory."""
        path = Path(directory)
        vectorizer = joblib.load(path / 'vectorizer.joblib')
        with open(path / 'metadata.json', 'r') as f:
            metadata = json.load(f)

        # Capture the saved hash before __init__ recomputes and overwrites it
        expected_hash = metadata.get('version_hash')
        loaded = cls(vectorizer, metadata)

        # Verify the recomputed hash matches the saved one
        if loaded.version_hash != expected_hash:
            raise ValueError(
                f"Vocabulary hash mismatch! Expected {expected_hash}, "
                f"got {loaded.version_hash}"
            )
        return loaded

    def check_oov(self, text: str) -> Dict:
        """Analyze OOV tokens in new text."""
        # Use same tokenization as vectorizer
        tokens = re.findall(self.vectorizer.token_pattern, text.lower())
        oov_tokens = [t for t in tokens if t not in self.vocabulary]
        return {
            'total_tokens': len(tokens),
            'oov_count': len(oov_tokens),
            'oov_rate': len(oov_tokens) / len(tokens) if tokens else 0,
            'oov_tokens': oov_tokens[:10],  # Sample
        }


# Demonstration
train_data = [
    "machine learning algorithms",
    "neural network architectures",
    "deep learning models",
    "natural language processing",
]

vectorizer = CountVectorizer(min_df=1)
vectorizer.fit(train_data)

vocab_manager = ProductionVocabulary(
    vectorizer=vectorizer,
    metadata={
        'training_samples': len(train_data),
        'model_name': 'text_classifier_v1',
    },
)

print("Production Vocabulary Manager")
print("=" * 50)
print(f"Version: {vocab_manager.version_hash}")
print(f"Size: {len(vocab_manager.vocabulary)} terms")

# Check OOV on new data
test_texts = [
    "machine learning is great",        # Mostly known
    "transformer models are powerful",  # Some OOV
    "quantum computing uses qubits",    # Mostly OOV
]

print("\nOOV Analysis on New Data:")
for text in test_texts:
    analysis = vocab_manager.check_oov(text)
    print(f"  '{text}'")
    print(f"    OOV rate: {analysis['oov_rate']:.1%}")
    print(f"    OOV tokens: {analysis['oov_tokens']}")
```

We've explored vocabulary construction comprehensively—from the multi-stage pipeline through tokenization strategies, pruning mechanisms, OOV handling, subword approaches, and production considerations. The key insights:

- Tokenization strategy (word, character, or subword) is the single most consequential vocabulary decision.
- Document-frequency pruning (min_df, max_df) removes noise and non-discriminative terms with minimal information loss.
- OOV is unavoidable: choose an explicit strategy (<UNK>, hashing, subwords) and monitor OOV rates in production.
- Subword tokenization achieves near-zero OOV and underpins modern neural NLP.
- Production vocabularies need versioning, consistency checks, and drift monitoring.
Looking Ahead:
With vocabulary construction mastered, the next page explores Sparse Representation—the data structures that make high-dimensional BoW vectors computationally tractable. Understanding sparsity is essential for working with text at scale, where naive dense representations quickly exhaust memory and computation budgets.
You now have comprehensive knowledge of vocabulary construction—the decisions that define your text feature space. From tokenization strategy through OOV handling to production management, you understand the full lifecycle of vocabulary in text ML systems. Next, we'll explore how sparse representations make these high-dimensional vocabularies computationally feasible.