Every text machine learning model rests upon a vocabulary—the set of tokens recognized as features. This vocabulary defines the dimension of your feature space, determines what information can be captured, and profoundly influences model performance.
Yet vocabulary construction is often treated as an afterthought: split on whitespace, maybe lowercase, use whatever results. This casual approach leaves significant performance on the table and can introduce subtle bugs that are difficult to diagnose.
The reality: Vocabulary construction involves dozens of design decisions, each with measurable impact on downstream task performance. Should you include numbers? Punctuation? How do you handle hyphenated words? What about misspellings, slang, or domain-specific jargon? At what frequency threshold do you prune rare terms? How do you handle words never seen during training?
These decisions collectively determine the quality of your text representation—and thus the ceiling on your model's capabilities.
By the end of this page, you will understand: (1) The complete vocabulary construction pipeline from raw text to feature indices, (2) Tokenization strategies and their trade-offs, (3) Vocabulary pruning techniques (min/max document frequency), (4) Out-of-vocabulary (OOV) handling strategies, (5) Subword tokenization for open-vocabulary models, and (6) Production considerations for vocabulary management.
Vocabulary construction follows a multi-stage pipeline, each stage transforming text closer to the final token-to-index mapping.
```python
import re
import unicodedata
from collections import Counter
from typing import List, Dict, Set, Optional


class VocabularyBuilder:
    """
    Complete vocabulary construction pipeline with configurable stages.
    """

    # Common English stop words
    STOP_WORDS: Set[str] = {
        'a', 'an', 'the', 'is', 'are', 'was', 'were', 'be', 'been',
        'have', 'has', 'had', 'do', 'does', 'did', 'will', 'would',
        'could', 'should', 'to', 'of', 'in', 'for', 'on', 'with', 'at',
        'by', 'from', 'as', 'into', 'through', 'and', 'but', 'or', 'if',
        'this', 'that', 'these', 'those', 'i', 'you', 'he', 'she', 'it',
        'we', 'they', 'what', 'which', 'who'
    }

    def __init__(
        self,
        lowercase: bool = True,
        remove_accents: bool = True,
        remove_stop_words: bool = True,
        min_token_length: int = 2,
        max_token_length: int = 25,
        min_df: int = 1,
        max_df: float = 1.0,
        max_features: Optional[int] = None,
        token_pattern: str = r'\b[a-zA-Z]+\b'
    ):
        self.lowercase = lowercase
        self.remove_accents = remove_accents
        self.remove_stop_words = remove_stop_words
        self.min_token_length = min_token_length
        self.max_token_length = max_token_length
        self.min_df = min_df
        self.max_df = max_df
        self.max_features = max_features
        self.token_pattern = token_pattern
        self.vocabulary_: Optional[Dict[str, int]] = None
        self.document_frequency_: Optional[Dict[str, int]] = None

    def _normalize_unicode(self, text: str) -> str:
        """Normalize Unicode text (NFD normalization)."""
        return unicodedata.normalize('NFD', text)

    def _remove_accents(self, text: str) -> str:
        """Remove diacritical marks (accents)."""
        return ''.join(
            c for c in unicodedata.normalize('NFD', text)
            if unicodedata.category(c) != 'Mn'
        )

    def _normalize(self, text: str) -> str:
        """Stage 1: Text normalization."""
        text = self._normalize_unicode(text)
        if self.remove_accents:
            text = self._remove_accents(text)
        if self.lowercase:
            text = text.lower()
        return text

    def _tokenize(self, text: str) -> List[str]:
        """Stage 2: Tokenization."""
        return re.findall(self.token_pattern, text)

    def _filter_token(self, token: str) -> bool:
        """Stage 3: Token filtering - returns True if token should be kept."""
        # Length filter
        if len(token) < self.min_token_length:
            return False
        if len(token) > self.max_token_length:
            return False
        # Stop word filter
        if self.remove_stop_words and token.lower() in self.STOP_WORDS:
            return False
        return True

    def _process_document(self, document: str) -> List[str]:
        """Process a single document through normalize -> tokenize -> filter."""
        normalized = self._normalize(document)
        tokens = self._tokenize(normalized)
        return [t for t in tokens if self._filter_token(t)]

    def fit(self, corpus: List[str]) -> 'VocabularyBuilder':
        """
        Build vocabulary from corpus.
        Stages 4-6: Token transformation, selection, and index assignment.
        """
        n_docs = len(corpus)

        # Count document frequencies
        doc_freq: Counter = Counter()
        term_freq: Counter = Counter()
        for document in corpus:
            tokens = self._process_document(document)
            # Document frequency: count unique tokens per document
            doc_freq.update(set(tokens))
            # Term frequency: count all occurrences (for max_features ranking)
            term_freq.update(tokens)

        self.document_frequency_ = dict(doc_freq)

        # Stage 5: Vocabulary selection based on df thresholds
        # (an int threshold is an absolute count; a float is a proportion)
        min_count = self.min_df if isinstance(self.min_df, int) else int(self.min_df * n_docs)
        max_count = self.max_df if isinstance(self.max_df, int) else int(self.max_df * n_docs)

        # Filter by document frequency
        candidates = {
            term for term, df in doc_freq.items()
            if min_count <= df <= max_count
        }

        # Apply max_features limit (keep most frequent)
        if self.max_features is not None and len(candidates) > self.max_features:
            # Sort by term frequency, keep top N
            sorted_terms = sorted(candidates, key=lambda t: term_freq[t], reverse=True)
            candidates = set(sorted_terms[:self.max_features])

        # Stage 6: Index assignment (alphabetical for reproducibility)
        sorted_vocab = sorted(candidates)
        self.vocabulary_ = {term: idx for idx, term in enumerate(sorted_vocab)}
        return self

    def get_vocabulary(self) -> Dict[str, int]:
        """Return the vocabulary mapping."""
        if self.vocabulary_ is None:
            raise ValueError("Vocabulary not built. Call fit() first.")
        return self.vocabulary_.copy()

    def get_stats(self) -> Dict[str, int]:
        """Return vocabulary statistics."""
        if self.vocabulary_ is None:
            raise ValueError("Vocabulary not built. Call fit() first.")
        return {
            'vocabulary_size': len(self.vocabulary_),
            'total_unique_tokens_seen': len(self.document_frequency_),
            'tokens_pruned': len(self.document_frequency_) - len(self.vocabulary_),
        }


# Demonstration
corpus = [
    "Machine learning algorithms learn patterns from data.",
    "Deep learning neural networks require large amounts of data.",
    "Natural language processing applies machine learning to text.",
    "Computer vision uses deep learning for image recognition.",
    "Reinforcement learning agents learn through trial and error.",
    "Transfer learning leverages pre-trained models effectively.",
]

builder = VocabularyBuilder(
    lowercase=True,
    remove_stop_words=True,
    min_df=2,        # Term must appear in at least 2 documents
    max_df=0.9,      # Term must appear in at most 90% of documents
    min_token_length=3,
)
builder.fit(corpus)

print("Vocabulary Statistics:")
print(builder.get_stats())

print("\nVocabulary (term -> index):")
for term, idx in sorted(builder.get_vocabulary().items(), key=lambda x: x[1]):
    df = builder.document_frequency_.get(term, 0)
    print(f"  [{idx:2d}] {term:20s} (df={df})")
```

Tokenization—splitting text into atomic units—is the most consequential vocabulary decision. Different strategies produce drastically different vocabularies and downstream representations.
| Strategy | Units | Vocab Size | OOV Rate | Use Case |
|---|---|---|---|---|
| Word-level | Whitespace-delimited words | 50K-500K | 5-15% | Classical NLP, BoW, TF-IDF |
| Character-level | Individual characters | 50-200 | 0% | Spelling correction, low-resource languages |
| Subword (BPE) | Learned subword units | 10K-50K | ~0% | Neural LMs, transformers, multilingual |
| SentencePiece | Language-agnostic subwords | 8K-32K | ~0% | Multilingual, raw text processing |
| WordPiece | Likelihood-based subwords | 30K-50K | ~0% | BERT, LLMs, production NLP |
Word-Level Tokenization:
The classical approach splits on whitespace and punctuation. Simple, interpretable, and sufficient for many tasks.
Input: "Machine learning isn't rocket science!"
Tokens: ["Machine", "learning", "isn", "t", "rocket", "science"]
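A minimal sketch reproducing this behavior with a letters-only regex (the `word_tokenize` name here is illustrative):

```python
import re
from typing import List


def word_tokenize(text: str) -> List[str]:
    """Word-level tokenization with a letters-only pattern."""
    return re.findall(r"[a-zA-Z]+", text)


# The contraction "isn't" fragments into "isn" and "t"
print(word_tokenize("Machine learning isn't rocket science!"))
# ['Machine', 'learning', 'isn', 't', 'rocket', 'science']
```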
Challenges:
- Contractions fragment ("isn't" → "isn", "t"), as the example above shows.
- Hyphenated compounds ("state-of-the-art") split or merge inconsistently.
- Morphological variants ("learn", "learns", "learning") become unrelated features.
- Any word unseen during training is out-of-vocabulary at inference time.
Character-Level Tokenization:
Treats each character as a token. Zero OOV rate but loses word semantics.
Input: "Hello"
Tokens: ["H", "e", "l", "l", "o"]
Subword Tokenization (BPE, WordPiece, Unigram):
Modern approaches that learn a vocabulary of subword units balancing vocabulary size against sequence length. Common words remain whole; rare words decompose into frequent substrings.
Input: "unhappiness"
BPE Tokens: ["un", "happiness"] or ["un", "happ", "iness"]
This handles OOV elegantly: unseen words decompose into seen subwords. "Transformational" might become ["Trans", "form", "ational"] even if the full word was never in training.
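One common way to perform this decomposition at inference time is greedy longest-match segmentation, sketched here with a hypothetical toy vocabulary (real tokenizers learn their vocabularies from data, and WordPiece additionally marks continuation pieces with "##"):

```python
from typing import List, Set


def greedy_subword_split(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match subword segmentation (simplified sketch)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it matches a vocabulary entry
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:  # no piece matches -> unknown
            return [unk]
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical subword vocabulary for illustration
vocab = {"un", "happi", "ness", "trans", "form", "ational", "happiness"}
print(greedy_subword_split("unhappiness", vocab))       # ['un', 'happiness']
print(greedy_subword_split("transformational", vocab))  # ['trans', 'form', 'ational']
```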
```python
import re
from typing import List, Tuple


def word_tokenize(text: str) -> List[str]:
    """Simple word-level tokenization."""
    return re.findall(r"\b\w+\b", text.lower())


def char_tokenize(text: str) -> List[str]:
    """Character-level tokenization."""
    return list(text)


def ngram_tokenize(text: str, n: int = 3) -> List[str]:
    """Character n-gram tokenization."""
    text = text.lower().replace(" ", "_")
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class SimpleBPE:
    """
    Simplified Byte Pair Encoding (BPE) tokenizer.

    Real implementations (sentencepiece, tokenizers) are more sophisticated,
    but this illustrates the core algorithm.
    """

    def __init__(self, vocab_size: int = 1000):
        self.vocab_size = vocab_size
        self.merges: List[Tuple[str, str]] = []
        self.vocab: set = set()

    def _get_pairs(self, tokens: List[str]) -> dict:
        """Count adjacent token pairs."""
        pairs = {}
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            pairs[pair] = pairs.get(pair, 0) + 1
        return pairs

    def _merge_pair(self, tokens: List[str], pair: Tuple[str, str]) -> List[str]:
        """Merge all occurrences of a pair in token sequence."""
        merged = []
        i = 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        return merged

    def fit(self, corpus: List[str]) -> 'SimpleBPE':
        """Learn BPE merges from corpus."""
        # Initialize: split all words into characters
        word_freqs = {}
        for text in corpus:
            for word in text.lower().split():
                # Add end-of-word marker
                chars = tuple(list(word) + ['</w>'])
                word_freqs[chars] = word_freqs.get(chars, 0) + 1

        # Initial vocabulary: all characters
        self.vocab = set()
        for word in word_freqs:
            self.vocab.update(word)

        # Iteratively merge most frequent pairs
        while len(self.vocab) < self.vocab_size:
            # Count pairs across all words
            pair_counts = {}
            for word, freq in word_freqs.items():
                for i in range(len(word) - 1):
                    pair = (word[i], word[i + 1])
                    pair_counts[pair] = pair_counts.get(pair, 0) + freq

            if not pair_counts:
                break

            # Find most frequent pair
            best_pair = max(pair_counts, key=pair_counts.get)

            # Merge this pair in all words
            new_word_freqs = {}
            for word, freq in word_freqs.items():
                new_word = self._merge_pair(list(word), best_pair)
                new_word_freqs[tuple(new_word)] = freq
            word_freqs = new_word_freqs

            # Update vocabulary and merges
            merged_token = best_pair[0] + best_pair[1]
            self.vocab.add(merged_token)
            self.merges.append(best_pair)

        return self

    def tokenize(self, text: str) -> List[str]:
        """Tokenize text using learned BPE merges."""
        tokens = []
        for word in text.lower().split():
            word_tokens = list(word) + ['</w>']
            # Apply merges in order
            for pair in self.merges:
                word_tokens = self._merge_pair(word_tokens, pair)
            tokens.extend(word_tokens)
        return tokens


# Demonstration
text = "Machine learning and deep learning are transforming AI research"

print("Tokenization Strategy Comparison:")
print("=" * 60)
print(f"Input: '{text}'")

print("\n1. Word-level:")
word_tokens = word_tokenize(text)
print(f"   Tokens ({len(word_tokens)}): {word_tokens}")

print("\n2. Character-level:")
char_tokens = char_tokenize(text)
print(f"   Tokens ({len(char_tokens)}): {char_tokens[:20]}...")

print("\n3. Character trigrams:")
trigrams = ngram_tokenize(text, 3)
print(f"   Tokens ({len(trigrams)}): {trigrams[:10]}...")

# Simple BPE demo
print("\n4. BPE (simplified):")
corpus = [
    "machine learning is powerful",
    "deep learning uses neural networks",
    "learning algorithms learn patterns",
    "machine learning and deep learning",
]
bpe = SimpleBPE(vocab_size=50)
bpe.fit(corpus)
bpe_tokens = bpe.tokenize("machine learning")
print(f"   Vocabulary size: {len(bpe.vocab)}")
print(f"   Sample merges: {bpe.merges[:5]}")
print(f"   'machine learning' -> {bpe_tokens}")
```

Raw vocabularies are often too large for practical use. A typical English text corpus yields 100,000+ unique tokens, but many are noise: typos, rare domain jargon, numeric identifiers, or HTML artifacts.
Vocabulary pruning reduces dimensionality while preserving discriminative power.
Document Frequency Thresholds:
The most important pruning mechanism controls which tokens become features based on how many documents contain them.
| Threshold | Effect | Removes | Rationale |
|---|---|---|---|
| min_df = 5 | Minimum 5 documents | Rare terms, typos, noise | Rare terms don't generalize; often noise |
| min_df = 0.01 | Minimum 1% of documents | Very rare terms | Scales with corpus size |
| max_df = 0.95 | Maximum 95% of documents | Near-universal terms | Terms in 95% of docs aren't discriminative |
| max_df = 0.5 | Maximum 50% of documents | Common terms (aggressive) | Keeps only discriminative terms |
Minimum Document Frequency (min_df):
Terms appearing in very few documents are typically:
- Typos and misspellings
- Rare proper nouns and numeric identifiers
- Scraping artifacts (HTML fragments, encoding debris)
- Genuine but ultra-rare domain terms
Setting min_df = 2 or min_df = 5 eliminates most noise with minimal information loss. However, be careful with named entity recognition tasks—rare proper nouns might be exactly what you need.
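Note that scikit-learn's CountVectorizer interprets an integer min_df as an absolute document count and a float in [0, 1] as a proportion of documents—a quick check:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "machine learning from data",
    "deep learning from examples",
    "learning to rank documents",
]

# Integer min_df: absolute document count
absolute = CountVectorizer(min_df=2).fit(docs)
# Float min_df: proportion of documents
proportional = CountVectorizer(min_df=0.5).fit(docs)

print(sorted(absolute.vocabulary_))      # ['from', 'learning']
print(sorted(proportional.vocabulary_))  # ['from', 'learning']
```

Here only "learning" (3 documents) and "from" (2 documents) clear both thresholds; everything else appears in a single document and is pruned.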
Maximum Document Frequency (max_df):
Terms appearing in almost every document provide no discrimination between classes. These include:
- Stop words that survived earlier filtering
- Corpus-wide boilerplate and template text
- Domain-universal terms (e.g., "patient" in a clinical corpus)
Setting max_df = 0.9 or max_df = 0.95 removes these without aggressive filtering.
Maximum Features (max_features):
After df filtering, you can limit vocabulary to the N most frequent terms. This provides a hard cap on dimensionality, useful for memory-constrained environments or when you want to focus on the most informative features.
```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from typing import List, Tuple


def analyze_vocabulary_pruning(
    corpus: List[str],
    settings: List[dict]
) -> List[Tuple[str, int, List[str]]]:
    """
    Compare vocabulary sizes under different pruning settings.

    Args:
        corpus: List of documents
        settings: List of dicts with CountVectorizer params (plus a 'name' key)

    Returns:
        List of (setting_name, vocab_size, sample_terms)
    """
    results = []
    for setting in settings:
        name = setting.get('name', 'unnamed')
        params = {k: v for k, v in setting.items() if k != 'name'}
        vectorizer = CountVectorizer(**params)
        vectorizer.fit(corpus)
        vocab = vectorizer.get_feature_names_out()
        # Sample some terms
        sample = list(vocab[:10])
        results.append((name, len(vocab), sample))
    return results


# Synthetic corpus with various term frequency patterns
np.random.seed(42)

# Common terms (appear in 80%+ of docs)
common = ["the", "is", "of", "and", "to", "in", "that", "it"]
# Frequent terms (appear in 30-60% of docs)
frequent = ["machine", "learning", "data", "model", "algorithm", "neural"]
# Moderate terms (appear in 10-30% of docs)
moderate = ["classification", "training", "features", "prediction", "optimization"]
# Rare terms (appear in 1-5% of docs)
rare = ["hyperparameter", "backpropagation", "convolution", "regularization"]
# Very rare / typos (appear in 1 doc)
noise = ["leraning", "daat", "modle", "x7f2k", "http123"]


def generate_doc() -> str:
    """Generate a synthetic document."""
    doc = []
    doc.extend(np.random.choice(common, size=np.random.randint(5, 15)))
    doc.extend(np.random.choice(frequent, size=np.random.randint(2, 6)))
    if np.random.random() < 0.3:
        doc.extend(np.random.choice(moderate, size=np.random.randint(1, 3)))
    if np.random.random() < 0.1:
        doc.extend(np.random.choice(rare, size=1))
    if np.random.random() < 0.02:
        doc.append(np.random.choice(noise))
    return " ".join(doc)


corpus = [generate_doc() for _ in range(500)]

print(f"Corpus: {len(corpus)} documents")
print(f"Total unique tokens: {len(set(' '.join(corpus).lower().split()))}")

# Test different pruning strategies
settings = [
    {'name': 'No pruning', 'min_df': 1, 'max_df': 1.0},
    {'name': 'min_df=2', 'min_df': 2, 'max_df': 1.0},
    {'name': 'min_df=5', 'min_df': 5, 'max_df': 1.0},
    {'name': 'min_df=5, max_df=0.9', 'min_df': 5, 'max_df': 0.9},
    {'name': 'min_df=5, max_df=0.5', 'min_df': 5, 'max_df': 0.5},
    {'name': 'max_features=20', 'min_df': 1, 'max_df': 1.0, 'max_features': 20},
]

results = analyze_vocabulary_pruning(corpus, settings)

print("\n" + "=" * 70)
print("Vocabulary Pruning Comparison")
print("=" * 70)

for name, size, sample in results:
    print(f"\n{name}:")
    print(f"  Vocabulary size: {size}")
    print(f"  Sample terms: {sample}")
```

A robust starting point for most text classification tasks: min_df=5 (or 0.001 for large corpora), max_df=0.95, max_features=10000-50000. Tune from there based on task-specific validation performance. If you're removing stop words separately, max_df can be lower.
A critical production consideration: What happens when new documents contain words not in the vocabulary?
This is the Out-of-Vocabulary (OOV) problem, and it's unavoidable. Language evolves, new products launch, users make typos, and domains have long-tail terminology.
OOV Statistics:
Typical OOV rates for word-level tokenization run roughly 5-15% on general text (per the table above)—lower for in-domain news text, higher for social media, user-generated content, and shifting domains. These percentages mean substantial information loss if OOV tokens are simply ignored.
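As a concrete baseline for monitoring, OOV rate can be measured directly. A minimal sketch using naive whitespace tokenization (the `oov_rate` helper is illustrative, not a library function):

```python
from typing import List


def oov_rate(train_docs: List[str], new_docs: List[str]) -> float:
    """Fraction of tokens in new_docs absent from the training vocabulary."""
    tokenize = lambda doc: doc.lower().split()
    vocab = {tok for doc in train_docs for tok in tokenize(doc)}
    new_tokens = [tok for doc in new_docs for tok in tokenize(doc)]
    if not new_tokens:
        return 0.0
    return sum(1 for tok in new_tokens if tok not in vocab) / len(new_tokens)


train = ["machine learning processes data", "models learn patterns"]
new = ["transformer models process data"]
print(f"{oov_rate(train, new):.0%}")  # "transformer" and "process" are unseen: 50%
```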
| Strategy | Implementation | Pros | Cons |
|---|---|---|---|
| Ignore | Treat OOV as zero count | Simple, no changes needed | Loses all OOV information; can bias predictions |
| <UNK> Token | Map all OOV to single index | Preserves OOV count; single feature | All OOV terms become indistinguishable |
| Character n-grams | Use character-level features | Zero OOV rate | Longer sequences; less semantic |
| Subword Tokenization | BPE, WordPiece, SentencePiece | Near-zero OOV; compositional | Requires training; longer sequences |
| Hashing Trick | Hash terms to fixed indices | No OOV; fixed memory | Collisions possible; no inverse mapping |
| Vocabulary Expansion | Periodically retrain with new terms | Captures real vocabulary drift | Requires retraining; versioning complexity |
import numpy as npfrom sklearn.feature_extraction.text import CountVectorizer, HashingVectorizerfrom collections import Counterfrom typing import List, Tuple, Optionalimport re class OOVAwareVectorizer: """ Vectorizer with explicit OOV handling and tracking. """ def __init__( self, oov_strategy: str = "ignore", # "ignore", "unk", "hash" hash_size: int = 1000, **vectorizer_kwargs ): self.oov_strategy = oov_strategy self.hash_size = hash_size self.vectorizer_kwargs = vectorizer_kwargs self.vectorizer_: Optional[CountVectorizer] = None self.vocabulary_: Optional[set] = None self.oov_stats_: dict = {} def fit(self, corpus: List[str]) -> 'OOVAwareVectorizer': """Fit vocabulary on training corpus.""" self.vectorizer_ = CountVectorizer(**self.vectorizer_kwargs) self.vectorizer_.fit(corpus) self.vocabulary_ = set(self.vectorizer_.vocabulary_.keys()) return self def _get_tokens(self, text: str) -> List[str]: """Extract tokens using same rules as vectorizer.""" pattern = self.vectorizer_.token_pattern tokens = re.findall(pattern, text.lower()) return tokens def _count_oov(self, corpus: List[str]) -> Tuple[int, int, List[str]]: """Count OOV tokens in corpus.""" oov_count = 0 total_count = 0 oov_examples = [] for doc in corpus: tokens = self._get_tokens(doc) for token in tokens: total_count += 1 if token not in self.vocabulary_: oov_count += 1 if len(oov_examples) < 20: oov_examples.append(token) return oov_count, total_count, oov_examples def transform(self, corpus: List[str]) -> np.ndarray: """Transform corpus, applying OOV strategy.""" if self.vectorizer_ is None: raise ValueError("Vectorizer not fitted. 
Call fit() first.") # Track OOV statistics oov_count, total_count, oov_examples = self._count_oov(corpus) self.oov_stats_ = { 'oov_tokens': oov_count, 'total_tokens': total_count, 'oov_rate': oov_count / total_count if total_count > 0 else 0, 'oov_examples': oov_examples } if self.oov_strategy == "ignore": # Default behavior: OOV terms become 0 return self.vectorizer_.transform(corpus).toarray() elif self.oov_strategy == "unk": # Add <UNK> feature for OOV count base_matrix = self.vectorizer_.transform(corpus).toarray() # Count OOV per document unk_counts = [] for doc in corpus: tokens = self._get_tokens(doc) oov = sum(1 for t in tokens if t not in self.vocabulary_) unk_counts.append(oov) # Append UNK column unk_col = np.array(unk_counts).reshape(-1, 1) return np.hstack([base_matrix, unk_col]) elif self.oov_strategy == "hash": # Hash OOV terms to additional features base_matrix = self.vectorizer_.transform(corpus).toarray() # Create hash features for OOV hash_features = np.zeros((len(corpus), self.hash_size)) for doc_idx, doc in enumerate(corpus): tokens = self._get_tokens(doc) for token in tokens: if token not in self.vocabulary_: hash_idx = hash(token) % self.hash_size hash_features[doc_idx, hash_idx] += 1 return np.hstack([base_matrix, hash_features]) else: raise ValueError(f"Unknown OOV strategy: {self.oov_strategy}") # Demonstrationtrain_corpus = [ "machine learning algorithms process data", "neural networks learn patterns from examples", "deep learning models require large datasets", "supervised learning uses labeled training data",] # Test corpus with OOV termstest_corpus = [ "transformer architectures revolutionize NLP", # transformer, architectures, revolutionize, nlp = OOV "machine learning uses gradient descent optimization", # gradient, descent, optimization = OOV "convolutional networks detect image features", # convolutional, detect, image = OOV] print("OOV Handling Strategies Comparison")print("=" * 60) for strategy in ["ignore", "unk", "hash"]: 
vectorizer = OOVAwareVectorizer( oov_strategy=strategy, hash_size=100, min_df=1 ) vectorizer.fit(train_corpus) X = vectorizer.transform(test_corpus) stats = vectorizer.oov_stats_ print(f"[{strategy.upper()}] Strategy:") print(f" Output shape: {X.shape}") print(f" OOV rate: {stats['oov_rate']:.1%}") print(f" OOV examples: {stats['oov_examples'][:5]}")High OOV rates in production can silently degrade model accuracy. Monitor OOV rates as a key metric. If OOV exceeds 10%, consider vocabulary expansion, subword tokenization, or the hashing trick. Alert if OOV spikes—it often indicates domain drift or data quality issues.
Subword tokenization has become the dominant approach in modern NLP, powering models from BERT to GPT. Understanding it is essential professional knowledge.
Core Insight:
Subword tokenization learns a vocabulary of subword units that optimally balance:
- Vocabulary size (memory, embedding table, and softmax cost)
- Sequence length (more splits mean longer token sequences)
- Coverage (rare and unseen words must remain representable)
The learning process finds frequently co-occurring character sequences and treats them as atomic units.
The Three Major Algorithms:
1. Byte Pair Encoding (BPE): Starts from individual characters and greedily merges the most frequent adjacent pair until the target vocabulary size is reached. Used by the GPT family.
2. WordPiece: A similar merging procedure, but it selects the merge that most increases training-data likelihood rather than raw pair frequency; continuation pieces carry a "##" prefix. Used by BERT.
3. Unigram Language Model (SentencePiece): Starts from a large candidate vocabulary and iteratively prunes the units whose removal least hurts a unigram language model's likelihood. SentencePiece implements this (and BPE) directly on raw text, treating whitespace as an ordinary symbol.
```python
# Using the Hugging Face tokenizers library
# pip install tokenizers

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace


def train_bpe_tokenizer(corpus: list, vocab_size: int = 1000) -> Tokenizer:
    """Train a BPE tokenizer on a corpus."""
    # Initialize BPE model
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

    # Pre-tokenization: split on whitespace first
    tokenizer.pre_tokenizer = Whitespace()

    # Trainer configuration
    trainer = BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]"]
    )

    # Train on corpus
    tokenizer.train_from_iterator(corpus, trainer)
    return tokenizer


# Example corpus for training
training_corpus = [
    "machine learning is transforming artificial intelligence research",
    "deep learning neural networks process complex data patterns",
    "natural language processing enables machines to understand text",
    "computer vision algorithms detect objects in images",
    "reinforcement learning agents learn through trial and error",
    "transformer architectures have revolutionized language models",
    "pre-trained models enable transfer learning across domains",
    "unsupervised learning discovers patterns without labeled data",
] * 100  # Repeat for meaningful statistics

# Train tokenizer
print("Training BPE tokenizer...")
tokenizer = train_bpe_tokenizer(training_corpus, vocab_size=500)

print(f"Vocabulary size: {tokenizer.get_vocab_size()}")

# Test tokenization
test_sentences = [
    "machine learning",    # Common terms
    "transformer models",  # May be split
    "superintelligence",   # Rare word
    "xyzabc123",           # Total OOV
]

print("\nTokenization examples:")
for sentence in test_sentences:
    output = tokenizer.encode(sentence)
    print(f"  '{sentence}'")
    print(f"    -> {output.tokens}")

# Show how OOV is handled
print("\nOOV Handling Demonstration:")
print("  'superintelligence' is unseen, but decomposes into:")
output = tokenizer.encode("superintelligence")
print(f"  {output.tokens}")
print("  Each subword was seen during training, so zero OOV!")
```

Managing vocabulary in production systems requires careful attention to versioning, consistency, and monitoring.
```python
import json
import hashlib
import re
from datetime import datetime
from pathlib import Path
from typing import Dict, Optional

import joblib
from sklearn.feature_extraction.text import CountVectorizer


class ProductionVocabulary:
    """
    Production-ready vocabulary manager with versioning and monitoring.
    """

    def __init__(
        self,
        vectorizer: CountVectorizer,
        metadata: Optional[Dict] = None
    ):
        self.vectorizer = vectorizer
        self.vocabulary = set(vectorizer.vocabulary_.keys())
        self.metadata = metadata or {}

        # Generate version hash from the sorted vocabulary
        vocab_str = ''.join(sorted(self.vocabulary))
        self.version_hash = hashlib.sha256(vocab_str.encode()).hexdigest()[:12]

        self.metadata.update({
            'version_hash': self.version_hash,
            'vocabulary_size': len(self.vocabulary),
            'created_at': datetime.now().isoformat(),
        })

    def save(self, directory: str) -> None:
        """Save vocabulary and vectorizer to directory."""
        path = Path(directory)
        path.mkdir(parents=True, exist_ok=True)

        # Save vectorizer
        joblib.dump(self.vectorizer, path / 'vectorizer.joblib')

        # Save vocabulary as text (for inspection)
        with open(path / 'vocabulary.txt', 'w') as f:
            for term in sorted(self.vocabulary):
                f.write(f"{term}\n")

        # Save metadata
        with open(path / 'metadata.json', 'w') as f:
            json.dump(self.metadata, f, indent=2)

        print(f"Saved vocabulary version {self.version_hash} to {directory}")

    @classmethod
    def load(cls, directory: str) -> 'ProductionVocabulary':
        """Load vocabulary from directory."""
        path = Path(directory)
        vectorizer = joblib.load(path / 'vectorizer.joblib')
        with open(path / 'metadata.json', 'r') as f:
            metadata = json.load(f)

        # Capture the saved hash before __init__ recomputes and overwrites it
        expected_hash = metadata.get('version_hash')
        loaded = cls(vectorizer, metadata)

        # Verify the recomputed hash matches the saved one
        if loaded.version_hash != expected_hash:
            raise ValueError(
                f"Vocabulary hash mismatch! Expected {expected_hash}, "
                f"got {loaded.version_hash}"
            )
        return loaded

    def check_oov(self, text: str) -> Dict:
        """Analyze OOV tokens in new text."""
        # Use same tokenization as vectorizer
        tokens = re.findall(self.vectorizer.token_pattern, text.lower())
        oov_tokens = [t for t in tokens if t not in self.vocabulary]
        return {
            'total_tokens': len(tokens),
            'oov_count': len(oov_tokens),
            'oov_rate': len(oov_tokens) / len(tokens) if tokens else 0,
            'oov_tokens': oov_tokens[:10],  # Sample
        }


# Demonstration
train_data = [
    "machine learning algorithms",
    "neural network architectures",
    "deep learning models",
    "natural language processing",
]

vectorizer = CountVectorizer(min_df=1)
vectorizer.fit(train_data)

vocab_manager = ProductionVocabulary(
    vectorizer=vectorizer,
    metadata={
        'training_samples': len(train_data),
        'model_name': 'text_classifier_v1',
    },
)

print("Production Vocabulary Manager")
print("=" * 50)
print(f"Version: {vocab_manager.version_hash}")
print(f"Size: {len(vocab_manager.vocabulary)} terms")

# Check OOV on new data
test_texts = [
    "machine learning is great",        # Mostly known
    "transformer models are powerful",  # Some OOV
    "quantum computing uses qubits",    # Mostly OOV
]

print("\nOOV Analysis on New Data:")
for text in test_texts:
    analysis = vocab_manager.check_oov(text)
    print(f"  '{text}'")
    print(f"    OOV rate: {analysis['oov_rate']:.1%}")
    print(f"    OOV tokens: {analysis['oov_tokens']}")
```

We've explored vocabulary construction comprehensively—from the multi-stage pipeline through tokenization strategies, pruning mechanisms, OOV handling, subword approaches, and production considerations. The key insights:

- Tokenization strategy (word, character, or subword) is the single most consequential vocabulary decision.
- Document-frequency pruning (min_df, max_df) removes noise and non-discriminative terms with minimal information loss.
- OOV is unavoidable: choose an explicit strategy (<UNK>, hashing, subwords) and monitor OOV rates in production.
- Subword tokenization achieves near-zero OOV and underpins modern neural NLP.
- Production vocabularies need versioning, consistency checks, and drift monitoring.
Looking Ahead:
With vocabulary construction mastered, the next page explores Sparse Representation—the data structures that make high-dimensional BoW vectors computationally tractable. Understanding sparsity is essential for working with text at scale, where naive dense representations quickly exhaust memory and computation budgets.
You now have comprehensive knowledge of vocabulary construction—the decisions that define your text feature space. From tokenization strategy through OOV handling to production management, you understand the full lifecycle of vocabulary in text ML systems. Next, we'll explore how sparse representations make these high-dimensional vocabularies computationally feasible.