Every sophisticated natural language processing system, from spam filters to sentiment analyzers to machine translation engines, begins with a deceptively simple question: How do we convert human language into numbers that machines can process?
The answer to this question forms the foundation of all text-based machine learning, and at the very base of this foundation lies the concept of unigrams—the atomic units of text representation that serve as the starting point for virtually every text feature engineering approach.
Unigrams represent the simplest possible decomposition of text: treating each individual token (typically a word) as an independent feature. Despite their simplicity, unigrams remain extraordinarily powerful and continue to be used in production systems at companies like Google, Amazon, and Netflix. Understanding unigrams deeply is essential before progressing to more complex n-gram representations.
By the end of this page, you will understand the theoretical and practical foundations of unigram features, including their mathematical formalization, implementation patterns, advantages and limitations, and how they relate to the broader landscape of text feature engineering. You'll be equipped to implement unigram-based features and make informed decisions about when to use them.
A unigram is a single token extracted from a text sequence. In the context of n-gram analysis, where 'n' represents the number of consecutive tokens considered together, a unigram corresponds to n=1. Each unigram is treated as an independent, atomic unit with no consideration of the tokens that precede or follow it.
Formal Definition:
Given a text sequence T consisting of tokens t₁, t₂, ..., tₖ, the set of unigrams U(T) is simply:
U(T) = {t₁, t₂, ..., tₖ}
The vocabulary V of a corpus C (a collection of documents) is the union of all unique unigrams across all documents:
V = ⋃_{T ∈ C} U(T)
Example:
Consider the sentence: "The quick brown fox jumps over the lazy dog"
The unigrams extracted from this sentence are: [the, quick, brown, fox, jumps, over, the, lazy, dog]
Note that "the" appears twice in the sequence. When constructing a unigram representation, we must decide whether to count each occurrence (a term-frequency representation) or record only presence or absence (a binary representation), as sketched below.
Unigrams make a fundamental assumption known as the bag-of-words hypothesis: the order of words in a document does not matter for the task at hand. While this assumption is clearly false for many linguistic phenomena, it works surprisingly well for numerous practical applications, particularly document classification and information retrieval.
Unigrams vs. Words:
While we often use "unigram" and "word" interchangeably, they are not identical concepts. Depending on the tokenizer used, a "unigram" might be a whole word, a punctuation mark, a number, a subword fragment (as produced by BPE or WordPiece), or even a single character.
The choice of tokenization fundamentally shapes what constitutes a unigram in your system.
To build robust text processing systems, we need a rigorous mathematical framework for representing unigrams. This formalization enables us to reason precisely about text representations and their properties.
Vocabulary Construction:
Given a corpus C = {d₁, d₂, ..., dₙ} containing N documents, the vocabulary V is the set of all unique tokens appearing in the corpus:
V = {w₁, w₂, ..., wₘ}
where M = |V| is the vocabulary size.
Vector Space Representation:
Each document d can be represented as a vector in ℝᴹ, where each dimension corresponds to a vocabulary term. The most common representations are:
| Representation | Mathematical Definition | Properties |
|---|---|---|
| Binary (One-Hot) | x_i = 1 if w_i ∈ d, else 0 | Simple; loses frequency information |
| Term Frequency (TF) | x_i = count(w_i, d) | Captures word importance; biased toward long documents |
| Normalized TF | x_i = count(w_i, d) / \|d\| | Length-independent; comparable across documents |
| Log-Normalized TF | x_i = 1 + log(count(w_i, d)) | Sublinear scaling; reduces impact of very frequent terms |
| TF-IDF | x_i = tf(w_i, d) × idf(w_i) | Balances local and global importance |
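To make the table concrete, here is a small sketch that computes the count-based variants by hand for a two-document toy corpus, assuming a whitespace tokenizer; the idf term uses one common variant (log of N over document frequency), which is an assumption rather than the only definition:

```python
import math
from collections import Counter

doc = "the quick brown fox jumps over the lazy dog".split()
corpus = [doc, "the lazy dog sleeps all day".split()]  # toy corpus, N = 2 documents

counts = Counter(doc)
N = len(corpus)

for term in ("the", "fox"):
    tf = counts[term]                         # raw term frequency
    norm_tf = tf / len(doc)                   # length-normalized TF
    log_tf = 1 + math.log(tf)                 # sublinear (log-normalized) TF
    df = sum(1 for d in corpus if term in d)  # document frequency
    idf = math.log(N / df)                    # one common idf variant (assumption)
    print(term, tf, round(norm_tf, 3), round(log_tf, 3), round(tf * idf, 3))
```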
The Document-Term Matrix:
For a corpus of N documents and vocabulary size M, we construct a document-term matrix A ∈ ℝᴺˣᴹ where:
A[i,j] = representation of term wⱼ in document dᵢ
This matrix is the foundation of most classical text analysis techniques, such as document classification and information retrieval.
Sparsity Characteristics:
Document-term matrices are typically extremely sparse: each document uses only a small fraction of the full vocabulary, so the vast majority of entries in any given row are zero. The sparsity ratio is often greater than 99%, which has critical implications for storage and computation.
```python
import numpy as np
from collections import Counter
from typing import List, Dict, Tuple
from scipy.sparse import csr_matrix


class UnigramVectorizer:
    """
    A from-scratch implementation of unigram vectorization
    demonstrating the mathematical foundations.
    """

    def __init__(self, max_vocab_size: int = None, min_doc_freq: int = 1):
        self.vocabulary_: Dict[str, int] = {}
        self.doc_freq_: Dict[str, int] = {}
        self.max_vocab_size = max_vocab_size
        self.min_doc_freq = min_doc_freq

    def _tokenize(self, text: str) -> List[str]:
        """Simple whitespace tokenization with lowercasing."""
        return text.lower().split()

    def fit(self, documents: List[str]) -> 'UnigramVectorizer':
        """
        Build vocabulary from corpus.

        Time Complexity: O(N * L) where N = num docs, L = avg doc length
        Space Complexity: O(V) where V = vocabulary size
        """
        # Count document frequencies for each term
        term_doc_freq = Counter()
        for doc in documents:
            # Use set to count each term once per document
            unique_terms = set(self._tokenize(doc))
            term_doc_freq.update(unique_terms)

        # Filter by minimum document frequency
        filtered_terms = [
            term for term, freq in term_doc_freq.items()
            if freq >= self.min_doc_freq
        ]

        # Sort by frequency (descending) and limit vocabulary
        sorted_terms = sorted(
            filtered_terms,
            key=lambda t: term_doc_freq[t],
            reverse=True
        )
        if self.max_vocab_size:
            sorted_terms = sorted_terms[:self.max_vocab_size]

        # Build vocabulary mapping
        self.vocabulary_ = {term: idx for idx, term in enumerate(sorted_terms)}
        self.doc_freq_ = {term: term_doc_freq[term] for term in sorted_terms}

        return self

    def transform_binary(self, documents: List[str]) -> csr_matrix:
        """Transform documents to binary unigram vectors."""
        rows, cols, data = [], [], []

        for doc_idx, doc in enumerate(documents):
            seen_terms = set()
            for term in self._tokenize(doc):
                if term in self.vocabulary_ and term not in seen_terms:
                    rows.append(doc_idx)
                    cols.append(self.vocabulary_[term])
                    data.append(1)
                    seen_terms.add(term)

        return csr_matrix(
            (data, (rows, cols)),
            shape=(len(documents), len(self.vocabulary_))
        )

    def transform_tf(self, documents: List[str]) -> csr_matrix:
        """Transform documents to term frequency vectors."""
        rows, cols, data = [], [], []

        for doc_idx, doc in enumerate(documents):
            term_counts = Counter(self._tokenize(doc))
            for term, count in term_counts.items():
                if term in self.vocabulary_:
                    rows.append(doc_idx)
                    cols.append(self.vocabulary_[term])
                    data.append(count)

        return csr_matrix(
            (data, (rows, cols)),
            shape=(len(documents), len(self.vocabulary_))
        )

    def analyze_sparsity(self, matrix: csr_matrix) -> Dict:
        """Analyze sparsity characteristics of the document-term matrix."""
        total_elements = matrix.shape[0] * matrix.shape[1]
        nonzero_elements = matrix.nnz

        return {
            'shape': matrix.shape,
            'total_elements': total_elements,
            'nonzero_elements': nonzero_elements,
            'sparsity_ratio': 1 - (nonzero_elements / total_elements),
            'avg_terms_per_doc': nonzero_elements / matrix.shape[0],
            'memory_dense_mb': (total_elements * 8) / (1024**2),  # float64
            'memory_sparse_mb': matrix.data.nbytes / (1024**2),
        }


# Demonstration
if __name__ == "__main__":
    # Sample corpus
    corpus = [
        "the quick brown fox jumps over the lazy dog",
        "the lazy dog sleeps all day",
        "the quick rabbit runs faster than the fox",
        "brown dogs are friendly animals",
    ]

    vectorizer = UnigramVectorizer(min_doc_freq=1)
    vectorizer.fit(corpus)

    print("Vocabulary:")
    for term, idx in sorted(vectorizer.vocabulary_.items(), key=lambda x: x[1]):
        print(f"  {idx}: '{term}' (doc_freq={vectorizer.doc_freq_[term]})")

    # Binary representation
    binary_matrix = vectorizer.transform_binary(corpus)
    print(f"\nBinary Matrix Shape: {binary_matrix.shape}")
    print(f"Sparsity: {vectorizer.analyze_sparsity(binary_matrix)}")

    # TF representation
    tf_matrix = vectorizer.transform_tf(corpus)
    print("\nTF Matrix (dense for visualization):")
    print(tf_matrix.toarray())
```

The choice of tokenization strategy fundamentally determines what constitutes a unigram in your system. This decision has far-reaching implications for model performance, vocabulary size, and the ability to handle out-of-vocabulary terms.
Common Tokenization Strategies:
- Whitespace Tokenization: split on whitespace only; simplest approach, but punctuation stays attached to words.
- Punctuation-Aware Tokenization: separate punctuation from words, typically with a regular expression.
- Rule-Based Tokenization (e.g., NLTK, spaCy): hand-crafted rules handle contractions, abbreviations, and other language-specific patterns.
- Subword Tokenization (BPE, WordPiece, SentencePiece): split rare words into frequent fragments, bounding vocabulary size and handling out-of-vocabulary words.
```python
import re
from typing import List, Dict
from collections import Counter


class TokenizationStrategies:
    """
    Demonstration of different tokenization approaches
    and their impact on unigram representation.
    """

    @staticmethod
    def whitespace_tokenize(text: str) -> List[str]:
        """Simplest approach: split on whitespace."""
        return text.split()

    @staticmethod
    def punctuation_aware_tokenize(text: str) -> List[str]:
        """Separate punctuation from words."""
        # Pattern: word characters OR punctuation
        pattern = r"\w+|[^\w\s]"
        return re.findall(pattern, text.lower())

    @staticmethod
    def nltk_tokenize(text: str) -> List[str]:
        """Use NLTK's word tokenizer (rule-based)."""
        try:
            from nltk.tokenize import word_tokenize
            return word_tokenize(text.lower())
        except ImportError:
            print("NLTK not installed, falling back to regex")
            return TokenizationStrategies.punctuation_aware_tokenize(text)

    @staticmethod
    def simple_bpe_tokenize(text: str, vocab: Dict[str, int]) -> List[str]:
        """
        Simplified BPE-like subword tokenization.
        Actual BPE requires training on a corpus.
        """
        words = text.lower().split()
        tokens = []

        for word in words:
            # Add word boundary marker
            word = word + '</w>'

            # Greedily match longest subwords from vocabulary
            i = 0
            while i < len(word):
                longest_match = None
                for j in range(len(word), i, -1):
                    subword = word[i:j]
                    if subword in vocab:
                        longest_match = subword
                        break

                if longest_match:
                    tokens.append(longest_match)
                    i += len(longest_match)
                else:
                    # Unknown character, add as single token
                    tokens.append(word[i])
                    i += 1

        return tokens


def compare_tokenization_strategies():
    """
    Compare how different strategies handle various text patterns.
    """
    test_cases = [
        "The quick brown fox jumps!",
        "I can't believe it's not butter.",
        "The price is $19.99 (20% off).",
        "Email me at john@example.com",
        "Running, runs, ran, runner",
        "TensorFlow2.0 is awesome!!!",
    ]

    strategies = {
        'Whitespace': TokenizationStrategies.whitespace_tokenize,
        'Punctuation-Aware': TokenizationStrategies.punctuation_aware_tokenize,
        'NLTK': TokenizationStrategies.nltk_tokenize,
    }

    for text in test_cases:
        print(f"\nInput: '{text}'")
        for name, tokenizer in strategies.items():
            tokens = tokenizer(text)
            print(f"  {name}: {tokens}")


def analyze_vocabulary_impact():
    """
    Analyze how tokenization affects vocabulary size.
    """
    # Simulated corpus (in practice, use a real corpus)
    corpus = [
        "Machine learning is transforming industries.",
        "Deep learning neural networks are powerful.",
        "Natural language processing enables chatbots.",
        "Computer vision systems can detect objects.",
        "Reinforcement learning trains game-playing agents.",
    ] * 100  # Simulate larger corpus

    results = {}
    for name, tokenizer in [
        ('Whitespace', TokenizationStrategies.whitespace_tokenize),
        ('Punctuation-Aware', TokenizationStrategies.punctuation_aware_tokenize),
    ]:
        all_tokens = []
        for doc in corpus:
            all_tokens.extend(tokenizer(doc))

        vocab = set(all_tokens)
        token_freq = Counter(all_tokens)

        results[name] = {
            'vocabulary_size': len(vocab),
            'total_tokens': len(all_tokens),
            'avg_token_freq': len(all_tokens) / len(vocab),
            'hapax_legomena': sum(1 for t, c in token_freq.items() if c == 1),
        }

    print("\nVocabulary Analysis:")
    for name, stats in results.items():
        print(f"  {name}:")
        for key, value in stats.items():
            print(f"    {key}: {value}")


if __name__ == "__main__":
    compare_tokenization_strategies()
    analyze_vocabulary_impact()
```

Managing the vocabulary is one of the most critical decisions in unigram-based systems.
The vocabulary determines the dimensionality of your feature space, the model's memory footprint, and its ability to generalize to new text.
The Vocabulary Size Dilemma:
Natural language has a heavy-tailed distribution of word frequencies (Zipf's Law). This creates a fundamental tension: a larger vocabulary captures rare but potentially informative terms at the cost of higher dimensionality, memory usage, and overfitting risk, while a smaller vocabulary is cheaper and more robust but discards information.
Vocabulary Filtering Strategies:
| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Minimum Document Frequency | Remove terms appearing in fewer than k documents | Removes typos, very rare terms | May remove important domain terms |
| Maximum Document Frequency | Remove terms appearing in more than p% of documents | Removes non-discriminative terms | May remove important common words |
| Stop Word Removal | Remove predefined common words | Reduces noise, smaller vocabulary | May lose meaning ("not", "no") |
| Maximum Vocabulary Size | Keep only top-k most frequent terms | Controlled dimensionality | Arbitrary cutoff |
| Minimum Term Frequency | Remove terms with total count < k | Removes very rare words | Different from doc frequency |
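Several of these strategies map directly onto CountVectorizer arguments covered later in this page; a brief sketch of the correspondence:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Each argument below corresponds to a row of the table above.
vectorizer = CountVectorizer(
    min_df=2,              # minimum document frequency (absolute count)
    max_df=0.9,            # maximum document frequency (ratio of documents)
    stop_words='english',  # stop word removal
    max_features=5000,     # maximum vocabulary size (top-k by frequency)
)
# Note: a minimum *term* frequency cutoff has no direct CountVectorizer
# argument; it must be implemented separately if needed.
```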
Zipf's Law and Its Implications:
Zipf's Law states that the frequency of a word is inversely proportional to its rank:
frequency(r) ∝ 1/r^α
where r is the rank and α ≈ 1 for natural language.
Implications for Vocabulary Management: a handful of very frequent terms accounts for a large share of all tokens, while a long tail of rare terms inflates the vocabulary with little coverage gain, so frequency-based cutoffs can remove many terms while sacrificing little token coverage (see the sketch below).
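A quick sketch of what the power law implies in practice, assuming an idealized Zipf distribution with α = 1 over a 10,000-term vocabulary:

```python
import numpy as np

# Idealized Zipf frequencies: frequency(r) ∝ 1/r for ranks 1..10,000
ranks = np.arange(1, 10_001)
freqs = 1.0 / ranks
freqs /= freqs.sum()  # normalize to token probabilities

coverage = np.cumsum(freqs)
for k in (100, 1000, 5000):
    print(f"top {k:>5} terms cover {coverage[k - 1]:.1%} of tokens")
# Under this idealized model, roughly half of all tokens come from
# only the first hundred terms, while the long tail adds little coverage.
```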
Heaps' Law:
V(n) ≈ K × n^β
where V(n) is vocabulary size after n tokens, K ≈ 10-100, and β ≈ 0.4-0.6.
This means doubling the corpus size does NOT double the vocabulary size: vocabulary growth is sublinear and slows as the corpus grows.
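A quick worked example under the stated ranges: taking β = 0.5,

V(2n) / V(n) = (2n)^β / n^β = 2^β ≈ 1.41

so doubling the corpus grows the vocabulary by only about 41% under that assumption.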
```python
import numpy as np
from collections import Counter
from typing import List, Set, Dict, Tuple
import matplotlib.pyplot as plt


class VocabularyManager:
    """
    Comprehensive vocabulary management for unigram systems.
    Implements various filtering strategies and analysis methods.
    """

    def __init__(self):
        self.term_freq: Counter = Counter()
        self.doc_freq: Counter = Counter()
        self.num_documents: int = 0

    def fit(self, documents: List[List[str]]) -> 'VocabularyManager':
        """
        Build vocabulary statistics from tokenized documents.

        Args:
            documents: List of tokenized documents (list of token lists)
        """
        self.num_documents = len(documents)

        for doc in documents:
            # Count term frequencies
            self.term_freq.update(doc)
            # Count document frequencies (each term once per doc)
            self.doc_freq.update(set(doc))

        return self

    def filter_vocabulary(
        self,
        min_doc_freq: int = 1,
        max_doc_freq_ratio: float = 1.0,
        min_term_freq: int = 1,
        max_vocab_size: int = None,
        stop_words: Set[str] = None,
    ) -> List[str]:
        """
        Apply multiple filtering strategies to build final vocabulary.

        Returns:
            Ordered list of vocabulary terms
        """
        candidates = set(self.term_freq.keys())

        # Apply filters
        if stop_words:
            candidates -= stop_words

        max_doc_freq = int(self.num_documents * max_doc_freq_ratio)

        candidates = {
            term for term in candidates
            if (self.doc_freq[term] >= min_doc_freq and
                self.doc_freq[term] <= max_doc_freq and
                self.term_freq[term] >= min_term_freq)
        }

        # Sort by frequency and limit size
        sorted_terms = sorted(
            candidates,
            key=lambda t: self.term_freq[t],
            reverse=True
        )

        if max_vocab_size:
            sorted_terms = sorted_terms[:max_vocab_size]

        return sorted_terms

    def analyze_zipf_distribution(self) -> Dict:
        """
        Analyze how well the vocabulary follows Zipf's Law.
        """
        # Sort terms by frequency
        sorted_freqs = sorted(self.term_freq.values(), reverse=True)
        ranks = np.arange(1, len(sorted_freqs) + 1)
        freqs = np.array(sorted_freqs)

        # Fit Zipf's Law: log(freq) = -α * log(rank) + log(C)
        log_ranks = np.log(ranks)
        log_freqs = np.log(freqs + 1)  # +1 to avoid log(0)

        # Linear regression in log-log space
        coeffs = np.polyfit(log_ranks, log_freqs, 1)
        alpha = -coeffs[0]

        # Calculate R² for fit quality
        predicted = coeffs[0] * log_ranks + coeffs[1]
        ss_res = np.sum((log_freqs - predicted) ** 2)
        ss_tot = np.sum((log_freqs - np.mean(log_freqs)) ** 2)
        r_squared = 1 - (ss_res / ss_tot)

        return {
            'zipf_alpha': alpha,
            'r_squared': r_squared,
            'vocabulary_size': len(sorted_freqs),
            'total_tokens': sum(sorted_freqs),
            'hapax_count': sum(1 for f in sorted_freqs if f == 1),
            'hapax_ratio': sum(1 for f in sorted_freqs if f == 1) / len(sorted_freqs),
        }

    def analyze_coverage(self, vocab_sizes: List[int]) -> Dict[int, float]:
        """
        Analyze what percentage of tokens are covered by top-k vocabulary.
        """
        sorted_freqs = sorted(self.term_freq.values(), reverse=True)
        total_tokens = sum(sorted_freqs)

        cumsum = np.cumsum(sorted_freqs)

        coverage = {}
        for size in vocab_sizes:
            if size <= len(sorted_freqs):
                coverage[size] = cumsum[size - 1] / total_tokens
            else:
                coverage[size] = 1.0

        return coverage


def demonstrate_vocabulary_management():
    """
    Demonstrate vocabulary management concepts.
    """
    # Simulated tokenized corpus
    np.random.seed(42)

    # Create corpus following Zipf distribution
    vocab = [f"word_{i}" for i in range(10000)]
    zipf_probs = 1 / np.arange(1, 10001)
    zipf_probs /= zipf_probs.sum()

    documents = []
    for _ in range(1000):
        doc_length = np.random.randint(50, 200)
        doc = list(np.random.choice(vocab, size=doc_length, p=zipf_probs))
        documents.append(doc)

    # Build vocabulary manager
    manager = VocabularyManager()
    manager.fit(documents)

    # Analyze Zipf distribution
    zipf_stats = manager.analyze_zipf_distribution()
    print("Zipf Distribution Analysis:")
    for key, value in zipf_stats.items():
        print(f"  {key}: {value:.4f}" if isinstance(value, float) else f"  {key}: {value}")

    # Analyze coverage
    coverage = manager.analyze_coverage([100, 500, 1000, 2000, 5000])
    print("\nVocabulary Coverage:")
    for size, cov in coverage.items():
        print(f"  Top {size} terms cover {cov*100:.1f}% of tokens")

    # Compare filtering strategies
    print("\nFiltering Strategy Comparison:")
    strategies = [
        {'name': 'No Filtering', 'params': {}},
        {'name': 'Min Doc Freq 5', 'params': {'min_doc_freq': 5}},
        {'name': 'Max Doc Freq 50%', 'params': {'max_doc_freq_ratio': 0.5}},
        {'name': 'Combined', 'params': {'min_doc_freq': 3, 'max_doc_freq_ratio': 0.8}},
        {'name': 'Limited 1000', 'params': {'max_vocab_size': 1000}},
    ]

    for strategy in strategies:
        vocab = manager.filter_vocabulary(**strategy['params'])
        print(f"  {strategy['name']}: {len(vocab)} terms")


if __name__ == "__main__":
    demonstrate_vocabulary_management()
```

While understanding the underlying mechanics is essential, production systems typically use well-tested libraries like scikit-learn. The CountVectorizer class provides a complete unigram feature extraction pipeline with extensive configuration options.
Key Parameters:
- `max_features`: Maximum vocabulary size
- `min_df`: Minimum document frequency (int) or ratio (float)
- `max_df`: Maximum document frequency ratio
- `stop_words`: Stop word list or language identifier
- `binary`: Whether to use binary (True) or count (False) features
- `tokenizer`: Custom tokenization function
- `preprocessor`: Custom preprocessing function
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
import numpy as np
from typing import List, Tuple


def comprehensive_unigram_example():
    """
    Complete example of unigram-based text classification.
    """
    # Sample dataset (sentiment analysis)
    texts = [
        "This movie was absolutely fantastic and amazing",
        "I loved every moment of this wonderful film",
        "Terrible waste of time, completely boring",
        "The worst film I have ever seen, awful",
        "Great acting, brilliant storyline, highly recommend",
        "Disappointing, slow, and not worth watching",
        "A masterpiece of modern cinema, outstanding",
        "Complete garbage, do not waste your money",
        "Beautiful cinematography and excellent performances",
        "Boring plot with terrible acting throughout",
    ]
    labels = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]  # 1=positive, 0=negative

    # Basic unigram vectorizer
    basic_vectorizer = CountVectorizer()
    X_basic = basic_vectorizer.fit_transform(texts)

    print("Basic Vectorizer:")
    print(f"  Vocabulary size: {len(basic_vectorizer.vocabulary_)}")
    print(f"  Feature matrix shape: {X_basic.shape}")
    print(f"  Sparsity: {1 - X_basic.nnz / (X_basic.shape[0] * X_basic.shape[1]):.2%}")

    # Configured vectorizer with filtering
    configured_vectorizer = CountVectorizer(
        lowercase=True,
        stop_words='english',
        min_df=2,          # Appear in at least 2 documents
        max_df=0.8,        # Appear in at most 80% of documents
        max_features=100,  # Limit vocabulary
        binary=False,      # Use counts, not binary
    )
    X_configured = configured_vectorizer.fit_transform(texts)

    print("\nConfigured Vectorizer:")
    print(f"  Vocabulary size: {len(configured_vectorizer.vocabulary_)}")
    print(f"  Feature matrix shape: {X_configured.shape}")
    print(f"  Vocabulary: {list(configured_vectorizer.vocabulary_.keys())}")

    # Build classification pipeline
    pipeline = Pipeline([
        ('vectorizer', CountVectorizer(
            lowercase=True,
            stop_words='english',
            min_df=1,
            max_features=50,
        )),
        ('classifier', LogisticRegression(random_state=42))
    ])

    # Note: In practice, you'd use a proper train/test split
    # This is just for demonstration
    pipeline.fit(texts, labels)

    # Analyze feature importance
    feature_names = pipeline.named_steps['vectorizer'].get_feature_names_out()
    coefficients = pipeline.named_steps['classifier'].coef_[0]

    # Sort features by coefficient magnitude
    feature_importance = sorted(
        zip(feature_names, coefficients),
        key=lambda x: abs(x[1]),
        reverse=True
    )

    print("\nTop Features by Importance:")
    print("  Positive Indicators:")
    for name, coef in feature_importance[:5]:
        if coef > 0:
            print(f"    '{name}': {coef:.3f}")
    print("  Negative Indicators:")
    for name, coef in feature_importance[:5]:
        if coef < 0:
            print(f"    '{name}': {coef:.3f}")

    # Test on new examples
    test_texts = [
        "This was a wonderful experience",
        "Absolutely terrible, would not recommend",
    ]
    predictions = pipeline.predict(test_texts)
    probabilities = pipeline.predict_proba(test_texts)

    print("\nPredictions on New Data:")
    for text, pred, prob in zip(test_texts, predictions, probabilities):
        sentiment = "Positive" if pred == 1 else "Negative"
        print(f"  '{text[:40]}...'")
        print(f"    Prediction: {sentiment} (confidence: {max(prob):.2%})")

    return pipeline


def analyze_vocabulary_statistics():
    """
    Detailed analysis of vocabulary characteristics.
    """
    # Larger sample for better statistics
    from sklearn.datasets import fetch_20newsgroups

    try:
        # Fetch a subset of 20 newsgroups
        newsgroups = fetch_20newsgroups(
            subset='train',
            categories=['sci.space', 'comp.graphics'],
            remove=('headers', 'footers', 'quotes')
        )
        texts = newsgroups.data[:500]
    except Exception:
        # Fallback if dataset not available
        texts = ["Sample text " * 50] * 100

    # Analyze at different vocabulary sizes
    vocab_sizes = [100, 500, 1000, 2000, 5000, None]

    print("Vocabulary Size Analysis:")
    for max_features in vocab_sizes:
        vectorizer = CountVectorizer(
            max_features=max_features,
            stop_words='english',
            min_df=2,
        )
        X = vectorizer.fit_transform(texts)

        actual_vocab = len(vectorizer.vocabulary_)
        nonzero_ratio = X.nnz / (X.shape[0] * X.shape[1])
        avg_terms = X.nnz / X.shape[0]

        print(f"\n  Max Features: {max_features or 'Unlimited'}")
        print(f"    Actual vocabulary: {actual_vocab}")
        print(f"    Non-zero ratio: {nonzero_ratio:.2%}")
        print(f"    Avg terms per doc: {avg_terms:.1f}")


if __name__ == "__main__":
    comprehensive_unigram_example()
    print("\n" + "="*60 + "\n")
    analyze_vocabulary_statistics()
```

Understanding when unigrams are appropriate—and when they fall short—is essential for effective feature engineering. Let's examine both sides with concrete examples.
Consider these sentences: "The movie was not good" and "The movie was good."
With unigrams, their representations are nearly identical: {the, movie, was, not, good} and {the, movie, was, good}. The only difference is the presence of "not"; without context, the model must learn that "not" combined with "good" signals negative sentiment. This works sometimes, but fails when negation is separated from its target or when context is more nuanced.
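A brief sketch of this effect with scikit-learn's CountVectorizer (its default tokenizer drops single-character tokens, which does not matter for these sentences):

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["The movie was not good", "The movie was good"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())  # ['good' 'movie' 'not' 'the' 'was']
print(X.toarray())
# [[1 1 1 1 1]
#  [1 1 0 1 1]]  <- the two rows differ only in the 'not' column
```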
When Unigrams Work Well: topic and document classification, spam filtering, and information retrieval, where the mere presence of indicative words carries most of the signal.
When Unigrams Struggle: negation and other order-dependent constructions, multi-word phrases and idioms, and tasks where nuance or longer-range context matters, as in the example above.
Deploying unigram-based systems in production introduces considerations that don't appear in toy examples. Memory efficiency, vocabulary consistency, and handling new words all require careful design.
Key Production Challenges: keeping memory usage under control for large vocabularies, ensuring the training-time vocabulary is applied consistently at inference time, and handling out-of-vocabulary words that only appear after deployment.
```python
import pickle
import hashlib
from pathlib import Path
from typing import Dict, Optional
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
import numpy as np


class ProductionUnigramVectorizer:
    """
    Production-ready unigram vectorizer with versioning,
    serialization, and OOV handling.
    """

    def __init__(
        self,
        max_features: int = 10000,
        min_df: int = 5,
        handle_oov: str = 'ignore',  # 'ignore', 'unk', 'error'
    ):
        self.max_features = max_features
        self.min_df = min_df
        self.handle_oov = handle_oov

        self._vectorizer = CountVectorizer(
            max_features=max_features,
            min_df=min_df,
            lowercase=True,
            stop_words='english',
        )
        self._version: Optional[str] = None
        self._is_fitted: bool = False

    def fit(self, documents: list) -> 'ProductionUnigramVectorizer':
        """Fit vectorizer and compute version hash."""
        self._vectorizer.fit(documents)
        self._is_fitted = True

        # Create version hash from vocabulary
        vocab_str = '|'.join(sorted(self._vectorizer.vocabulary_.keys()))
        self._version = hashlib.sha256(vocab_str.encode()).hexdigest()[:12]

        return self

    def transform(self, documents: list) -> np.ndarray:
        """Transform with OOV handling."""
        if not self._is_fitted:
            raise ValueError("Vectorizer must be fitted before transform")

        if self.handle_oov == 'error':
            # Check for OOV terms
            for doc in documents:
                tokens = doc.lower().split()
                for token in tokens:
                    if token not in self._vectorizer.vocabulary_:
                        raise ValueError(f"OOV term: {token}")

        return self._vectorizer.transform(documents)

    @property
    def version(self) -> str:
        """Get vocabulary version hash."""
        return self._version

    @property
    def vocabulary_size(self) -> int:
        return len(self._vectorizer.vocabulary_)

    def save(self, path: Path) -> None:
        """Serialize vectorizer to disk."""
        path = Path(path)

        # Save with version in filename
        filename = f"unigram_vectorizer_{self._version}.pkl"
        with open(path / filename, 'wb') as f:
            pickle.dump({
                'vectorizer': self._vectorizer,
                'version': self._version,
                'config': {
                    'max_features': self.max_features,
                    'min_df': self.min_df,
                    'handle_oov': self.handle_oov,
                }
            }, f)

        print(f"Saved vectorizer to {path / filename}")

    @classmethod
    def load(cls, path: Path) -> 'ProductionUnigramVectorizer':
        """Load serialized vectorizer."""
        with open(path, 'rb') as f:
            data = pickle.load(f)

        instance = cls(**data['config'])
        instance._vectorizer = data['vectorizer']
        instance._version = data['version']
        instance._is_fitted = True

        return instance


class FeatureHashingVectorizer:
    """
    Feature hashing alternative for unbounded vocabulary.

    Advantages:
    - No explicit vocabulary storage
    - Handles OOV naturally
    - Constant memory regardless of corpus size

    Disadvantages:
    - Hash collisions reduce accuracy
    - Cannot reverse-map features to words
    - Non-interpretable features
    """

    def __init__(self, n_features: int = 2**18):
        self.n_features = n_features
        self._vectorizer = HashingVectorizer(
            n_features=n_features,
            alternate_sign=True,  # Reduce collision impact
            lowercase=True,
            stop_words='english',
        )

    def transform(self, documents: list) -> np.ndarray:
        """Transform using feature hashing."""
        return self._vectorizer.transform(documents)

    def estimate_collision_rate(self, vocabulary_size: int) -> float:
        """
        Estimate probability of collision.

        Using birthday problem approximation:
        P(collision) ≈ 1 - e^(-n² / 2m)
        where n = vocabulary size, m = number of hash buckets
        """
        n = vocabulary_size
        m = self.n_features

        # More accurate approximation
        collision_prob = 1 - np.exp(-n * (n - 1) / (2 * m))
        return collision_prob


def demonstrate_production_patterns():
    """Demonstrate production deployment patterns."""

    # Training phase
    training_corpus = [
        "Machine learning models require careful feature engineering",
        "Deep learning has revolutionized natural language processing",
        "Feature engineering remains important for classical ML models",
        "Neural networks can learn representations automatically",
        "Traditional ML methods still outperform deep learning on small data",
    ]

    # Create and fit vectorizer
    vectorizer = ProductionUnigramVectorizer(
        max_features=100,
        min_df=1,
        handle_oov='ignore'
    )
    vectorizer.fit(training_corpus)

    print(f"Vocabulary Version: {vectorizer.version}")
    print(f"Vocabulary Size: {vectorizer.vocabulary_size}")

    # Inference phase
    inference_texts = [
        "Feature engineering is crucial for machine learning",
        "This contains completely unknown words like quantum and blockchain",
    ]

    X = vectorizer.transform(inference_texts)
    print(f"\nInference matrix shape: {X.shape}")
    print(f"Non-zero features doc 1: {X[0].nnz}")
    print(f"Non-zero features doc 2: {X[1].nnz}")  # Fewer due to OOV

    # Feature hashing comparison
    print("\n--- Feature Hashing Alternative ---")
    hasher = FeatureHashingVectorizer(n_features=2**16)
    X_hashed = hasher.transform(inference_texts)
    print(f"Hashed matrix shape: {X_hashed.shape}")
    print(f"Estimated collision rate (1000 vocab): {hasher.estimate_collision_rate(1000):.4%}")
    print(f"Estimated collision rate (10000 vocab): {hasher.estimate_collision_rate(10000):.4%}")


if __name__ == "__main__":
    demonstrate_production_patterns()
```

We've explored unigrams comprehensively, from their mathematical foundations to production deployment considerations. The key insights:

- Unigrams treat each token as an independent feature under the bag-of-words assumption.
- The representation choice (binary, TF, normalized TF, TF-IDF) determines how frequency information is encoded.
- Document-term matrices are extremely sparse, so sparse data structures are essential.
- Tokenization and vocabulary filtering, guided by Zipf's and Heaps' Laws, largely determine feature quality and dimensionality.
- Production systems need vocabulary versioning, consistent serialization, and a strategy for out-of-vocabulary terms (or feature hashing).
Looking Ahead: From Unigrams to Bigrams
Unigrams' fundamental limitation—ignoring word order and context—motivates the extension to bigrams: pairs of consecutive tokens. In the next page, we'll see how bigrams capture local context, enable phrase detection, and address many limitations of the pure unigram approach.
The progression from unigrams to bigrams to general n-grams represents the natural evolution of text feature engineering: each step captures more linguistic structure while increasing computational complexity. Understanding this tradeoff is essential for effective NLP system design.
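As a small preview, and using the same whitespace tokenization assumed earlier, extracting bigrams is a one-line extension of unigram extraction:

```python
tokens = "the quick brown fox".lower().split()

unigrams = tokens
bigrams = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]

print(unigrams)  # ['the', 'quick', 'brown', 'fox']
print(bigrams)   # [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```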
You now have a deep understanding of unigrams as the foundation of text feature engineering. You understand their mathematical formalization, implementation patterns, vocabulary management strategies, and production considerations. Next, we'll explore bigrams and how they capture the local context that unigrams miss.