Every sophisticated natural language processing system, from spam filters to sentiment analyzers to machine translation engines, begins with a deceptively simple question: How do we convert human language into numbers that machines can process?
The answer to this question forms the foundation of all text-based machine learning, and at the very base of this foundation lies the concept of unigrams—the atomic units of text representation that serve as the starting point for virtually every text feature engineering approach.
Unigrams represent the simplest possible decomposition of text: treating each individual token (typically a word) as an independent feature. Despite their simplicity, unigrams remain extraordinarily powerful and continue to be used in production systems at companies like Google, Amazon, and Netflix. Understanding unigrams deeply is essential before progressing to more complex n-gram representations.
By the end of this page, you will understand the theoretical and practical foundations of unigram features, including their mathematical formalization, implementation patterns, advantages and limitations, and how they relate to the broader landscape of text feature engineering. You'll be equipped to implement unigram-based features and make informed decisions about when to use them.
A unigram is a single token extracted from a text sequence. In the context of n-gram analysis, where 'n' represents the number of consecutive tokens considered together, a unigram corresponds to n=1. Each unigram is treated as an independent, atomic unit with no consideration of the tokens that precede or follow it.
Formal Definition:
Given a text sequence T consisting of tokens t₁, t₂, ..., tₖ, the set of unigrams U(T) is simply:
U(T) = {t₁, t₂, ..., tₖ}
The vocabulary V of a corpus C (a collection of documents) is the union of all unique unigrams across all documents:
V = ⋃_{T ∈ C} U(T)
Example:
Consider the sentence: "The quick brown fox jumps over the lazy dog"
The unigrams extracted from this sentence are: [the, quick, brown, fox, jumps, over, the, lazy, dog]
Note that "the" appears twice in the sequence. When constructing a unigram representation, we must decide whether to count each occurrence (a term-frequency representation) or record only presence or absence (a binary representation), as sketched below.
Unigrams make a fundamental assumption known as the bag-of-words hypothesis: the order of words in a document does not matter for the task at hand. While this assumption is clearly false for many linguistic phenomena, it works surprisingly well for numerous practical applications, particularly document classification and information retrieval.
Unigrams vs. Words:
While we often use "unigram" and "word" interchangeably, they are not identical concepts. Depending on the tokenizer used, a "unigram" might be a whole word, a punctuation mark, a number, a subword fragment (as produced by BPE or WordPiece), or even a single character.
The choice of tokenization fundamentally shapes what constitutes a unigram in your system.
To build robust text processing systems, we need a rigorous mathematical framework for representing unigrams. This formalization enables us to reason precisely about text representations and their properties.
Vocabulary Construction:
Given a corpus C = {d₁, d₂, ..., dₙ} containing N documents, the vocabulary V is the set of all unique tokens appearing in the corpus:
V = {w₁, w₂, ..., wₘ}
where M = |V| is the vocabulary size.
Vector Space Representation:
Each document d can be represented as a vector in ℝᴹ, where each dimension corresponds to a vocabulary term. The most common representations are:
| Representation | Mathematical Definition | Properties |
|---|---|---|
| Binary (One-Hot) | x_i = 1 if w_i ∈ d, else 0 | Simple; loses frequency information |
| Term Frequency (TF) | x_i = count(w_i, d) | Captures word importance; biased toward long documents |
| Normalized TF | x_i = count(w_i, d) / \|d\| | Length-independent; comparable across documents |
| Log-Normalized TF | x_i = 1 + log(count(w_i, d)) | Sublinear scaling; reduces impact of very frequent terms |
| TF-IDF | x_i = tf(w_i, d) × idf(w_i) | Balances local and global importance |
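To make the table concrete, here is a small sketch that computes the count-based variants by hand for a two-document toy corpus, assuming a whitespace tokenizer; the idf term uses one common variant (log of N over document frequency), which is an assumption rather than the only definition:

```python
import math
from collections import Counter

doc = "the quick brown fox jumps over the lazy dog".split()
corpus = [doc, "the lazy dog sleeps all day".split()]  # toy corpus, N = 2 documents

counts = Counter(doc)
N = len(corpus)

for term in ("the", "fox"):
    tf = counts[term]                         # raw term frequency
    norm_tf = tf / len(doc)                   # length-normalized TF
    log_tf = 1 + math.log(tf)                 # sublinear (log-normalized) TF
    df = sum(1 for d in corpus if term in d)  # document frequency
    idf = math.log(N / df)                    # one common idf variant (assumption)
    print(term, tf, round(norm_tf, 3), round(log_tf, 3), round(tf * idf, 3))
```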
The Document-Term Matrix:
For a corpus of N documents and vocabulary size M, we construct a document-term matrix A ∈ ℝᴺˣᴹ where:
A[i,j] = representation of term wⱼ in document dᵢ
This matrix is the foundation of most classical text analysis techniques, such as document classification and information retrieval.
Sparsity Characteristics:
Document-term matrices are typically extremely sparse: each document uses only a small fraction of the full vocabulary, so the vast majority of entries in any given row are zero. The sparsity ratio is often greater than 99%, which has critical implications for storage and computation.
```python
import numpy as np
from collections import Counter
from typing import List, Dict, Tuple
from scipy.sparse import csr_matrix


class UnigramVectorizer:
    """
    A from-scratch implementation of unigram vectorization
    demonstrating the mathematical foundations.
    """

    def __init__(self, max_vocab_size: int = None, min_doc_freq: int = 1):
        self.vocabulary_: Dict[str, int] = {}
        self.doc_freq_: Dict[str, int] = {}
        self.max_vocab_size = max_vocab_size
        self.min_doc_freq = min_doc_freq

    def _tokenize(self, text: str) -> List[str]:
        """Simple whitespace tokenization with lowercasing."""
        return text.lower().split()

    def fit(self, documents: List[str]) -> 'UnigramVectorizer':
        """
        Build vocabulary from corpus.

        Time Complexity: O(N * L) where N = num docs, L = avg doc length
        Space Complexity: O(V) where V = vocabulary size
        """
        # Count document frequencies for each term
        term_doc_freq = Counter()
        for doc in documents:
            # Use set to count each term once per document
            unique_terms = set(self._tokenize(doc))
            term_doc_freq.update(unique_terms)

        # Filter by minimum document frequency
        filtered_terms = [
            term for term, freq in term_doc_freq.items()
            if freq >= self.min_doc_freq
        ]

        # Sort by frequency (descending) and limit vocabulary
        sorted_terms = sorted(
            filtered_terms,
            key=lambda t: term_doc_freq[t],
            reverse=True
        )
        if self.max_vocab_size:
            sorted_terms = sorted_terms[:self.max_vocab_size]

        # Build vocabulary mapping
        self.vocabulary_ = {term: idx for idx, term in enumerate(sorted_terms)}
        self.doc_freq_ = {term: term_doc_freq[term] for term in sorted_terms}

        return self

    def transform_binary(self, documents: List[str]) -> csr_matrix:
        """Transform documents to binary unigram vectors."""
        rows, cols, data = [], [], []

        for doc_idx, doc in enumerate(documents):
            seen_terms = set()
            for term in self._tokenize(doc):
                if term in self.vocabulary_ and term not in seen_terms:
                    rows.append(doc_idx)
                    cols.append(self.vocabulary_[term])
                    data.append(1)
                    seen_terms.add(term)

        return csr_matrix(
            (data, (rows, cols)),
            shape=(len(documents), len(self.vocabulary_))
        )

    def transform_tf(self, documents: List[str]) -> csr_matrix:
        """Transform documents to term frequency vectors."""
        rows, cols, data = [], [], []

        for doc_idx, doc in enumerate(documents):
            term_counts = Counter(self._tokenize(doc))
            for term, count in term_counts.items():
                if term in self.vocabulary_:
                    rows.append(doc_idx)
                    cols.append(self.vocabulary_[term])
                    data.append(count)

        return csr_matrix(
            (data, (rows, cols)),
            shape=(len(documents), len(self.vocabulary_))
        )

    def analyze_sparsity(self, matrix: csr_matrix) -> Dict:
        """Analyze sparsity characteristics of the document-term matrix."""
        total_elements = matrix.shape[0] * matrix.shape[1]
        nonzero_elements = matrix.nnz

        return {
            'shape': matrix.shape,
            'total_elements': total_elements,
            'nonzero_elements': nonzero_elements,
            'sparsity_ratio': 1 - (nonzero_elements / total_elements),
            'avg_terms_per_doc': nonzero_elements / matrix.shape[0],
            'memory_dense_mb': (total_elements * 8) / (1024**2),  # float64
            'memory_sparse_mb': matrix.data.nbytes / (1024**2),
        }


# Demonstration
if __name__ == "__main__":
    # Sample corpus
    corpus = [
        "the quick brown fox jumps over the lazy dog",
        "the lazy dog sleeps all day",
        "the quick rabbit runs faster than the fox",
        "brown dogs are friendly animals",
    ]

    vectorizer = UnigramVectorizer(min_doc_freq=1)
    vectorizer.fit(corpus)

    print("Vocabulary:")
    for term, idx in sorted(vectorizer.vocabulary_.items(), key=lambda x: x[1]):
        print(f"  {idx}: '{term}' (doc_freq={vectorizer.doc_freq_[term]})")

    # Binary representation
    binary_matrix = vectorizer.transform_binary(corpus)
    print(f"\nBinary Matrix Shape: {binary_matrix.shape}")
    print(f"Sparsity: {vectorizer.analyze_sparsity(binary_matrix)}")

    # TF representation
    tf_matrix = vectorizer.transform_tf(corpus)
    print("\nTF Matrix (dense for visualization):")
    print(tf_matrix.toarray())
```

The choice of tokenization strategy fundamentally determines what constitutes a unigram in your system. This decision has far-reaching implications for model performance, vocabulary size, and the ability to handle out-of-vocabulary terms.
Common Tokenization Strategies:
- Whitespace Tokenization: split on whitespace only; simplest approach, but punctuation stays attached to words.
- Punctuation-Aware Tokenization: separate punctuation from words, typically with a regular expression.
- Rule-Based Tokenization (e.g., NLTK, spaCy): hand-crafted rules handle contractions, abbreviations, and other language-specific patterns.
- Subword Tokenization (BPE, WordPiece, SentencePiece): split rare words into frequent fragments, bounding vocabulary size and handling out-of-vocabulary words.
```python
import re
from typing import List, Dict
from collections import Counter


class TokenizationStrategies:
    """
    Demonstration of different tokenization approaches
    and their impact on unigram representation.
    """

    @staticmethod
    def whitespace_tokenize(text: str) -> List[str]:
        """Simplest approach: split on whitespace."""
        return text.split()

    @staticmethod
    def punctuation_aware_tokenize(text: str) -> List[str]:
        """Separate punctuation from words."""
        # Pattern: word characters OR punctuation
        pattern = r"\w+|[^\w\s]"
        return re.findall(pattern, text.lower())

    @staticmethod
    def nltk_tokenize(text: str) -> List[str]:
        """Use NLTK's word tokenizer (rule-based)."""
        try:
            from nltk.tokenize import word_tokenize
            return word_tokenize(text.lower())
        except ImportError:
            print("NLTK not installed, falling back to regex")
            return TokenizationStrategies.punctuation_aware_tokenize(text)

    @staticmethod
    def simple_bpe_tokenize(text: str, vocab: Dict[str, int]) -> List[str]:
        """
        Simplified BPE-like subword tokenization.
        Actual BPE requires training on a corpus.
        """
        words = text.lower().split()
        tokens = []

        for word in words:
            # Add word boundary marker
            word = word + '</w>'

            # Greedily match longest subwords from vocabulary
            i = 0
            while i < len(word):
                longest_match = None
                for j in range(len(word), i, -1):
                    subword = word[i:j]
                    if subword in vocab:
                        longest_match = subword
                        break

                if longest_match:
                    tokens.append(longest_match)
                    i += len(longest_match)
                else:
                    # Unknown character, add as single token
                    tokens.append(word[i])
                    i += 1

        return tokens


def compare_tokenization_strategies():
    """
    Compare how different strategies handle various text patterns.
    """
    test_cases = [
        "The quick brown fox jumps!",
        "I can't believe it's not butter.",
        "The price is $19.99 (20% off).",
        "Email me at john@example.com",
        "Running, runs, ran, runner",
        "TensorFlow2.0 is awesome!!!",
    ]

    strategies = {
        'Whitespace': TokenizationStrategies.whitespace_tokenize,
        'Punctuation-Aware': TokenizationStrategies.punctuation_aware_tokenize,
        'NLTK': TokenizationStrategies.nltk_tokenize,
    }

    for text in test_cases:
        print(f"\nInput: '{text}'")
        for name, tokenizer in strategies.items():
            tokens = tokenizer(text)
            print(f"  {name}: {tokens}")


def analyze_vocabulary_impact():
    """
    Analyze how tokenization affects vocabulary size.
    """
    # Simulated corpus (in practice, use a real corpus)
    corpus = [
        "Machine learning is transforming industries.",
        "Deep learning neural networks are powerful.",
        "Natural language processing enables chatbots.",
        "Computer vision systems can detect objects.",
        "Reinforcement learning trains game-playing agents.",
    ] * 100  # Simulate larger corpus

    results = {}
    for name, tokenizer in [
        ('Whitespace', TokenizationStrategies.whitespace_tokenize),
        ('Punctuation-Aware', TokenizationStrategies.punctuation_aware_tokenize),
    ]:
        all_tokens = []
        for doc in corpus:
            all_tokens.extend(tokenizer(doc))

        vocab = set(all_tokens)
        token_freq = Counter(all_tokens)

        results[name] = {
            'vocabulary_size': len(vocab),
            'total_tokens': len(all_tokens),
            'avg_token_freq': len(all_tokens) / len(vocab),
            'hapax_legomena': sum(1 for t, c in token_freq.items() if c == 1),
        }

    print("\nVocabulary Analysis:")
    for name, stats in results.items():
        print(f"  {name}:")
        for key, value in stats.items():
            print(f"    {key}: {value}")


if __name__ == "__main__":
    compare_tokenization_strategies()
    analyze_vocabulary_impact()
```

Managing the vocabulary is one of the most critical decisions in unigram-based systems.
The vocabulary determines the dimensionality of your feature space, the model's memory footprint, and its ability to generalize to new text.
The Vocabulary Size Dilemma:
Natural language has a heavy-tailed distribution of word frequencies (Zipf's Law). This creates a fundamental tension: a larger vocabulary captures rare but potentially informative terms at the cost of higher dimensionality, memory usage, and overfitting risk, while a smaller vocabulary is cheaper and more robust but discards information.
Vocabulary Filtering Strategies:
| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Minimum Document Frequency | Remove terms appearing in fewer than k documents | Removes typos, very rare terms | May remove important domain terms |
| Maximum Document Frequency | Remove terms appearing in more than p% of documents | Removes non-discriminative terms | May remove important common words |
| Stop Word Removal | Remove predefined common words | Reduces noise, smaller vocabulary | May lose meaning ("not", "no") |
| Maximum Vocabulary Size | Keep only top-k most frequent terms | Controlled dimensionality | Arbitrary cutoff |
| Minimum Term Frequency | Remove terms with total count < k | Removes very rare words | Different from doc frequency |
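Several of these strategies map directly onto CountVectorizer arguments covered later in this page; a brief sketch of the correspondence:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Each argument below corresponds to a row of the table above.
vectorizer = CountVectorizer(
    min_df=2,              # minimum document frequency (absolute count)
    max_df=0.9,            # maximum document frequency (ratio of documents)
    stop_words='english',  # stop word removal
    max_features=5000,     # maximum vocabulary size (top-k by frequency)
)
# Note: a minimum *term* frequency cutoff has no direct CountVectorizer
# argument; it must be implemented separately if needed.
```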
Zipf's Law and Its Implications:
Zipf's Law states that the frequency of a word is inversely proportional to its rank:
frequency(r) ∝ 1/r^α
where r is the rank and α ≈ 1 for natural language.
Implications for Vocabulary Management: a handful of very frequent terms accounts for a large share of all tokens, while a long tail of rare terms inflates the vocabulary with little coverage gain, so frequency-based cutoffs can remove many terms while sacrificing little token coverage (see the sketch below).
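A quick sketch of what the power law implies in practice, assuming an idealized Zipf distribution with α = 1 over a 10,000-term vocabulary:

```python
import numpy as np

# Idealized Zipf frequencies: frequency(r) ∝ 1/r for ranks 1..10,000
ranks = np.arange(1, 10_001)
freqs = 1.0 / ranks
freqs /= freqs.sum()  # normalize to token probabilities

coverage = np.cumsum(freqs)
for k in (100, 1000, 5000):
    print(f"top {k:>5} terms cover {coverage[k - 1]:.1%} of tokens")
# Under this idealized model, roughly half of all tokens come from
# only the first hundred terms, while the long tail adds little coverage.
```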
Heaps' Law:
V(n) ≈ K × n^β
where V(n) is vocabulary size after n tokens, K ≈ 10-100, and β ≈ 0.4-0.6.
This means doubling the corpus size does NOT double the vocabulary size: vocabulary growth is sublinear and slows as the corpus grows.
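A quick worked example under the stated ranges: taking β = 0.5,

V(2n) / V(n) = (2n)^β / n^β = 2^β ≈ 1.41

so doubling the corpus grows the vocabulary by only about 41% under that assumption.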
```python
import numpy as np
from collections import Counter
from typing import List, Set, Dict, Tuple
import matplotlib.pyplot as plt


class VocabularyManager:
    """
    Comprehensive vocabulary management for unigram systems.
    Implements various filtering strategies and analysis methods.
    """

    def __init__(self):
        self.term_freq: Counter = Counter()
        self.doc_freq: Counter = Counter()
        self.num_documents: int = 0

    def fit(self, documents: List[List[str]]) -> 'VocabularyManager':
        """
        Build vocabulary statistics from tokenized documents.

        Args:
            documents: List of tokenized documents (list of token lists)
        """
        self.num_documents = len(documents)

        for doc in documents:
            # Count term frequencies
            self.term_freq.update(doc)
            # Count document frequencies (each term once per doc)
            self.doc_freq.update(set(doc))

        return self

    def filter_vocabulary(
        self,
        min_doc_freq: int = 1,
        max_doc_freq_ratio: float = 1.0,
        min_term_freq: int = 1,
        max_vocab_size: int = None,
        stop_words: Set[str] = None,
    ) -> List[str]:
        """
        Apply multiple filtering strategies to build final vocabulary.

        Returns:
            Ordered list of vocabulary terms
        """
        candidates = set(self.term_freq.keys())

        # Apply filters
        if stop_words:
            candidates -= stop_words

        max_doc_freq = int(self.num_documents * max_doc_freq_ratio)

        candidates = {
            term for term in candidates
            if (self.doc_freq[term] >= min_doc_freq and
                self.doc_freq[term] <= max_doc_freq and
                self.term_freq[term] >= min_term_freq)
        }

        # Sort by frequency and limit size
        sorted_terms = sorted(
            candidates,
            key=lambda t: self.term_freq[t],
            reverse=True
        )

        if max_vocab_size:
            sorted_terms = sorted_terms[:max_vocab_size]

        return sorted_terms

    def analyze_zipf_distribution(self) -> Dict:
        """
        Analyze how well the vocabulary follows Zipf's Law.
        """
        # Sort terms by frequency
        sorted_freqs = sorted(self.term_freq.values(), reverse=True)
        ranks = np.arange(1, len(sorted_freqs) + 1)
        freqs = np.array(sorted_freqs)

        # Fit Zipf's Law: log(freq) = -α * log(rank) + log(C)
        log_ranks = np.log(ranks)
        log_freqs = np.log(freqs + 1)  # +1 to avoid log(0)

        # Linear regression in log-log space
        coeffs = np.polyfit(log_ranks, log_freqs, 1)
        alpha = -coeffs[0]

        # Calculate R² for fit quality
        predicted = coeffs[0] * log_ranks + coeffs[1]
        ss_res = np.sum((log_freqs - predicted) ** 2)
        ss_tot = np.sum((log_freqs - np.mean(log_freqs)) ** 2)
        r_squared = 1 - (ss_res / ss_tot)

        return {
            'zipf_alpha': alpha,
            'r_squared': r_squared,
            'vocabulary_size': len(sorted_freqs),
            'total_tokens': sum(sorted_freqs),
            'hapax_count': sum(1 for f in sorted_freqs if f == 1),
            'hapax_ratio': sum(1 for f in sorted_freqs if f == 1) / len(sorted_freqs),
        }

    def analyze_coverage(self, vocab_sizes: List[int]) -> Dict[int, float]:
        """
        Analyze what percentage of tokens are covered by top-k vocabulary.
        """
        sorted_freqs = sorted(self.term_freq.values(), reverse=True)
        total_tokens = sum(sorted_freqs)

        cumsum = np.cumsum(sorted_freqs)

        coverage = {}
        for size in vocab_sizes:
            if size <= len(sorted_freqs):
                coverage[size] = cumsum[size - 1] / total_tokens
            else:
                coverage[size] = 1.0

        return coverage


def demonstrate_vocabulary_management():
    """
    Demonstrate vocabulary management concepts.
    """
    # Simulated tokenized corpus
    np.random.seed(42)

    # Create corpus following Zipf distribution
    vocab = [f"word_{i}" for i in range(10000)]
    zipf_probs = 1 / np.arange(1, 10001)
    zipf_probs /= zipf_probs.sum()

    documents = []
    for _ in range(1000):
        doc_length = np.random.randint(50, 200)
        doc = list(np.random.choice(vocab, size=doc_length, p=zipf_probs))
        documents.append(doc)

    # Build vocabulary manager
    manager = VocabularyManager()
    manager.fit(documents)

    # Analyze Zipf distribution
    zipf_stats = manager.analyze_zipf_distribution()
    print("Zipf Distribution Analysis:")
    for key, value in zipf_stats.items():
        print(f"  {key}: {value:.4f}" if isinstance(value, float) else f"  {key}: {value}")

    # Analyze coverage
    coverage = manager.analyze_coverage([100, 500, 1000, 2000, 5000])
    print("\nVocabulary Coverage:")
    for size, cov in coverage.items():
        print(f"  Top {size} terms cover {cov*100:.1f}% of tokens")

    # Compare filtering strategies
    print("\nFiltering Strategy Comparison:")
    strategies = [
        {'name': 'No Filtering', 'params': {}},
        {'name': 'Min Doc Freq 5', 'params': {'min_doc_freq': 5}},
        {'name': 'Max Doc Freq 50%', 'params': {'max_doc_freq_ratio': 0.5}},
        {'name': 'Combined', 'params': {'min_doc_freq': 3, 'max_doc_freq_ratio': 0.8}},
        {'name': 'Limited 1000', 'params': {'max_vocab_size': 1000}},
    ]

    for strategy in strategies:
        vocab = manager.filter_vocabulary(**strategy['params'])
        print(f"  {strategy['name']}: {len(vocab)} terms")


if __name__ == "__main__":
    demonstrate_vocabulary_management()
```

While understanding the underlying mechanics is essential, production systems typically use well-tested libraries like scikit-learn. The CountVectorizer class provides a complete unigram feature extraction pipeline with extensive configuration options.
Key Parameters:
- `max_features`: Maximum vocabulary size
- `min_df`: Minimum document frequency (int) or ratio (float)
- `max_df`: Maximum document frequency ratio
- `stop_words`: Stop word list or language identifier
- `binary`: Whether to use binary (True) or count (False) features
- `tokenizer`: Custom tokenization function
- `preprocessor`: Custom preprocessing function
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
import numpy as np
from typing import List, Tuple


def comprehensive_unigram_example():
    """
    Complete example of unigram-based text classification.
    """
    # Sample dataset (sentiment analysis)
    texts = [
        "This movie was absolutely fantastic and amazing",
        "I loved every moment of this wonderful film",
        "Terrible waste of time, completely boring",
        "The worst film I have ever seen, awful",
        "Great acting, brilliant storyline, highly recommend",
        "Disappointing, slow, and not worth watching",
        "A masterpiece of modern cinema, outstanding",
        "Complete garbage, do not waste your money",
        "Beautiful cinematography and excellent performances",
        "Boring plot with terrible acting throughout",
    ]
    labels = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]  # 1=positive, 0=negative

    # Basic unigram vectorizer
    basic_vectorizer = CountVectorizer()
    X_basic = basic_vectorizer.fit_transform(texts)

    print("Basic Vectorizer:")
    print(f"  Vocabulary size: {len(basic_vectorizer.vocabulary_)}")
    print(f"  Feature matrix shape: {X_basic.shape}")
    print(f"  Sparsity: {1 - X_basic.nnz / (X_basic.shape[0] * X_basic.shape[1]):.2%}")

    # Configured vectorizer with filtering
    configured_vectorizer = CountVectorizer(
        lowercase=True,
        stop_words='english',
        min_df=2,          # Appear in at least 2 documents
        max_df=0.8,        # Appear in at most 80% of documents
        max_features=100,  # Limit vocabulary
        binary=False,      # Use counts, not binary
    )
    X_configured = configured_vectorizer.fit_transform(texts)

    print("\nConfigured Vectorizer:")
    print(f"  Vocabulary size: {len(configured_vectorizer.vocabulary_)}")
    print(f"  Feature matrix shape: {X_configured.shape}")
    print(f"  Vocabulary: {list(configured_vectorizer.vocabulary_.keys())}")

    # Build classification pipeline
    pipeline = Pipeline([
        ('vectorizer', CountVectorizer(
            lowercase=True,
            stop_words='english',
            min_df=1,
            max_features=50,
        )),
        ('classifier', LogisticRegression(random_state=42))
    ])

    # Note: In practice, you'd use a proper train/test split
    # This is just for demonstration
    pipeline.fit(texts, labels)

    # Analyze feature importance
    feature_names = pipeline.named_steps['vectorizer'].get_feature_names_out()
    coefficients = pipeline.named_steps['classifier'].coef_[0]

    # Sort features by coefficient magnitude
    feature_importance = sorted(
        zip(feature_names, coefficients),
        key=lambda x: abs(x[1]),
        reverse=True
    )

    print("\nTop Features by Importance:")
    print("  Positive Indicators:")
    for name, coef in feature_importance[:5]:
        if coef > 0:
            print(f"    '{name}': {coef:.3f}")
    print("  Negative Indicators:")
    for name, coef in feature_importance[:5]:
        if coef < 0:
            print(f"    '{name}': {coef:.3f}")

    # Test on new examples
    test_texts = [
        "This was a wonderful experience",
        "Absolutely terrible, would not recommend",
    ]
    predictions = pipeline.predict(test_texts)
    probabilities = pipeline.predict_proba(test_texts)

    print("\nPredictions on New Data:")
    for text, pred, prob in zip(test_texts, predictions, probabilities):
        sentiment = "Positive" if pred == 1 else "Negative"
        print(f"  '{text[:40]}...'")
        print(f"    Prediction: {sentiment} (confidence: {max(prob):.2%})")

    return pipeline


def analyze_vocabulary_statistics():
    """
    Detailed analysis of vocabulary characteristics.
    """
    # Larger sample for better statistics
    from sklearn.datasets import fetch_20newsgroups

    try:
        # Fetch a subset of 20 newsgroups
        newsgroups = fetch_20newsgroups(
            subset='train',
            categories=['sci.space', 'comp.graphics'],
            remove=('headers', 'footers', 'quotes')
        )
        texts = newsgroups.data[:500]
    except Exception:
        # Fallback if dataset not available
        texts = ["Sample text " * 50] * 100

    # Analyze at different vocabulary sizes
    vocab_sizes = [100, 500, 1000, 2000, 5000, None]

    print("Vocabulary Size Analysis:")
    for max_features in vocab_sizes:
        vectorizer = CountVectorizer(
            max_features=max_features,
            stop_words='english',
            min_df=2,
        )
        X = vectorizer.fit_transform(texts)

        actual_vocab = len(vectorizer.vocabulary_)
        nonzero_ratio = X.nnz / (X.shape[0] * X.shape[1])
        avg_terms = X.nnz / X.shape[0]

        print(f"\n  Max Features: {max_features or 'Unlimited'}")
        print(f"    Actual vocabulary: {actual_vocab}")
        print(f"    Non-zero ratio: {nonzero_ratio:.2%}")
        print(f"    Avg terms per doc: {avg_terms:.1f}")


if __name__ == "__main__":
    comprehensive_unigram_example()
    print("\n" + "="*60 + "\n")
    analyze_vocabulary_statistics()
```

Understanding when unigrams are appropriate—and when they fall short—is essential for effective feature engineering. Let's examine both sides with concrete examples.
Consider these sentences: "The movie was not good" and "The movie was good."
With unigrams, their representations are nearly identical: {the, movie, was, not, good} and {the, movie, was, good}. The only difference is the presence of "not"; without context, the model must learn that "not" combined with "good" signals negative sentiment. This works sometimes, but fails when negation is separated from its target or when context is more nuanced.
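A brief sketch of this effect with scikit-learn's CountVectorizer (its default tokenizer drops single-character tokens, which does not matter for these sentences):

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["The movie was not good", "The movie was good"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())  # ['good' 'movie' 'not' 'the' 'was']
print(X.toarray())
# [[1 1 1 1 1]
#  [1 1 0 1 1]]  <- the two rows differ only in the 'not' column
```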
When Unigrams Work Well: topic and document classification, spam filtering, and information retrieval, where the mere presence of indicative words carries most of the signal.
When Unigrams Struggle: negation and other order-dependent constructions, multi-word phrases and idioms, and tasks where nuance or longer-range context matters, as in the example above.
Deploying unigram-based systems in production introduces considerations that don't appear in toy examples. Memory efficiency, vocabulary consistency, and handling new words all require careful design.
Key Production Challenges: keeping memory usage under control for large vocabularies, ensuring the training-time vocabulary is applied consistently at inference time, and handling out-of-vocabulary words that only appear after deployment.
```python
import pickle
import hashlib
from pathlib import Path
from typing import Dict, Optional
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
import numpy as np


class ProductionUnigramVectorizer:
    """
    Production-ready unigram vectorizer with versioning,
    serialization, and OOV handling.
    """

    def __init__(
        self,
        max_features: int = 10000,
        min_df: int = 5,
        handle_oov: str = 'ignore',  # 'ignore', 'unk', 'error'
    ):
        self.max_features = max_features
        self.min_df = min_df
        self.handle_oov = handle_oov

        self._vectorizer = CountVectorizer(
            max_features=max_features,
            min_df=min_df,
            lowercase=True,
            stop_words='english',
        )
        self._version: Optional[str] = None
        self._is_fitted: bool = False

    def fit(self, documents: list) -> 'ProductionUnigramVectorizer':
        """Fit vectorizer and compute version hash."""
        self._vectorizer.fit(documents)
        self._is_fitted = True

        # Create version hash from vocabulary
        vocab_str = '|'.join(sorted(self._vectorizer.vocabulary_.keys()))
        self._version = hashlib.sha256(vocab_str.encode()).hexdigest()[:12]

        return self

    def transform(self, documents: list) -> np.ndarray:
        """Transform with OOV handling."""
        if not self._is_fitted:
            raise ValueError("Vectorizer must be fitted before transform")

        if self.handle_oov == 'error':
            # Check for OOV terms
            for doc in documents:
                tokens = doc.lower().split()
                for token in tokens:
                    if token not in self._vectorizer.vocabulary_:
                        raise ValueError(f"OOV term: {token}")

        return self._vectorizer.transform(documents)

    @property
    def version(self) -> str:
        """Get vocabulary version hash."""
        return self._version

    @property
    def vocabulary_size(self) -> int:
        return len(self._vectorizer.vocabulary_)

    def save(self, path: Path) -> None:
        """Serialize vectorizer to disk."""
        path = Path(path)

        # Save with version in filename
        filename = f"unigram_vectorizer_{self._version}.pkl"
        with open(path / filename, 'wb') as f:
            pickle.dump({
                'vectorizer': self._vectorizer,
                'version': self._version,
                'config': {
                    'max_features': self.max_features,
                    'min_df': self.min_df,
                    'handle_oov': self.handle_oov,
                }
            }, f)

        print(f"Saved vectorizer to {path / filename}")

    @classmethod
    def load(cls, path: Path) -> 'ProductionUnigramVectorizer':
        """Load serialized vectorizer."""
        with open(path, 'rb') as f:
            data = pickle.load(f)

        instance = cls(**data['config'])
        instance._vectorizer = data['vectorizer']
        instance._version = data['version']
        instance._is_fitted = True

        return instance


class FeatureHashingVectorizer:
    """
    Feature hashing alternative for unbounded vocabulary.

    Advantages:
    - No explicit vocabulary storage
    - Handles OOV naturally
    - Constant memory regardless of corpus size

    Disadvantages:
    - Hash collisions reduce accuracy
    - Cannot reverse-map features to words
    - Non-interpretable features
    """

    def __init__(self, n_features: int = 2**18):
        self.n_features = n_features
        self._vectorizer = HashingVectorizer(
            n_features=n_features,
            alternate_sign=True,  # Reduce collision impact
            lowercase=True,
            stop_words='english',
        )

    def transform(self, documents: list) -> np.ndarray:
        """Transform using feature hashing."""
        return self._vectorizer.transform(documents)

    def estimate_collision_rate(self, vocabulary_size: int) -> float:
        """
        Estimate probability of collision.

        Using birthday problem approximation:
        P(collision) ≈ 1 - e^(-n² / 2m)
        where n = vocabulary size, m = number of hash buckets
        """
        n = vocabulary_size
        m = self.n_features

        # More accurate approximation
        collision_prob = 1 - np.exp(-n * (n - 1) / (2 * m))
        return collision_prob


def demonstrate_production_patterns():
    """Demonstrate production deployment patterns."""

    # Training phase
    training_corpus = [
        "Machine learning models require careful feature engineering",
        "Deep learning has revolutionized natural language processing",
        "Feature engineering remains important for classical ML models",
        "Neural networks can learn representations automatically",
        "Traditional ML methods still outperform deep learning on small data",
    ]

    # Create and fit vectorizer
    vectorizer = ProductionUnigramVectorizer(
        max_features=100,
        min_df=1,
        handle_oov='ignore'
    )
    vectorizer.fit(training_corpus)

    print(f"Vocabulary Version: {vectorizer.version}")
    print(f"Vocabulary Size: {vectorizer.vocabulary_size}")

    # Inference phase
    inference_texts = [
        "Feature engineering is crucial for machine learning",
        "This contains completely unknown words like quantum and blockchain",
    ]

    X = vectorizer.transform(inference_texts)
    print(f"\nInference matrix shape: {X.shape}")
    print(f"Non-zero features doc 1: {X[0].nnz}")
    print(f"Non-zero features doc 2: {X[1].nnz}")  # Fewer due to OOV

    # Feature hashing comparison
    print("\n--- Feature Hashing Alternative ---")
    hasher = FeatureHashingVectorizer(n_features=2**16)
    X_hashed = hasher.transform(inference_texts)
    print(f"Hashed matrix shape: {X_hashed.shape}")
    print(f"Estimated collision rate (1000 vocab): {hasher.estimate_collision_rate(1000):.4%}")
    print(f"Estimated collision rate (10000 vocab): {hasher.estimate_collision_rate(10000):.4%}")


if __name__ == "__main__":
    demonstrate_production_patterns()
```

We've explored unigrams comprehensively, from their mathematical foundations to production deployment considerations. The key insights:

- Unigrams treat each token as an independent feature under the bag-of-words assumption.
- The representation choice (binary, TF, normalized TF, TF-IDF) determines how frequency information is encoded.
- Document-term matrices are extremely sparse, so sparse data structures are essential.
- Tokenization and vocabulary filtering, guided by Zipf's and Heaps' Laws, largely determine feature quality and dimensionality.
- Production systems need vocabulary versioning, consistent serialization, and a strategy for out-of-vocabulary terms (or feature hashing).
Looking Ahead: From Unigrams to Bigrams
Unigrams' fundamental limitation—ignoring word order and context—motivates the extension to bigrams: pairs of consecutive tokens. In the next page, we'll see how bigrams capture local context, enable phrase detection, and address many limitations of the pure unigram approach.
The progression from unigrams to bigrams to general n-grams represents the natural evolution of text feature engineering: each step captures more linguistic structure while increasing computational complexity. Understanding this tradeoff is essential for effective NLP system design.
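As a small preview, and using the same whitespace tokenization assumed earlier, extracting bigrams is a one-line extension of unigram extraction:

```python
tokens = "the quick brown fox".lower().split()

unigrams = tokens
bigrams = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]

print(unigrams)  # ['the', 'quick', 'brown', 'fox']
print(bigrams)   # [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```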
You now have a deep understanding of unigrams as the foundation of text feature engineering. You understand their mathematical formalization, implementation patterns, vocabulary management strategies, and production considerations. Next, we'll explore bigrams and how they capture the local context that unigrams miss.