Machine learning algorithms operate in the realm of numbers—vectors, matrices, and tensors that can be manipulated through mathematical operations like dot products, matrix multiplication, and gradient descent. Yet human language is inherently symbolic: words, sentences, and documents are discrete sequences of characters with no obvious numerical structure.
This creates the fundamental challenge of text feature engineering: How do we transform the rich, complex, and inherently unstructured nature of human language into numerical representations that preserve semantic meaning while being computationally tractable?
The Bag of Words (BoW) model stands as one of the most elegant and historically significant solutions to this problem. Despite its simplicity—or perhaps because of it—BoW has powered decades of text classification, sentiment analysis, spam detection, and information retrieval systems. Understanding BoW deeply is not merely an academic exercise; it is foundational knowledge that illuminates why more sophisticated techniques like TF-IDF, word embeddings, and transformer-based representations work the way they do.
By the end of this page, you will understand: (1) The theoretical foundations and intuition behind the Bag of Words model, (2) The mathematical formalization of document-term matrices, (3) The step-by-step process of constructing BoW representations, (4) Implementation considerations and computational tradeoffs, and (5) The assumptions BoW makes and when those assumptions are reasonable versus limiting.
Imagine you're sorting through a library of documents, trying to categorize them by topic. One simple but remarkably effective strategy would be to ignore how words are arranged and focus solely on which words appear and how often. A document about astronomy will likely contain words like "star," "galaxy," "telescope," and "orbit." A document about cooking will feature "recipe," "ingredient," "temperature," and "stir."
The Bag of Words model formalizes exactly this intuition. It represents each document as a multiset (or "bag") of its constituent words, discarding grammar, word order, and sentence structure entirely.
The term "bag" comes from mathematics: a bag (or multiset) is like a set, but it allows multiple instances of the same element. If a document contains the word "data" three times, the bag records three instances of "data"—unlike a set, which would record only one.
BoW operates on the hypothesis that word occurrence patterns are sufficient for many text classification tasks. By counting what words appear—without caring about their order—we capture the 'aboutness' of a document. A document containing 'machine', 'learning', 'algorithm', and 'training' probably concerns ML, regardless of how those words are arranged syntactically.
A worked example:
Consider two simple documents:
- Document 1: "The cat sat on the mat"
- Document 2: "The dog ran in the park"
Under the BoW model, we represent each document by counting word occurrences:
| Word | Document 1 | Document 2 |
|---|---|---|
| the | 2 | 2 |
| cat | 1 | 0 |
| sat | 1 | 0 |
| on | 1 | 0 |
| mat | 1 | 0 |
| dog | 0 | 1 |
| ran | 0 | 1 |
| in | 0 | 1 |
| park | 0 | 1 |
Each document becomes a vector of counts—a numerical representation we can feed to any machine learning algorithm.
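To make the counting step concrete, here is a minimal sketch using Python's collections.Counter; the two sentences are the same example documents as in the table above.

```python
from collections import Counter

doc1 = "the cat sat on the mat".split()
doc2 = "the dog ran in the park".split()

# Shared vocabulary: all unique words, in first-seen order
vocab = list(dict.fromkeys(doc1 + doc2))

counts1, counts2 = Counter(doc1), Counter(doc2)

# Each document becomes a vector of counts over the shared vocabulary
vec1 = [counts1[w] for w in vocab]  # [2, 1, 1, 1, 1, 0, 0, 0, 0]
vec2 = [counts2[w] for w in vocab]  # [2, 0, 0, 0, 0, 1, 1, 1, 1]
```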
The Bag of Words concept predates modern ML by decades. It emerged from information retrieval research in the 1950s-60s, particularly the work of Hans Peter Luhn at IBM. The model's longevity speaks to its fundamental soundness: when you want interpretable, fast, and reasonably effective text features, BoW remains a strong baseline even today.
Let us formalize the Bag of Words model mathematically, as precision here will clarify implementation and enable rigorous analysis.
Definitions:
Let D = {d₁, d₂, ..., dₙ} be a corpus (collection) of n documents.
Let V = {t₁, t₂, ..., t_m} be the vocabulary—the set of all unique terms (words) appearing across the entire corpus. The vocabulary size |V| = m.
For each document dᵢ, we define its BoW representation as a vector xᵢ ∈ ℕᵐ where:
xᵢ[j] = count(tⱼ, dᵢ) = number of times term tⱼ appears in document dᵢ
This yields a document-term matrix X ∈ ℕⁿˣᵐ whose i-th row is the BoW vector of document dᵢ; that is, Xᵢⱼ = count(tⱼ, dᵢ).
```python
import numpy as np
from collections import Counter
from typing import List, Dict, Tuple


def build_vocabulary(corpus: List[str]) -> Dict[str, int]:
    """
    Build vocabulary from corpus, mapping each unique term to an index.

    Args:
        corpus: List of documents (strings)

    Returns:
        Dictionary mapping term -> column index
    """
    vocabulary = {}
    idx = 0
    for document in corpus:
        # Tokenize: split on whitespace (simplified)
        tokens = document.lower().split()
        for token in tokens:
            if token not in vocabulary:
                vocabulary[token] = idx
                idx += 1
    return vocabulary


def create_bow_matrix(corpus: List[str], vocabulary: Dict[str, int]) -> np.ndarray:
    """
    Create the document-term matrix (Bag of Words representation).

    Args:
        corpus: List of documents
        vocabulary: Mapping from term to column index

    Returns:
        Document-term matrix of shape (n_documents, vocabulary_size)
    """
    n_docs = len(corpus)
    vocab_size = len(vocabulary)

    # Initialize matrix with zeros
    X = np.zeros((n_docs, vocab_size), dtype=np.int32)

    for doc_idx, document in enumerate(corpus):
        tokens = document.lower().split()
        # Count term frequencies in this document
        term_counts = Counter(tokens)
        for term, count in term_counts.items():
            if term in vocabulary:
                col_idx = vocabulary[term]
                X[doc_idx, col_idx] = count

    return X


# Example usage
corpus = [
    "The cat sat on the mat",
    "The dog ran in the park",
    "The cat and dog played together"
]

vocabulary = build_vocabulary(corpus)
print(f"Vocabulary ({len(vocabulary)} terms): {vocabulary}")

X = create_bow_matrix(corpus, vocabulary)
print(f"\nDocument-Term Matrix (shape {X.shape}):")
print(X)
```

The Vector Space Model:
The BoW representation situates documents in a high-dimensional vector space where each vocabulary term defines one axis (dimension) and each document is a point, or vector, whose j-th coordinate is the count of term tⱼ in that document.
This geometric interpretation is powerful: it allows us to use distance metrics (Euclidean distance, cosine similarity) to measure document similarity, enabling tasks like document clustering, nearest-neighbor search, and information retrieval.
Cosine similarity is particularly important:
$$\text{cosine\_similarity}(\mathbf{x}_i, \mathbf{x}_j) = \frac{\mathbf{x}_i \cdot \mathbf{x}_j}{\|\mathbf{x}_i\|\,\|\mathbf{x}_j\|}$$
Cosine similarity measures the angle between document vectors, making it invariant to document length—a crucial property since documents vary greatly in size.
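As a quick illustration, here is a small sketch of cosine similarity on BoW count vectors using plain NumPy (the two vectors reuse the worked example from earlier); note that scaling a vector by a constant, as a document repeated twice would, leaves the similarity unchanged.

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two BoW vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

doc_a = np.array([2, 1, 1, 1, 1, 0, 0, 0, 0])  # "the cat sat on the mat"
doc_b = np.array([2, 0, 0, 0, 0, 1, 1, 1, 1])  # "the dog ran in the park"

print(cosine_similarity(doc_a, doc_b))       # 0.5: overlap only on "the"
print(cosine_similarity(doc_a, 2 * doc_a))   # 1.0: invariant to length scaling
```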
A typical English corpus easily has a vocabulary of 50,000-100,000 unique terms. This means each document vector has 50,000-100,000 dimensions. While daunting, this high dimensionality is manageable because BoW vectors are extremely sparse—most entries are zero. We'll explore sparsity in detail later in this module.
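To get a feel for the savings, the following rough sketch (assuming SciPy is installed) compares dense float64 storage against compressed sparse row (CSR) storage for a synthetic matrix with about 0.1% non-zero entries; the exact figures are illustrative only.

```python
import numpy as np
from scipy import sparse

n_docs, vocab_size = 10_000, 50_000

# Simulate a BoW matrix where roughly 0.1% of entries are non-zero
X_sparse = sparse.random(n_docs, vocab_size, density=0.001,
                         format="csr", dtype=np.float64, random_state=0)

dense_bytes = n_docs * vocab_size * 8            # float64 dense storage
sparse_bytes = (X_sparse.data.nbytes
                + X_sparse.indices.nbytes
                + X_sparse.indptr.nbytes)        # CSR stores only non-zeros

print(f"Dense:  {dense_bytes / 1e6:.0f} MB")     # ~4000 MB
print(f"Sparse: {sparse_bytes / 1e6:.0f} MB")    # only a few MB
```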
Constructing a production-quality Bag of Words representation involves multiple preprocessing steps, each with design decisions that affect downstream performance. Let's examine each stage in detail.
```python
import re
import numpy as np
from collections import Counter
from typing import List, Dict, Set, Optional


class BagOfWordsVectorizer:
    """
    A production-quality Bag of Words vectorizer with full preprocessing pipeline.
    """

    # Common English stop words
    DEFAULT_STOP_WORDS: Set[str] = {
        'the', 'a', 'an', 'is', 'are', 'was', 'were', 'be', 'been', 'being',
        'have', 'has', 'had', 'do', 'does', 'did', 'will', 'would', 'could',
        'should', 'may', 'might', 'must', 'shall', 'can', 'need', 'dare',
        'ought', 'used', 'to', 'of', 'in', 'for', 'on', 'with', 'at', 'by',
        'from', 'as', 'into', 'through', 'during', 'before', 'after', 'above',
        'below', 'between', 'under', 'again', 'further', 'then', 'once',
        'here', 'there', 'when', 'where', 'why', 'how', 'all', 'each', 'few',
        'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only',
        'own', 'same', 'so', 'than', 'too', 'very', 'just', 'and', 'but',
        'if', 'or', 'because', 'until', 'while', 'although', 'though',
        'this', 'that', 'these', 'those', 'i', 'you', 'he', 'she', 'it',
        'we', 'they', 'what', 'which', 'who', 'whom', 'me', 'him', 'her',
        'us', 'them'
    }

    def __init__(
        self,
        lowercase: bool = True,
        remove_stop_words: bool = True,
        min_df: int = 1,
        max_df: float = 1.0,
        max_features: Optional[int] = None,
        binary: bool = False
    ):
        """
        Initialize the vectorizer.

        Args:
            lowercase: Convert all text to lowercase
            remove_stop_words: Remove common stop words
            min_df: Minimum document frequency (int: count, float: proportion)
            max_df: Maximum document frequency (int: count, float: proportion)
            max_features: Limit vocabulary to top N features by frequency
            binary: If True, use binary indicators instead of counts
        """
        self.lowercase = lowercase
        self.remove_stop_words = remove_stop_words
        self.min_df = min_df
        self.max_df = max_df
        self.max_features = max_features
        self.binary = binary

        self.vocabulary_: Optional[Dict[str, int]] = None
        self.feature_names_: Optional[List[str]] = None

    def _tokenize(self, text: str) -> List[str]:
        """
        Tokenize text into words. Removes punctuation and splits on whitespace.
        """
        # Convert to lowercase if specified
        if self.lowercase:
            text = text.lower()

        # Remove punctuation and split
        # Keep alphanumeric characters and spaces
        text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)
        tokens = text.split()

        # Remove stop words if specified
        if self.remove_stop_words:
            tokens = [t for t in tokens if t not in self.DEFAULT_STOP_WORDS]

        return tokens

    def fit(self, corpus: List[str]) -> 'BagOfWordsVectorizer':
        """
        Learn vocabulary from the corpus.
        """
        n_docs = len(corpus)

        # Count document frequencies for each term
        doc_freq: Counter = Counter()
        for document in corpus:
            tokens = set(self._tokenize(document))  # Unique tokens per doc
            doc_freq.update(tokens)

        # Apply min_df and max_df filters
        min_count = self.min_df if isinstance(self.min_df, int) else int(self.min_df * n_docs)
        max_count = self.max_df if isinstance(self.max_df, int) else int(self.max_df * n_docs)

        # Filter terms by document frequency
        valid_terms = {
            term for term, count in doc_freq.items()
            if min_count <= count <= max_count
        }

        # Count total occurrences for feature limitation
        term_freq: Counter = Counter()
        for document in corpus:
            tokens = self._tokenize(document)
            for token in tokens:
                if token in valid_terms:
                    term_freq[token] += 1

        # Select top features if max_features is specified
        if self.max_features is not None:
            selected_terms = [term for term, _ in term_freq.most_common(self.max_features)]
        else:
            selected_terms = list(valid_terms)

        # Sort terms alphabetically for consistency
        selected_terms = sorted(selected_terms)

        # Build vocabulary mapping
        self.vocabulary_ = {term: idx for idx, term in enumerate(selected_terms)}
        self.feature_names_ = selected_terms

        return self

    def transform(self, corpus: List[str]) -> np.ndarray:
        """
        Transform documents to document-term matrix.
        """
        if self.vocabulary_ is None:
            raise ValueError("Vectorizer not fitted. Call fit() first.")

        n_docs = len(corpus)
        vocab_size = len(self.vocabulary_)
        X = np.zeros((n_docs, vocab_size), dtype=np.int32)

        for doc_idx, document in enumerate(corpus):
            tokens = self._tokenize(document)
            term_counts = Counter(tokens)
            for term, count in term_counts.items():
                if term in self.vocabulary_:
                    col_idx = self.vocabulary_[term]
                    X[doc_idx, col_idx] = 1 if self.binary else count

        return X

    def fit_transform(self, corpus: List[str]) -> np.ndarray:
        """Fit vocabulary and transform in one step."""
        return self.fit(corpus).transform(corpus)

    def get_feature_names(self) -> List[str]:
        """Return the feature names (vocabulary terms)."""
        return self.feature_names_ or []


# Demonstration
if __name__ == "__main__":
    corpus = [
        "Machine learning is a subset of artificial intelligence.",
        "Deep learning uses neural networks with many layers.",
        "Natural language processing enables machines to understand text.",
        "Supervised learning requires labeled training data.",
        "Unsupervised learning discovers patterns without labels."
    ]

    vectorizer = BagOfWordsVectorizer(
        lowercase=True,
        remove_stop_words=True,
        min_df=1,
        max_df=1.0,
        binary=False
    )

    X = vectorizer.fit_transform(corpus)

    print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")
    print(f"Feature names: {vectorizer.get_feature_names()}")
    print(f"\nDocument-Term Matrix shape: {X.shape}")
    print(f"\nDocument-Term Matrix:\n{X}")
```

In practice, you'll typically use sklearn.feature_extraction.text.CountVectorizer, which implements this entire pipeline efficiently. Understanding the internals—as we've done here—helps you configure it correctly and debug unexpected behavior.
The basic BoW model uses raw term counts as feature values. However, alternative weighting schemes can improve performance for specific tasks. Understanding these schemes provides a foundation for more sophisticated approaches like TF-IDF (covered in the next module).
| Scheme | Formula | Use Case | Trade-offs |
|---|---|---|---|
| Raw Count | x[j] = count(tⱼ, d) | When word frequency directly indicates importance | Easily dominated by long documents; common words overweighted |
| Binary | x[j] = 1 if tⱼ ∈ d, else 0 | When presence matters more than frequency | Loses frequency information; treats all terms equally |
| Term Frequency (TF) | x[j] = count(tⱼ, d) / |d| | Length-normalized comparison | Still overweights common terms across corpus |
| Log Frequency | x[j] = 1 + log(count) if count > 0, else 0 | Dampens impact of repeated terms | Compresses count differences: two occurrences are weighted as much less than twice one occurrence |
| Boolean TF | x[j] = 1 if count > 0, else 0 | Same as binary; explicit boolean representation | Maximum information compression; loses nuance |
Choosing a Weighting Scheme:
The optimal weighting scheme depends on your task:
Spam detection: Binary encoding often works well. The presence of certain terms ("free," "winner," "click") strongly indicates spam, but whether they appear once or five times matters less.
Topic modeling: Raw counts or log frequencies work better because repeated occurrences of topic-related terms ("election," "candidate," "vote") genuinely indicate topicality.
Authorship attribution: Relative frequencies matter—different authors have different word usage patterns. Term frequency normalization helps compare authors who write at different lengths.
Semantic similarity: Log frequencies or TF-IDF (covered later) work best, as they balance local importance (within document) against global significance (across corpus).
```python
import numpy as np
from typing import Literal

WeightingScheme = Literal["raw", "binary", "tf", "log", "augmented"]


def apply_weighting(
    raw_counts: np.ndarray,
    scheme: WeightingScheme = "raw"
) -> np.ndarray:
    """
    Apply different weighting schemes to raw count matrix.

    Args:
        raw_counts: Document-term matrix with raw counts
        scheme: Weighting scheme to apply

    Returns:
        Weighted document-term matrix
    """
    if scheme == "raw":
        return raw_counts.astype(np.float64)

    elif scheme == "binary":
        return (raw_counts > 0).astype(np.float64)

    elif scheme == "tf":
        # Term Frequency: normalize by document length
        doc_lengths = raw_counts.sum(axis=1, keepdims=True)
        # Avoid division by zero for empty documents
        doc_lengths = np.maximum(doc_lengths, 1)
        return raw_counts.astype(np.float64) / doc_lengths

    elif scheme == "log":
        # Logarithmic weighting: 1 + log(count) for count > 0
        result = np.zeros_like(raw_counts, dtype=np.float64)
        mask = raw_counts > 0
        result[mask] = 1 + np.log(raw_counts[mask])
        return result

    elif scheme == "augmented":
        # Augmented TF: prevents bias toward longer documents
        # formula: 0.5 + 0.5 * (count / max_count_in_doc)
        max_counts = raw_counts.max(axis=1, keepdims=True)
        max_counts = np.maximum(max_counts, 1)  # Avoid division by zero
        return 0.5 + 0.5 * (raw_counts.astype(np.float64) / max_counts)

    else:
        raise ValueError(f"Unknown weighting scheme: {scheme}")


# Demonstration
if __name__ == "__main__":
    # Example raw count matrix
    raw_counts = np.array([
        [3, 0, 2, 1, 5],  # Document 1
        [0, 1, 0, 4, 2],  # Document 2
        [1, 1, 1, 1, 1],  # Document 3
    ])

    print("Raw counts:")
    print(raw_counts)

    for scheme in ["raw", "binary", "tf", "log", "augmented"]:
        weighted = apply_weighting(raw_counts, scheme)
        print(f"\n{scheme.upper()} weighting:")
        print(np.round(weighted, 3))
```

The Bag of Words model makes a strong simplifying assumption: words are treated as statistically independent of each other. This assumption is mathematically convenient and computationally tractable, but it's also demonstrably false.
In reality, words are highly dependent: phrases like "hot dog" mean something different from their constituent words, negation ("not good") inverts the meaning of a nearby word, and syntactic structure determines who did what to whom ("the dog bit the man" versus "the man bit the dog").
Why make this assumption?
The independence assumption enables the Naive Bayes classifier, which assumes:
$$P(\text{class} | \text{document}) \propto P(\text{class}) \prod_{j=1}^{m} P(w_j | \text{class})^{count(w_j)}$$
This factorization reduces a complex joint distribution over millions of possible word combinations to simple per-word probabilities—a dramatic computational simplification.
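The following small sketch shows the factorization at work in log space. The class priors and per-word probabilities are made-up illustrative numbers, not values estimated from any corpus.

```python
import numpy as np

# Hypothetical model for two classes over a 4-term vocabulary (illustrative values)
vocab = ["free", "winner", "meeting", "report"]
log_prior = {"spam": np.log(0.4), "ham": np.log(0.6)}
log_likelihood = {
    "spam": np.log([0.40, 0.35, 0.15, 0.10]),
    "ham":  np.log([0.05, 0.05, 0.45, 0.45]),
}

# BoW counts for one document: "free winner free"
counts = np.array([2, 1, 0, 0])

# log P(class | doc) is (up to a constant) log P(class) + sum_j count_j * log P(t_j | class)
scores = {c: log_prior[c] + counts @ log_likelihood[c] for c in log_prior}
print(max(scores, key=scores.get))  # "spam"
```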
Despite its obviously wrong independence assumption, Naive Bayes with BoW features performs remarkably well on many text classification tasks. This success is partly explained by the 'explaining away' phenomenon and the robustness of discriminative predictions even when generative models are misspecified.
Consequences of the Independence Assumption:
Loss of phrase information: "hot dog" and "dog hot" produce identical BoW vectors, despite having completely different meanings.
No negation handling: "This movie is not good" will have the same representation as "This movie is good" (both contain "movie," "is," "good"), potentially confusing sentiment classifiers.
No syntactic structure: "The dog bit the man" and "The man bit the dog" are indistinguishable, though they describe opposite events.
Vocabulary explosion if using n-grams: We can partially recover word order with bigrams/trigrams (covered later), but vocabulary size explodes combinatorially.
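The first three consequences are easy to verify directly: a count-based vectorizer maps reordered sentences, and (once stop words are removed) negated sentences, to identical vectors. Here is a short sketch with scikit-learn's CountVectorizer, assuming its built-in English stop-word list (which includes "not") is used for the second pair.

```python
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# Word order is discarded: opposite events, identical vectors
v = CountVectorizer()
X = v.fit_transform(["The dog bit the man", "The man bit the dog"]).toarray()
print(np.array_equal(X[0], X[1]))  # True

# With stop-word removal, the negation disappears entirely
v2 = CountVectorizer(stop_words="english")
X2 = v2.fit_transform(["This movie is good", "This movie is not good"]).toarray()
print(np.array_equal(X2[0], X2[1]))  # True: "not" was filtered out
```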
When Independence Is Acceptable:
The independence assumption works well when the task depends mainly on what a document is about rather than on its precise phrasing (topic classification, spam filtering, document retrieval), when the discriminative vocabulary differs clearly between classes, and when documents are long enough for aggregate word statistics to provide a reliable signal.
Deploying BoW in production systems requires attention to several practical concerns that affect both correctness and performance.
Do:
- ✓ Fit vectorizer on training data only
- ✓ Use sparse matrices for storage
- ✓ Apply consistent preprocessing
- ✓ Document vocabulary size and coverage
- ✓ Monitor OOV rates in production
- ✓ Version-control your preprocessing pipeline

Don't:
- ✗ Fit on entire dataset (data leakage)
- ✗ Use dense matrices for large vocabularies
- ✗ Ignore preprocessing at inference
- ✗ Assume all words will be known
- ✗ Forget to lowercase consistently
- ✗ Mix different tokenization schemes
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
import joblib

# Sample corpus with labels
documents = [
    ("Machine learning models learn from data", "tech"),
    ("Deep neural networks have many layers", "tech"),
    ("The stock market crashed yesterday", "finance"),
    ("Investment portfolios need diversification", "finance"),
    ("Natural language processing is fascinating", "tech"),
    ("Bond yields are rising steadily", "finance"),
]

texts, labels = zip(*documents)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42
)

# Create pipeline: vectorizer + classifier
# IMPORTANT: Vectorizer fits ONLY on training data
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(
        lowercase=True,
        stop_words='english',
        min_df=1,
        max_df=0.95,
        max_features=10000
    )),
    ('classifier', MultinomialNB())
])

# Fit on training data
pipeline.fit(X_train, y_train)

# Evaluate on test data
accuracy = pipeline.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2%}")

# Inspect vocabulary
vectorizer = pipeline.named_steps['vectorizer']
print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")

# Serialize for production
joblib.dump(pipeline, 'text_classifier_pipeline.joblib')

# Later, in production:
# loaded_pipeline = joblib.load('text_classifier_pipeline.joblib')
# prediction = loaded_pipeline.predict(["New financial regulations announced"])
```

With the rise of deep learning and transformer-based language models like BERT and GPT, one might ask: Is Bag of Words still relevant?
The answer is a nuanced yes. While neural approaches dominate leaderboards, BoW remains valuable for several reasons:
| Criterion | Bag of Words | Neural/Transformer Models |
|---|---|---|
| Interpretability | Fully interpretable; can inspect which words drove predictions | Black box; requires techniques like attention visualization |
| Computational Cost | CPU-friendly; trains in seconds | Requires GPUs; training takes hours/days |
| Data Requirements | Works with hundreds of examples | Often needs thousands to millions of examples |
| Inference Latency | Sub-millisecond predictions | Typically 10-100ms+ for transformer inference |
| Semantic Understanding | No semantic similarity; 'happy' ≠ 'joyful' | Captures semantics; synonyms cluster together |
| Sequence Modeling | No word order; 'not good' ≈ 'good' | Full context; understands negation and structure |
When to Choose BoW:
Constrained environments: Edge devices, embedded systems, or contexts without GPU access benefit from BoW's minimal compute requirements.
Interpretability requirements: Regulated industries (healthcare, finance) often require explainable models. BoW + logistic regression provides direct feature attribution.
Rapid prototyping: BoW is the fastest path from raw text to working classifier. Start with BoW, establish a baseline, then upgrade if needed.
Limited labeled data: With only 100 labeled examples, a sophisticated neural model will likely overfit. BoW + simple classifiers generalize better from small data.
Real-time systems: When milliseconds matter (high-frequency trading, real-time moderation), BoW's speed is unmatched.
BoW as Baseline:
Professional ML practice always establishes a baseline before developing complex solutions. BoW + logistic regression is the standard text classification baseline. If you can't beat it, your sophisticated model isn't justified.
Before training a BERT model for text classification, train a CountVectorizer + LogisticRegression. If your neural model doesn't substantially outperform this baseline, you're paying GPU costs for no benefit. BoW remains the honest yardstick against which progress is measured.
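A minimal sketch of such a baseline, using a two-category slice of the 20 Newsgroups dataset purely as a stand-in (any labeled text corpus would do; the category choice here is an arbitrary assumption):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Two-category slice of 20 Newsgroups as a stand-in dataset (downloads on first run)
data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"],
                          remove=("headers", "footers", "quotes"))

# The standard text-classification baseline: counts + linear model
baseline = make_pipeline(
    CountVectorizer(lowercase=True, stop_words="english", max_features=20_000),
    LogisticRegression(max_iter=1000),
)

# Cross-validated accuracy: the number a fancier model has to beat
scores = cross_val_score(baseline, data.data, data.target, cv=5, scoring="accuracy")
print(f"BoW + LogisticRegression baseline: {scores.mean():.3f} (+/- {scores.std():.3f})")
```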
We've explored the Bag of Words model in depth—from intuition through mathematical formalization to production considerations. Let's consolidate the key insights:
- BoW represents each document as a vector of term counts over a shared vocabulary, discarding grammar and word order entirely.
- The resulting document-term matrix is high-dimensional but extremely sparse, so sparse storage is essential at scale.
- Weighting schemes (raw counts, binary, term frequency, log, augmented) adapt the representation to the task at hand.
- The word-independence assumption is demonstrably false, yet BoW features still power strong classifiers on many tasks.
- In practice: fit the vectorizer on training data only, keep preprocessing identical at inference time, and monitor out-of-vocabulary rates in production.
Looking Ahead:
The Bag of Words model provides the conceptual foundation for understanding all text feature engineering techniques. In the following pages, we'll build on this foundation: formalizing Term Frequency and TF-IDF weighting, partially recovering word order with n-grams, and handling the sparsity of high-dimensional document-term matrices.
Mastering these topics will give you the knowledge to choose the right text representation for any NLP task.
You now have a deep, rigorous understanding of the Bag of Words model—its mechanics, mathematics, assumptions, and practical deployment. This knowledge forms the bedrock of classical NLP and remains essential even in the age of transformers. Next, we'll formalize Term Frequency and its role in document representation.