Machine learning algorithms operate in the realm of numbers—vectors, matrices, and tensors that can be manipulated through mathematical operations like dot products, matrix multiplication, and gradient descent. Yet human language is inherently symbolic: words, sentences, and documents are discrete sequences of characters with no obvious numerical structure.
This creates the fundamental challenge of text feature engineering: How do we transform the rich, complex, and inherently unstructured nature of human language into numerical representations that preserve semantic meaning while being computationally tractable?
The Bag of Words (BoW) model stands as one of the most elegant and historically significant solutions to this problem. Despite its simplicity—or perhaps because of it—BoW has powered decades of text classification, sentiment analysis, spam detection, and information retrieval systems. Understanding BoW deeply is not merely an academic exercise; it is foundational knowledge that illuminates why more sophisticated techniques like TF-IDF, word embeddings, and transformer-based representations work the way they do.
By the end of this page, you will understand: (1) The theoretical foundations and intuition behind the Bag of Words model, (2) The mathematical formalization of document-term matrices, (3) The step-by-step process of constructing BoW representations, (4) Implementation considerations and computational tradeoffs, and (5) The assumptions BoW makes and when those assumptions are reasonable versus limiting.
Imagine you're sorting through a library of documents, trying to categorize them by topic. One simple but remarkably effective strategy would be to ignore how words are arranged and focus solely on which words appear and how often. A document about astronomy will likely contain words like "star," "galaxy," "telescope," and "orbit." A document about cooking will feature "recipe," "ingredient," "temperature," and "stir."
The Bag of Words model formalizes exactly this intuition. It represents each document as a multiset (or "bag") of its constituent words, discarding grammar, word order, and sentence structure entirely.
The term "bag" comes from mathematics: a bag (or multiset) is like a set, but it allows multiple instances of the same element. If a document contains the word "data" three times, the bag records three instances of "data"—unlike a set, which would record only one.
BoW operates on the hypothesis that word occurrence patterns are sufficient for many text classification tasks. By counting what words appear—without caring about their order—we capture the 'aboutness' of a document. A document containing 'machine', 'learning', 'algorithm', and 'training' probably concerns ML, regardless of how those words are arranged syntactically.
A worked example:
Consider two simple documents:
- Document 1: "The cat sat on the mat"
- Document 2: "The dog ran in the park"
Under the BoW model, we represent each document by counting word occurrences:
| Word | Document 1 | Document 2 |
|---|---|---|
| the | 2 | 2 |
| cat | 1 | 0 |
| sat | 1 | 0 |
| on | 1 | 0 |
| mat | 1 | 0 |
| dog | 0 | 1 |
| ran | 0 | 1 |
| in | 0 | 1 |
| park | 0 | 1 |
Each document becomes a vector of counts—a numerical representation we can feed to any machine learning algorithm.
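To make the counting step concrete, here is a minimal sketch using Python's collections.Counter; the two sentences are the same example documents as in the table above.

```python
from collections import Counter

doc1 = "the cat sat on the mat".split()
doc2 = "the dog ran in the park".split()

# Shared vocabulary: all unique words, in first-seen order
vocab = list(dict.fromkeys(doc1 + doc2))

counts1, counts2 = Counter(doc1), Counter(doc2)

# Each document becomes a vector of counts over the shared vocabulary
vec1 = [counts1[w] for w in vocab]  # [2, 1, 1, 1, 1, 0, 0, 0, 0]
vec2 = [counts2[w] for w in vocab]  # [2, 0, 0, 0, 0, 1, 1, 1, 1]
```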
The Bag of Words concept predates modern ML by decades. It emerged from information retrieval research in the 1950s-60s, particularly the work of Hans Peter Luhn at IBM. The model's longevity speaks to its fundamental soundness: when you want interpretable, fast, and reasonably effective text features, BoW remains a strong baseline even today.
Let us formalize the Bag of Words model mathematically, as precision here will clarify implementation and enable rigorous analysis.
Definitions:
Let D = {d₁, d₂, ..., dₙ} be a corpus (collection) of n documents.
Let V = {t₁, t₂, ..., t_m} be the vocabulary—the set of all unique terms (words) appearing across the entire corpus. The vocabulary size |V| = m.
For each document dᵢ, we define its BoW representation as a vector xᵢ ∈ ℕᵐ where:
xᵢ[j] = count(tⱼ, dᵢ) = number of times term tⱼ appears in document dᵢ
This yields a document-term matrix X ∈ ℕⁿˣᵐ whose i-th row is the BoW vector of document dᵢ; that is, Xᵢⱼ = count(tⱼ, dᵢ).
```python
import numpy as np
from collections import Counter
from typing import List, Dict, Tuple


def build_vocabulary(corpus: List[str]) -> Dict[str, int]:
    """
    Build vocabulary from corpus, mapping each unique term to an index.

    Args:
        corpus: List of documents (strings)

    Returns:
        Dictionary mapping term -> column index
    """
    vocabulary = {}
    idx = 0
    for document in corpus:
        # Tokenize: split on whitespace (simplified)
        tokens = document.lower().split()
        for token in tokens:
            if token not in vocabulary:
                vocabulary[token] = idx
                idx += 1
    return vocabulary


def create_bow_matrix(corpus: List[str], vocabulary: Dict[str, int]) -> np.ndarray:
    """
    Create the document-term matrix (Bag of Words representation).

    Args:
        corpus: List of documents
        vocabulary: Mapping from term to column index

    Returns:
        Document-term matrix of shape (n_documents, vocabulary_size)
    """
    n_docs = len(corpus)
    vocab_size = len(vocabulary)

    # Initialize matrix with zeros
    X = np.zeros((n_docs, vocab_size), dtype=np.int32)

    for doc_idx, document in enumerate(corpus):
        tokens = document.lower().split()
        # Count term frequencies in this document
        term_counts = Counter(tokens)
        for term, count in term_counts.items():
            if term in vocabulary:
                col_idx = vocabulary[term]
                X[doc_idx, col_idx] = count

    return X


# Example usage
corpus = [
    "The cat sat on the mat",
    "The dog ran in the park",
    "The cat and dog played together"
]

vocabulary = build_vocabulary(corpus)
print(f"Vocabulary ({len(vocabulary)} terms): {vocabulary}")

X = create_bow_matrix(corpus, vocabulary)
print(f"\nDocument-Term Matrix (shape {X.shape}):")
print(X)
```

The Vector Space Model:
The BoW representation situates documents in a high-dimensional vector space where each vocabulary term defines one axis (dimension) and each document is a point, or vector, whose j-th coordinate is the count of term tⱼ in that document.
This geometric interpretation is powerful: it allows us to use distance metrics (Euclidean distance, cosine similarity) to measure document similarity, enabling tasks like document clustering, nearest-neighbor search, and information retrieval.
Cosine similarity is particularly important:
$$\text{cosine\_similarity}(\mathbf{x}_i, \mathbf{x}_j) = \frac{\mathbf{x}_i \cdot \mathbf{x}_j}{\|\mathbf{x}_i\|\,\|\mathbf{x}_j\|}$$
Cosine similarity measures the angle between document vectors, making it invariant to document length—a crucial property since documents vary greatly in size.
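As a quick illustration, here is a small sketch of cosine similarity on BoW count vectors using plain NumPy (the two vectors reuse the worked example from earlier); note that scaling a vector by a constant, as a document repeated twice would, leaves the similarity unchanged.

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two BoW vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

doc_a = np.array([2, 1, 1, 1, 1, 0, 0, 0, 0])  # "the cat sat on the mat"
doc_b = np.array([2, 0, 0, 0, 0, 1, 1, 1, 1])  # "the dog ran in the park"

print(cosine_similarity(doc_a, doc_b))       # 0.5: overlap only on "the"
print(cosine_similarity(doc_a, 2 * doc_a))   # 1.0: invariant to length scaling
```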
A typical English corpus easily has a vocabulary of 50,000-100,000 unique terms. This means each document vector has 50,000-100,000 dimensions. While daunting, this high dimensionality is manageable because BoW vectors are extremely sparse—most entries are zero. We'll explore sparsity in detail later in this module.
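To get a feel for the savings, the following rough sketch (assuming SciPy is installed) compares dense float64 storage against compressed sparse row (CSR) storage for a synthetic matrix with about 0.1% non-zero entries; the exact figures are illustrative only.

```python
import numpy as np
from scipy import sparse

n_docs, vocab_size = 10_000, 50_000

# Simulate a BoW matrix where roughly 0.1% of entries are non-zero
X_sparse = sparse.random(n_docs, vocab_size, density=0.001,
                         format="csr", dtype=np.float64, random_state=0)

dense_bytes = n_docs * vocab_size * 8            # float64 dense storage
sparse_bytes = (X_sparse.data.nbytes
                + X_sparse.indices.nbytes
                + X_sparse.indptr.nbytes)        # CSR stores only non-zeros

print(f"Dense:  {dense_bytes / 1e6:.0f} MB")     # ~4000 MB
print(f"Sparse: {sparse_bytes / 1e6:.0f} MB")    # only a few MB
```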
Constructing a production-quality Bag of Words representation involves multiple preprocessing steps, each with design decisions that affect downstream performance. Let's examine each stage in detail.
```python
import re
import numpy as np
from collections import Counter
from typing import List, Dict, Set, Optional


class BagOfWordsVectorizer:
    """
    A production-quality Bag of Words vectorizer with full preprocessing pipeline.
    """

    # Common English stop words
    DEFAULT_STOP_WORDS: Set[str] = {
        'the', 'a', 'an', 'is', 'are', 'was', 'were', 'be', 'been', 'being',
        'have', 'has', 'had', 'do', 'does', 'did', 'will', 'would', 'could',
        'should', 'may', 'might', 'must', 'shall', 'can', 'need', 'dare',
        'ought', 'used', 'to', 'of', 'in', 'for', 'on', 'with', 'at', 'by',
        'from', 'as', 'into', 'through', 'during', 'before', 'after', 'above',
        'below', 'between', 'under', 'again', 'further', 'then', 'once',
        'here', 'there', 'when', 'where', 'why', 'how', 'all', 'each', 'few',
        'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only',
        'own', 'same', 'so', 'than', 'too', 'very', 'just', 'and', 'but',
        'if', 'or', 'because', 'until', 'while', 'although', 'though',
        'this', 'that', 'these', 'those', 'i', 'you', 'he', 'she', 'it',
        'we', 'they', 'what', 'which', 'who', 'whom', 'me', 'him', 'her',
        'us', 'them'
    }

    def __init__(
        self,
        lowercase: bool = True,
        remove_stop_words: bool = True,
        min_df: int = 1,
        max_df: float = 1.0,
        max_features: Optional[int] = None,
        binary: bool = False
    ):
        """
        Initialize the vectorizer.

        Args:
            lowercase: Convert all text to lowercase
            remove_stop_words: Remove common stop words
            min_df: Minimum document frequency (int: count, float: proportion)
            max_df: Maximum document frequency (int: count, float: proportion)
            max_features: Limit vocabulary to top N features by frequency
            binary: If True, use binary indicators instead of counts
        """
        self.lowercase = lowercase
        self.remove_stop_words = remove_stop_words
        self.min_df = min_df
        self.max_df = max_df
        self.max_features = max_features
        self.binary = binary

        self.vocabulary_: Optional[Dict[str, int]] = None
        self.feature_names_: Optional[List[str]] = None

    def _tokenize(self, text: str) -> List[str]:
        """
        Tokenize text into words. Removes punctuation and splits on whitespace.
        """
        # Convert to lowercase if specified
        if self.lowercase:
            text = text.lower()

        # Remove punctuation and split
        # Keep alphanumeric characters and spaces
        text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)
        tokens = text.split()

        # Remove stop words if specified
        if self.remove_stop_words:
            tokens = [t for t in tokens if t not in self.DEFAULT_STOP_WORDS]

        return tokens

    def fit(self, corpus: List[str]) -> 'BagOfWordsVectorizer':
        """
        Learn vocabulary from the corpus.
        """
        n_docs = len(corpus)

        # Count document frequencies for each term
        doc_freq: Counter = Counter()
        for document in corpus:
            tokens = set(self._tokenize(document))  # Unique tokens per doc
            doc_freq.update(tokens)

        # Apply min_df and max_df filters
        min_count = self.min_df if isinstance(self.min_df, int) else int(self.min_df * n_docs)
        max_count = self.max_df if isinstance(self.max_df, int) else int(self.max_df * n_docs)

        # Filter terms by document frequency
        valid_terms = {
            term for term, count in doc_freq.items()
            if min_count <= count <= max_count
        }

        # Count total occurrences for feature limitation
        term_freq: Counter = Counter()
        for document in corpus:
            tokens = self._tokenize(document)
            for token in tokens:
                if token in valid_terms:
                    term_freq[token] += 1

        # Select top features if max_features is specified
        if self.max_features is not None:
            selected_terms = [term for term, _ in term_freq.most_common(self.max_features)]
        else:
            selected_terms = list(valid_terms)

        # Sort terms alphabetically for consistency
        selected_terms = sorted(selected_terms)

        # Build vocabulary mapping
        self.vocabulary_ = {term: idx for idx, term in enumerate(selected_terms)}
        self.feature_names_ = selected_terms

        return self

    def transform(self, corpus: List[str]) -> np.ndarray:
        """
        Transform documents to document-term matrix.
        """
        if self.vocabulary_ is None:
            raise ValueError("Vectorizer not fitted. Call fit() first.")

        n_docs = len(corpus)
        vocab_size = len(self.vocabulary_)
        X = np.zeros((n_docs, vocab_size), dtype=np.int32)

        for doc_idx, document in enumerate(corpus):
            tokens = self._tokenize(document)
            term_counts = Counter(tokens)
            for term, count in term_counts.items():
                if term in self.vocabulary_:
                    col_idx = self.vocabulary_[term]
                    X[doc_idx, col_idx] = 1 if self.binary else count

        return X

    def fit_transform(self, corpus: List[str]) -> np.ndarray:
        """Fit vocabulary and transform in one step."""
        return self.fit(corpus).transform(corpus)

    def get_feature_names(self) -> List[str]:
        """Return the feature names (vocabulary terms)."""
        return self.feature_names_ or []


# Demonstration
if __name__ == "__main__":
    corpus = [
        "Machine learning is a subset of artificial intelligence.",
        "Deep learning uses neural networks with many layers.",
        "Natural language processing enables machines to understand text.",
        "Supervised learning requires labeled training data.",
        "Unsupervised learning discovers patterns without labels."
    ]

    vectorizer = BagOfWordsVectorizer(
        lowercase=True,
        remove_stop_words=True,
        min_df=1,
        max_df=1.0,
        binary=False
    )

    X = vectorizer.fit_transform(corpus)

    print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")
    print(f"Feature names: {vectorizer.get_feature_names()}")
    print(f"\nDocument-Term Matrix shape: {X.shape}")
    print(f"\nDocument-Term Matrix:\n{X}")
```

In practice, you'll typically use sklearn.feature_extraction.text.CountVectorizer, which implements this entire pipeline efficiently. Understanding the internals—as we've done here—helps you configure it correctly and debug unexpected behavior.
The basic BoW model uses raw term counts as feature values. However, alternative weighting schemes can improve performance for specific tasks. Understanding these schemes provides a foundation for more sophisticated approaches like TF-IDF (covered in the next module).
| Scheme | Formula | Use Case | Trade-offs |
|---|---|---|---|
| Raw Count | x[j] = count(tⱼ, d) | When word frequency directly indicates importance | Easily dominated by long documents; common words overweighted |
| Binary | x[j] = 1 if tⱼ ∈ d, else 0 | When presence matters more than frequency | Loses frequency information; treats all terms equally |
| Term Frequency (TF) | x[j] = count(tⱼ, d) / |d| | Length-normalized comparison | Still overweights common terms across corpus |
| Log Frequency | x[j] = 1 + log(count) if count > 0, else 0 | Dampens impact of repeated terms | Compresses count differences: two occurrences are weighted as much less than twice one occurrence |
| Boolean TF | x[j] = 1 if count > 0, else 0 | Same as binary; explicit boolean representation | Maximum information compression; loses nuance |
Choosing a Weighting Scheme:
The optimal weighting scheme depends on your task:
Spam detection: Binary encoding often works well. The presence of certain terms ("free," "winner," "click") strongly indicates spam, but whether they appear once or five times matters less.
Topic modeling: Raw counts or log frequencies work better because repeated occurrences of topic-related terms ("election," "candidate," "vote") genuinely indicate topicality.
Authorship attribution: Relative frequencies matter—different authors have different word usage patterns. Term frequency normalization helps compare authors who write at different lengths.
Semantic similarity: Log frequencies or TF-IDF (covered later) work best, as they balance local importance (within document) against global significance (across corpus).
```python
import numpy as np
from typing import Literal

WeightingScheme = Literal["raw", "binary", "tf", "log", "augmented"]


def apply_weighting(
    raw_counts: np.ndarray,
    scheme: WeightingScheme = "raw"
) -> np.ndarray:
    """
    Apply different weighting schemes to raw count matrix.

    Args:
        raw_counts: Document-term matrix with raw counts
        scheme: Weighting scheme to apply

    Returns:
        Weighted document-term matrix
    """
    if scheme == "raw":
        return raw_counts.astype(np.float64)

    elif scheme == "binary":
        return (raw_counts > 0).astype(np.float64)

    elif scheme == "tf":
        # Term Frequency: normalize by document length
        doc_lengths = raw_counts.sum(axis=1, keepdims=True)
        # Avoid division by zero for empty documents
        doc_lengths = np.maximum(doc_lengths, 1)
        return raw_counts.astype(np.float64) / doc_lengths

    elif scheme == "log":
        # Logarithmic weighting: 1 + log(count) for count > 0
        result = np.zeros_like(raw_counts, dtype=np.float64)
        mask = raw_counts > 0
        result[mask] = 1 + np.log(raw_counts[mask])
        return result

    elif scheme == "augmented":
        # Augmented TF: prevents bias toward longer documents
        # formula: 0.5 + 0.5 * (count / max_count_in_doc)
        max_counts = raw_counts.max(axis=1, keepdims=True)
        max_counts = np.maximum(max_counts, 1)  # Avoid division by zero
        return 0.5 + 0.5 * (raw_counts.astype(np.float64) / max_counts)

    else:
        raise ValueError(f"Unknown weighting scheme: {scheme}")


# Demonstration
if __name__ == "__main__":
    # Example raw count matrix
    raw_counts = np.array([
        [3, 0, 2, 1, 5],  # Document 1
        [0, 1, 0, 4, 2],  # Document 2
        [1, 1, 1, 1, 1],  # Document 3
    ])

    print("Raw counts:")
    print(raw_counts)

    for scheme in ["raw", "binary", "tf", "log", "augmented"]:
        weighted = apply_weighting(raw_counts, scheme)
        print(f"\n{scheme.upper()} weighting:")
        print(np.round(weighted, 3))
```

The Bag of Words model makes a strong simplifying assumption: words are treated as statistically independent of each other. This assumption is mathematically convenient and computationally tractable, but it's also demonstrably false.
In reality, words are highly dependent: phrases like "hot dog" mean something different from their constituent words, negation ("not good") inverts the meaning of a nearby word, and syntactic structure determines who did what to whom ("the dog bit the man" versus "the man bit the dog").
Why make this assumption?
The independence assumption enables the Naive Bayes classifier, which assumes:
$$P(\text{class} | \text{document}) \propto P(\text{class}) \prod_{j=1}^{m} P(w_j | \text{class})^{count(w_j)}$$
This factorization reduces a complex joint distribution over millions of possible word combinations to simple per-word probabilities—a dramatic computational simplification.
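The following small sketch shows the factorization at work in log space. The class priors and per-word probabilities are made-up illustrative numbers, not values estimated from any corpus.

```python
import numpy as np

# Hypothetical model for two classes over a 4-term vocabulary (illustrative values)
vocab = ["free", "winner", "meeting", "report"]
log_prior = {"spam": np.log(0.4), "ham": np.log(0.6)}
log_likelihood = {
    "spam": np.log([0.40, 0.35, 0.15, 0.10]),
    "ham":  np.log([0.05, 0.05, 0.45, 0.45]),
}

# BoW counts for one document: "free winner free"
counts = np.array([2, 1, 0, 0])

# log P(class | doc) is (up to a constant) log P(class) + sum_j count_j * log P(t_j | class)
scores = {c: log_prior[c] + counts @ log_likelihood[c] for c in log_prior}
print(max(scores, key=scores.get))  # "spam"
```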
Despite its obviously wrong independence assumption, Naive Bayes with BoW features performs remarkably well on many text classification tasks. This success is partly explained by the 'explaining away' phenomenon and the robustness of discriminative predictions even when generative models are misspecified.
Consequences of the Independence Assumption:
Loss of phrase information: "hot dog" and "dog hot" produce identical BoW vectors, despite having completely different meanings.
No negation handling: "This movie is not good" will have the same representation as "This movie is good" (both contain "movie," "is," "good"), potentially confusing sentiment classifiers.
No syntactic structure: "The dog bit the man" and "The man bit the dog" are indistinguishable, though they describe opposite events.
Vocabulary explosion if using n-grams: We can partially recover word order with bigrams/trigrams (covered later), but vocabulary size explodes combinatorially.
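The first three consequences are easy to verify directly: a count-based vectorizer maps reordered sentences, and (once stop words are removed) negated sentences, to identical vectors. Here is a short sketch with scikit-learn's CountVectorizer, assuming its built-in English stop-word list (which includes "not") is used for the second pair.

```python
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# Word order is discarded: opposite events, identical vectors
v = CountVectorizer()
X = v.fit_transform(["The dog bit the man", "The man bit the dog"]).toarray()
print(np.array_equal(X[0], X[1]))  # True

# With stop-word removal, the negation disappears entirely
v2 = CountVectorizer(stop_words="english")
X2 = v2.fit_transform(["This movie is good", "This movie is not good"]).toarray()
print(np.array_equal(X2[0], X2[1]))  # True: "not" was filtered out
```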
When Independence Is Acceptable:
The independence assumption works well when the task depends mainly on what a document is about rather than on its precise phrasing (topic classification, spam filtering, document retrieval), when the discriminative vocabulary differs clearly between classes, and when documents are long enough for aggregate word statistics to provide a reliable signal.
Deploying BoW in production systems requires attention to several practical concerns that affect both correctness and performance.
Do:
- ✓ Fit vectorizer on training data only
- ✓ Use sparse matrices for storage
- ✓ Apply consistent preprocessing
- ✓ Document vocabulary size and coverage
- ✓ Monitor OOV rates in production
- ✓ Version-control your preprocessing pipeline

Don't:
- ✗ Fit on entire dataset (data leakage)
- ✗ Use dense matrices for large vocabularies
- ✗ Ignore preprocessing at inference
- ✗ Assume all words will be known
- ✗ Forget to lowercase consistently
- ✗ Mix different tokenization schemes
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
import joblib

# Sample corpus with labels
documents = [
    ("Machine learning models learn from data", "tech"),
    ("Deep neural networks have many layers", "tech"),
    ("The stock market crashed yesterday", "finance"),
    ("Investment portfolios need diversification", "finance"),
    ("Natural language processing is fascinating", "tech"),
    ("Bond yields are rising steadily", "finance"),
]

texts, labels = zip(*documents)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42
)

# Create pipeline: vectorizer + classifier
# IMPORTANT: Vectorizer fits ONLY on training data
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(
        lowercase=True,
        stop_words='english',
        min_df=1,
        max_df=0.95,
        max_features=10000
    )),
    ('classifier', MultinomialNB())
])

# Fit on training data
pipeline.fit(X_train, y_train)

# Evaluate on test data
accuracy = pipeline.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2%}")

# Inspect vocabulary
vectorizer = pipeline.named_steps['vectorizer']
print(f"Vocabulary size: {len(vectorizer.vocabulary_)}")

# Serialize for production
joblib.dump(pipeline, 'text_classifier_pipeline.joblib')

# Later, in production:
# loaded_pipeline = joblib.load('text_classifier_pipeline.joblib')
# prediction = loaded_pipeline.predict(["New financial regulations announced"])
```

With the rise of deep learning and transformer-based language models like BERT and GPT, one might ask: Is Bag of Words still relevant?
The answer is a nuanced yes. While neural approaches dominate leaderboards, BoW remains valuable for several reasons:
| Criterion | Bag of Words | Neural/Transformer Models |
|---|---|---|
| Interpretability | Fully interpretable; can inspect which words drove predictions | Black box; requires techniques like attention visualization |
| Computational Cost | CPU-friendly; trains in seconds | Requires GPUs; training takes hours/days |
| Data Requirements | Works with hundreds of examples | Often needs thousands to millions of examples |
| Inference Latency | Sub-millisecond predictions | Typically 10-100ms+ for transformer inference |
| Semantic Understanding | No semantic similarity; 'happy' ≠ 'joyful' | Captures semantics; synonyms cluster together |
| Sequence Modeling | No word order; 'not good' ≈ 'good' | Full context; understands negation and structure |
When to Choose BoW:
Constrained environments: Edge devices, embedded systems, or contexts without GPU access benefit from BoW's minimal compute requirements.
Interpretability requirements: Regulated industries (healthcare, finance) often require explainable models. BoW + logistic regression provides direct feature attribution.
Rapid prototyping: BoW is the fastest path from raw text to working classifier. Start with BoW, establish a baseline, then upgrade if needed.
Limited labeled data: With only 100 labeled examples, a sophisticated neural model will likely overfit. BoW + simple classifiers generalize better from small data.
Real-time systems: When milliseconds matter (high-frequency trading, real-time moderation), BoW's speed is unmatched.
BoW as Baseline:
Professional ML practice always establishes a baseline before developing complex solutions. BoW + logistic regression is the standard text classification baseline. If you can't beat it, your sophisticated model isn't justified.
Before training a BERT model for text classification, train a CountVectorizer + LogisticRegression. If your neural model doesn't substantially outperform this baseline, you're paying GPU costs for no benefit. BoW remains the honest yardstick against which progress is measured.
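A minimal sketch of such a baseline, using a two-category slice of the 20 Newsgroups dataset purely as a stand-in (any labeled text corpus would do; the category choice here is an arbitrary assumption):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Two-category slice of 20 Newsgroups as a stand-in dataset (downloads on first run)
data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"],
                          remove=("headers", "footers", "quotes"))

# The standard text-classification baseline: counts + linear model
baseline = make_pipeline(
    CountVectorizer(lowercase=True, stop_words="english", max_features=20_000),
    LogisticRegression(max_iter=1000),
)

# Cross-validated accuracy: the number a fancier model has to beat
scores = cross_val_score(baseline, data.data, data.target, cv=5, scoring="accuracy")
print(f"BoW + LogisticRegression baseline: {scores.mean():.3f} (+/- {scores.std():.3f})")
```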
We've explored the Bag of Words model in depth—from intuition through mathematical formalization to production considerations. Let's consolidate the key insights:
- BoW represents each document as a vector of term counts over a shared vocabulary, discarding grammar and word order entirely.
- The resulting document-term matrix is high-dimensional but extremely sparse, so sparse storage is essential at scale.
- Weighting schemes (raw counts, binary, term frequency, log, augmented) adapt the representation to the task at hand.
- The word-independence assumption is demonstrably false, yet BoW features still power strong classifiers on many tasks.
- In practice: fit the vectorizer on training data only, keep preprocessing identical at inference time, and monitor out-of-vocabulary rates in production.
Looking Ahead:
The Bag of Words model provides the conceptual foundation for understanding all text feature engineering techniques. In the following pages, we'll build on this foundation: formalizing Term Frequency and TF-IDF weighting, partially recovering word order with n-grams, and handling the sparsity of high-dimensional document-term matrices.
Mastering these topics will give you the knowledge to choose the right text representation for any NLP task.
You now have a deep, rigorous understanding of the Bag of Words model—its mechanics, mathematics, assumptions, and practical deployment. This knowledge forms the bedrock of classical NLP and remains essential even in the age of transformers. Next, we'll formalize Term Frequency and its role in document representation.