Every day, billions of text documents flow through computational systems: emails sorted into spam and inbox, customer reviews categorized by sentiment, news articles routed to appropriate topics, and support tickets triaged by urgency. Behind these seemingly magical systems lies a surprisingly elegant mathematical framework that treats documents as probability distributions and classifies them using centuries-old principles discovered by Reverend Thomas Bayes.
Text classification is the task of automatically assigning predefined categories to text documents based on their content. Despite the advent of deep learning and transformer architectures, Naive Bayes classifiers remain remarkably effective for text classification—often matching or exceeding the performance of far more complex models while requiring orders of magnitude less computation and training data.
By the end of this page, you will understand the fundamental principles of text classification, how documents are represented mathematically for machine learning, the bag-of-words model and its variants, why Naive Bayes excels at text classification, and the practical considerations that make probabilistic text classifiers indispensable in production systems.
Text classification is a supervised learning problem where the goal is to learn a function that maps documents to categories. Given a training set of labeled documents, we seek to build a classifier that can accurately predict the category of previously unseen documents.
Formal Definition:
Let $\mathcal{D} = \{d_1, d_2, \ldots, d_n\}$ be a collection of documents and $\mathcal{C} = \{c_1, c_2, \ldots, c_k\}$ be a fixed set of categories. The text classification task is to learn a classifier $\gamma: \mathcal{D} \rightarrow \mathcal{C}$ that assigns the correct category to each document.
More formally, given a document $d$ represented as a feature vector $\mathbf{x}$, we want to find:
$$\hat{c} = \arg\max_{c \in \mathcal{C}} P(c | \mathbf{x})$$
This is where Bayes' theorem becomes essential. Rather than directly modeling $P(c|\mathbf{x})$—which would require understanding the complex conditional distributions of text given categories—we can apply Bayes' theorem to invert the problem:
$$P(c | \mathbf{x}) = \frac{P(\mathbf{x} | c) P(c)}{P(\mathbf{x})}$$
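To make the inversion concrete, here is a minimal numeric sketch (the priors and likelihoods are invented purely for illustration): the evidence $P(\mathbf{x})$ rescales the posteriors, but it never changes which class wins the argmax.

```python
# Hypothetical numbers for a single document's features x
priors = {"spam": 0.3, "ham": 0.7}           # P(c), assumed class priors
likelihoods = {"spam": 0.02, "ham": 0.001}   # P(x | c), assumed likelihoods

# Unnormalized posteriors: P(c | x) is proportional to P(x | c) P(c)
scores = {c: likelihoods[c] * priors[c] for c in priors}

# Dividing by the evidence P(x) normalizes the scores
# but leaves the winning class unchanged
evidence = sum(scores.values())
posteriors = {c: s / evidence for c, s in scores.items()}

print(posteriors)                   # roughly {'spam': 0.90, 'ham': 0.10}
print(max(scores, key=scores.get))  # 'spam'
```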
| Task | Categories | Example Applications | Key Challenges |
|---|---|---|---|
| Spam Detection | Binary (spam/ham) | Email filtering, SMS spam, comment moderation | Adversarial adaptation, evolving language |
| Sentiment Analysis | Binary/Multi-class | Product reviews, social media monitoring | Sarcasm, negation, domain specificity |
| Topic Classification | Multi-class | News categorization, document routing | Hierarchical topics, multi-label scenarios |
| Intent Detection | Multi-class | Chatbots, virtual assistants, IVR systems | Short texts, similar intents, context |
| Language Identification | Multi-class | Multilingual systems, content routing | Similar languages, code-switching |
| Author Attribution | Multi-class | Forensic linguistics, plagiarism detection | Style variation, limited training data |
While deep learning models like BERT and GPT achieve state-of-the-art results on many NLP benchmarks, Naive Bayes remains valuable for several reasons: (1) it requires far less training data—often performing well with just hundreds of examples, (2) training is extremely fast—seconds rather than hours, (3) it's highly interpretable—you can examine which words drive predictions, (4) it scales linearly with vocabulary size, and (5) it provides calibrated probability estimates out of the box. For many practical applications, especially those with limited data or computational resources, Naive Bayes is the optimal choice.
Before we can apply any machine learning algorithm to text, we must transform documents from variable-length sequences of characters into fixed-dimension numerical representations. The most fundamental approach is the bag-of-words (BoW) model, which represents each document as a vector of word frequencies.
The Bag-of-Words Assumption:
The bag-of-words model makes a critical simplifying assumption: the order of words in a document does not matter. A document is treated as an unordered collection (or "bag") of words, where only the frequency of each word is retained.
Consider the sentence: "The cat sat on the mat"
In the bag-of-words representation, this becomes the multiset of word counts {the: 2, cat: 1, sat: 1, on: 1, mat: 1}.
The same representation would apply to "The mat sat on the cat"—clearly a different meaning, but identical BoW vectors. Despite this loss of information, the bag-of-words model works remarkably well for text classification because category-relevant information is often conveyed by the presence and frequency of specific words rather than their precise arrangement.
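A quick check with Python's `Counter` confirms that the two sentences collapse to exactly the same bag of words:

```python
from collections import Counter

# Word order is discarded; only per-word counts survive
bow_1 = Counter("the cat sat on the mat".lower().split())
bow_2 = Counter("the mat sat on the cat".lower().split())

print(bow_1)           # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
print(bow_1 == bow_2)  # True: identical representations despite different meanings
```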
```python
import numpy as np
from collections import Counter
from typing import List, Dict, Tuple


class BagOfWordsVectorizer:
    """
    A complete implementation of Bag-of-Words vectorization.

    This class transforms text documents into numerical feature
    vectors using word frequency counts.
    """

    def __init__(self, min_df: int = 1, max_df: float = 1.0):
        """
        Initialize the vectorizer.

        Args:
            min_df: Minimum document frequency for a word to be included
            max_df: Maximum document frequency (as fraction) for inclusion
        """
        self.vocabulary_: Dict[str, int] = {}
        self.idf_: np.ndarray = None
        self.min_df = min_df
        self.max_df = max_df

    def _tokenize(self, document: str) -> List[str]:
        """
        Simple whitespace tokenization with lowercasing.
        In production, you'd use more sophisticated tokenization.
        """
        # Convert to lowercase and split on whitespace
        tokens = document.lower().split()
        # Remove punctuation (simplified)
        tokens = [token.strip('.,!?;:"\'()[]{}') for token in tokens]
        # Remove empty tokens
        return [token for token in tokens if token]

    def fit(self, documents: List[str]) -> 'BagOfWordsVectorizer':
        """
        Learn vocabulary from training documents.

        Args:
            documents: List of text documents

        Returns:
            self for method chaining
        """
        n_docs = len(documents)

        # Count document frequency for each word
        doc_freq = Counter()
        for doc in documents:
            tokens = set(self._tokenize(doc))  # Unique tokens per doc
            doc_freq.update(tokens)

        # Filter by document frequency thresholds, then assign
        # contiguous indices to the words that survive the filter
        max_count = int(self.max_df * n_docs)
        kept_words = [
            word for word, count in doc_freq.items()
            if self.min_df <= count <= max_count
        ]
        self.vocabulary_ = {word: idx for idx, word in enumerate(kept_words)}

        # Compute IDF weights for TF-IDF (optional enhancement)
        self.idf_ = np.array([
            np.log((n_docs + 1) / (doc_freq[word] + 1)) + 1
            for word in self.vocabulary_
        ])

        return self

    def transform(self, documents: List[str]) -> np.ndarray:
        """
        Transform documents to BoW feature vectors.

        Args:
            documents: List of text documents

        Returns:
            Document-term matrix of shape (n_docs, n_features)
        """
        n_docs = len(documents)
        n_features = len(self.vocabulary_)

        # Initialize sparse-like dense matrix (for simplicity)
        X = np.zeros((n_docs, n_features), dtype=np.float64)

        for i, doc in enumerate(documents):
            tokens = self._tokenize(doc)
            token_counts = Counter(tokens)
            for token, count in token_counts.items():
                if token in self.vocabulary_:
                    j = self.vocabulary_[token]
                    X[i, j] = count

        return X

    def fit_transform(self, documents: List[str]) -> np.ndarray:
        """Fit and transform in one step."""
        return self.fit(documents).transform(documents)


# Example usage demonstrating the complete pipeline
documents = [
    "The quick brown fox jumps over the lazy dog",
    "The lazy dog sleeps all day",
    "Quick brown foxes are faster than lazy dogs",
    "The dog chased the cat across the yard"
]

vectorizer = BagOfWordsVectorizer(min_df=1)
X = vectorizer.fit_transform(documents)

print("Vocabulary:")
for word, idx in sorted(vectorizer.vocabulary_.items(), key=lambda x: x[1]):
    print(f"  {word}: index {idx}")

print(f"\nDocument-term matrix shape: {X.shape}")
print(f"\nFirst document vector (non-zero entries):")
for word, idx in vectorizer.vocabulary_.items():
    if X[0, idx] > 0:
        print(f"  {word}: {X[0, idx]}")
```

Vocabulary Construction:
The vocabulary $V$ is the set of all unique words (or tokens) that appear in the training corpus. Each document is then represented as a vector $\mathbf{x} \in \mathbb{R}^{|V|}$, where $x_i$ represents the count (or presence) of word $i$ in the document.
For a vocabulary of size $|V| = 10,000$ words, each document becomes a 10,000-dimensional vector. Most entries are zero—document vectors are sparse—because any given document uses only a small fraction of the total vocabulary.
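A toy illustration of this sparsity (the vocabulary size and distinct-word count are assumptions chosen for the example):

```python
import numpy as np

# A document using 120 distinct words fills at most 120 of the
# 10,000 entries in its bag-of-words vector
vocab_size = 10_000
x = np.zeros(vocab_size)
rng = np.random.default_rng(0)
x[rng.choice(vocab_size, size=120, replace=False)] = 1  # stand-in counts

sparsity = 1 - np.count_nonzero(x) / vocab_size
print(f"Non-zero entries: {np.count_nonzero(x)}, sparsity: {sparsity:.1%}")  # 98.8%
```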
Key Design Decisions in BoW:
Building a BoW representation involves several choices visible in the code above: how to tokenize and normalize text (lowercasing, punctuation stripping, stopword removal), which words to keep in the vocabulary (minimum and maximum document-frequency thresholds such as min_df and max_df), and how to weight each word (raw counts, binary presence, or TF-IDF), which is the subject of the next section.
The raw bag-of-words representation uses word counts as features, but several variants exist that can improve classification performance. Understanding these representations is crucial because they connect directly to different Naive Bayes model variants.
Three Fundamental Representations:
| Representation | Feature Value | Mathematical Form | Best Used For |
|---|---|---|---|
| Term Frequency (TF) | Raw word count | $tf(w, d) = count(w, d)$ | Multinomial Naive Bayes, when word frequency matters |
| Binary | Word presence (0/1) | $x_w = \mathbb{1}[w \in d]$ | Bernoulli Naive Bayes, short documents |
| TF-IDF | Frequency × rarity | $tfidf(w, d) = tf(w, d) \cdot \log\frac{N}{df(w)}$ | When rare words are more informative |
Term Frequency (TF) — The Multinomial Approach:
The simplest approach counts how many times each word appears in each document. If document $d$ contains the word "machine" 5 times, then $tf(\text{machine}, d) = 5$.
This representation assumes that word frequency conveys information—a document mentioning "profit" ten times is more likely to be about business than one mentioning it once. This is the foundation of the Multinomial Naive Bayes model.
Binary (Presence/Absence) — The Bernoulli Approach:
An alternative representation simply records whether each word is present or absent:
$$x_w = \begin{cases} 1 & \text{if word } w \text{ appears in document } d \\ 0 & \text{otherwise} \end{cases}$$
This representation ignores word frequency, treating a word that appears once identically to one appearing a hundred times. Surprisingly, this often works well for short documents where word repetition is rare. This is the foundation of the Bernoulli Naive Bayes model.
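A small sketch contrasting the two feature types on a made-up document:

```python
from collections import Counter

doc = "free free free prize prize winner"
counts = Counter(doc.split())

# Multinomial-style features keep the counts;
# Bernoulli-style features record only presence or absence
tf_features = dict(counts)                # {'free': 3, 'prize': 2, 'winner': 1}
binary_features = {w: 1 for w in counts}  # {'free': 1, 'prize': 1, 'winner': 1}

print(tf_features)
print(binary_features)
```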
TF-IDF — Inverse Document Frequency Weighting:
TF-IDF combines term frequency with a measure of how rare a word is across the corpus:
$$\text{tfidf}(w, d) = tf(w, d) \times \log\frac{N}{df(w)}$$
Where $df(w)$ is the document frequency (number of documents containing word $w$) and $N$ is the total number of documents.
Intuition: Words that appear frequently in one document but rarely across the corpus are likely to be highly discriminative. The word "the" might appear 50 times, but it appears in every document, so $\log(N/df) \approx 0$ and its TF-IDF weight is low. Conversely, a technical term like "mitochondria" might appear only in biology documents, giving it a high IDF weight.
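The arithmetic behind this intuition, using assumed document frequencies over a hypothetical corpus of $N = 1000$ documents:

```python
import math

N = 1000                                # assumed corpus size
df = {"the": 1000, "mitochondria": 5}   # assumed document frequencies

for word, d in df.items():
    idf = math.log(N / d)
    print(f"{word:15s} df={d:4d}  idf={idf:.2f}")

# 'the': idf = log(1000/1000) = 0.00, so even 50 occurrences contribute nothing
# 'mitochondria': idf = log(1000/5) = 5.30, so a single occurrence carries real weight
```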
```python
import numpy as np
from collections import Counter
from typing import List, Dict


class TFIDFVectorizer:
    """
    TF-IDF Vectorizer with multiple normalization options.

    Demonstrates the mathematical transformation from raw counts
    to weighted representations that emphasize discriminative words.
    """

    def __init__(self, norm: str = 'l2', use_idf: bool = True,
                 smooth_idf: bool = True, sublinear_tf: bool = False):
        """
        Args:
            norm: Normalization method ('l1', 'l2', or None)
            use_idf: Whether to apply IDF weighting
            smooth_idf: Add 1 to document frequencies (prevents division by zero)
            sublinear_tf: Apply sublinear TF scaling (1 + log(tf))
        """
        self.norm = norm
        self.use_idf = use_idf
        self.smooth_idf = smooth_idf
        self.sublinear_tf = sublinear_tf
        self.vocabulary_: Dict[str, int] = {}
        self.idf_: np.ndarray = None

    def _tokenize(self, text: str) -> List[str]:
        """Simple tokenization."""
        return text.lower().split()

    def fit(self, documents: List[str]) -> 'TFIDFVectorizer':
        """Learn vocabulary and IDF weights from documents."""
        n_docs = len(documents)

        # Build vocabulary and count document frequencies
        doc_freq = Counter()
        word_to_idx = {}
        idx = 0
        for doc in documents:
            unique_tokens = set(self._tokenize(doc))
            for token in unique_tokens:
                if token not in word_to_idx:
                    word_to_idx[token] = idx
                    idx += 1
            doc_freq.update(unique_tokens)

        self.vocabulary_ = word_to_idx
        n_features = len(word_to_idx)

        # Compute IDF weights
        if self.use_idf:
            # idf(t) = log[(N + 1) / (df(t) + 1)] + 1   (smoothed)
            # or idf(t) = log[N / df(t)] + 1            (unsmoothed)
            df = np.ones(n_features)  # Default 1 for smoothing
            for word, count in doc_freq.items():
                df[word_to_idx[word]] = count

            if self.smooth_idf:
                self.idf_ = np.log((n_docs + 1) / (df + 1)) + 1
            else:
                self.idf_ = np.log(n_docs / df) + 1
        else:
            self.idf_ = np.ones(n_features)

        return self

    def transform(self, documents: List[str]) -> np.ndarray:
        """Transform documents to TF-IDF representation."""
        n_docs = len(documents)
        n_features = len(self.vocabulary_)

        # Compute term frequencies
        X = np.zeros((n_docs, n_features))
        for i, doc in enumerate(documents):
            tokens = self._tokenize(doc)
            counts = Counter(tokens)
            for token, count in counts.items():
                if token in self.vocabulary_:
                    j = self.vocabulary_[token]
                    if self.sublinear_tf:
                        # Sublinear scaling: 1 + log(tf) if tf > 0
                        X[i, j] = 1 + np.log(count) if count > 0 else 0
                    else:
                        X[i, j] = count

        # Apply IDF weighting
        X = X * self.idf_

        # Apply normalization
        if self.norm == 'l2':
            # L2 normalization: divide each row by its L2 norm
            norms = np.sqrt(np.sum(X ** 2, axis=1, keepdims=True))
            norms[norms == 0] = 1  # Prevent division by zero
            X = X / norms
        elif self.norm == 'l1':
            # L1 normalization: divide each row by its L1 norm
            norms = np.sum(np.abs(X), axis=1, keepdims=True)
            norms[norms == 0] = 1
            X = X / norms

        return X

    def fit_transform(self, documents: List[str]) -> np.ndarray:
        """Fit and transform in one step."""
        return self.fit(documents).transform(documents)


# Demonstrate how TF-IDF distinguishes discriminative words
documents = [
    "the cat sat on the mat",   # Common words dominate
    "the dog sat on the rug",   # Similar structure
    "machine learning is transforming artificial intelligence",
    "deep learning neural networks artificial intelligence"
]

vectorizer = TFIDFVectorizer(norm='l2', smooth_idf=True)
X = vectorizer.fit_transform(documents)

print("TF-IDF Analysis:")
print("-" * 50)
for word, idx in sorted(vectorizer.vocabulary_.items(),
                        key=lambda x: vectorizer.idf_[x[1]],
                        reverse=True)[:10]:
    print(f"  {word:20s} IDF: {vectorizer.idf_[idx]:.4f}")

print(f"\nDocument similarity matrix (cosine similarity):")
similarity = X @ X.T  # Cosine similarity (vectors are L2-normalized)
print(similarity.round(3))
```

Given the simplicity of the Naive Bayes algorithm and its strong independence assumption—which is clearly violated in natural language—it's remarkable that it works so well for text classification. Understanding why requires examining both the nature of text data and the properties of the classification task.
The Independence Assumption Revisited:
Naive Bayes assumes that features (words) are conditionally independent given the class:
$$P(\mathbf{x} | c) = P(x_1, x_2, \ldots, x_n | c) = \prod_{i=1}^{n} P(x_i | c)$$
This assumption is clearly false for natural language. The word "learning" is far more likely to appear if "machine" has already appeared. The word "intelligence" correlates strongly with "artificial." Word dependencies are the very fabric of language.
So why does Naive Bayes still work?
Pedro Domingos and Michael Pazzani's seminal 1997 paper 'On the Optimality of the Simple Bayesian Classifier under Zero-One Loss' showed that the independence assumption does not need to hold for Naive Bayes to make the optimal decision. The classifier only needs the correct class to receive the highest posterior score; the probability estimates themselves can be poorly calibrated. This theoretical insight explains why Naive Bayes often outperforms sophisticated methods that explicitly model feature dependencies.
Practical Advantages for Text:
Beyond the theoretical justification, Naive Bayes offers compelling practical advantages for text classification:
| Advantage | Why It Matters | Comparison to Alternatives |
|---|---|---|
| Linear Training Time | Training requires a single pass through the data, counting word occurrences per class | SVMs: O(n²) to O(n³); Neural networks: multiple epochs over data |
| Works with Small Data | Good performance with hundreds of labeled examples due to strong generative assumptions | Deep learning often requires 10,000+ examples for NLP tasks |
| No Hyperparameter Tuning | Only smoothing parameter to set (Laplace correction α, typically 1) | Neural networks have learning rate, architecture, regularization, etc. |
| Incremental Learning | Can update model with new documents without full retraining | Most discriminative models require complete retraining |
| Interpretable | P(word|class) directly shows which words drive predictions | Neural network decisions are opaque without specialized techniques |
| Handles High Dimensions | Works naturally with vocabularies of 100,000+ words | Many algorithms struggle with such high dimensionality |
Naive Bayes for text classification takes a generative approach: rather than directly modeling which category a document belongs to, it models how documents are generated from each category. This probabilistic story provides both intuition and a framework for deriving concrete algorithms.
The Generative Story:
Imagine you want to write a document about machine learning. According to the Naive Bayes generative model, you would first select a category $c$ according to the class prior $P(c)$, and then generate the document's words from the class-conditional distribution $P(w \mid c)$.
For the Multinomial model, you generate each word position by sampling from the multinomial distribution over the vocabulary. For the Bernoulli model, you flip a coin for each vocabulary word to decide whether to include it.
The Classification Decision:
Given an observed document $d$, we apply Bayes' theorem to compute the posterior probability of each class:
$$P(c | d) = \frac{P(d | c) P(c)}{P(d)} \propto P(d | c) P(c)$$
Since $P(d)$ is constant across all classes (it's the same document!), we can ignore it for classification purposes. We classify to the class that maximizes:
$$\hat{c} = \arg\max_{c} P(c) P(d | c)$$
```python
import numpy as np
from collections import defaultdict
from typing import List, Dict, Tuple


class GenerativeTextModel:
    """
    Illustrates the generative model perspective of Naive Bayes.

    This class shows how we can:
    1. Estimate P(c) and P(w|c) from training data
    2. Generate synthetic documents from learned distributions
    3. Classify new documents using Bayes' rule

    This demonstrates that classification and generation are
    two sides of the same probabilistic model.
    """

    def __init__(self, smoothing: float = 1.0):
        self.smoothing = smoothing
        self.class_priors: Dict[str, float] = {}
        self.word_probs: Dict[str, Dict[str, float]] = {}
        self.vocabulary: List[str] = []
        self.classes: List[str] = []

    def fit(self, documents: List[str], labels: List[str]):
        """Estimate model parameters from labeled documents."""
        # Count class frequencies
        class_counts = defaultdict(int)
        for label in labels:
            class_counts[label] += 1

        n_docs = len(labels)
        self.classes = list(set(labels))

        # Estimate class priors: P(c) = count(c) / N
        self.class_priors = {
            c: count / n_docs for c, count in class_counts.items()
        }

        # Count word occurrences per class
        word_counts_per_class = defaultdict(lambda: defaultdict(int))
        all_words = set()

        for doc, label in zip(documents, labels):
            words = doc.lower().split()
            all_words.update(words)
            for word in words:
                word_counts_per_class[label][word] += 1

        self.vocabulary = sorted(all_words)
        vocab_size = len(self.vocabulary)

        # Estimate word probabilities: P(w|c)
        # Using Laplace smoothing: (count(w,c) + α) / (Σ count(w',c) + α|V|)
        self.word_probs = {}
        for c in self.classes:
            total_words = sum(word_counts_per_class[c].values())
            self.word_probs[c] = {
                word: (word_counts_per_class[c][word] + self.smoothing) /
                      (total_words + self.smoothing * vocab_size)
                for word in self.vocabulary
            }

        return self

    def generate_document(self, class_label: str, length: int = 20) -> str:
        """
        Generate a document from the learned distribution.
        This demonstrates the 'generative' aspect of the model.
        We sample words according to P(w|c).
        """
        if class_label not in self.classes:
            raise ValueError(f"Unknown class: {class_label}")

        # Sample words according to P(w|class_label)
        words = list(self.word_probs[class_label].keys())
        probs = list(self.word_probs[class_label].values())
        probs = np.array(probs) / np.sum(probs)  # Ensure normalized

        sampled_words = np.random.choice(words, size=length, p=probs)
        return ' '.join(sampled_words)

    def classify(self, document: str) -> Tuple[str, Dict[str, float]]:
        """
        Classify a document using Bayes' rule.
        Returns the predicted class and log-probabilities
        for all classes.
        """
        words = document.lower().split()
        log_posteriors = {}

        for c in self.classes:
            # Start with log prior
            log_prob = np.log(self.class_priors[c])

            # Add log-likelihood of each word
            for word in words:
                if word in self.word_probs[c]:
                    log_prob += np.log(self.word_probs[c][word])
                else:
                    # Handle unknown words with minimal probability
                    log_prob += np.log(
                        self.smoothing /
                        (sum(self.word_probs[c].values()) * len(self.vocabulary))
                    )

            log_posteriors[c] = log_prob

        # Convert to normalized probabilities for interpretation
        max_log = max(log_posteriors.values())
        posteriors = {
            c: np.exp(lp - max_log) for c, lp in log_posteriors.items()
        }
        total = sum(posteriors.values())
        posteriors = {c: p / total for c, p in posteriors.items()}

        predicted_class = max(log_posteriors, key=log_posteriors.get)
        return predicted_class, posteriors


# Demonstration: Training, Generation, and Classification
training_docs = [
    ("great product love it highly recommend", "positive"),
    ("excellent quality best purchase ever", "positive"),
    ("amazing value wonderful experience fantastic", "positive"),
    ("terrible waste of money avoid", "negative"),
    ("horrible quality worst experience ever", "negative"),
    ("disappointing product poor service bad", "negative"),
]

documents = [doc for doc, _ in training_docs]
labels = [label for _, label in training_docs]

model = GenerativeTextModel(smoothing=1.0)
model.fit(documents, labels)

print("Learned Class Priors:")
for c, prior in model.class_priors.items():
    print(f"  P({c}) = {prior:.4f}")

print("\nTop Words per Class:")
for c in model.classes:
    top_words = sorted(model.word_probs[c].items(),
                       key=lambda x: x[1], reverse=True)[:5]
    print(f"  {c}: {', '.join(w for w, _ in top_words)}")

print("\nGenerating Sample Documents:")
for c in model.classes:
    synthetic = model.generate_document(c, length=8)
    print(f"  {c}: '{synthetic}'")

print("\nClassifying New Documents:")
test_docs = [
    "this is an amazing product",
    "terrible quality very disappointed",
    "good value but could be better"
]
for doc in test_docs:
    pred, probs = model.classify(doc)
    print(f"  '{doc[:40]}...'")
    print(f"    -> {pred} (positive: {probs.get('positive', 0):.3f}, "
          f"negative: {probs.get('negative', 0):.3f})")
```

When computing P(d|c), we multiply probabilities for each word: P(w₁|c) × P(w₂|c) × ... × P(wₙ|c). For documents with hundreds of words, this product of small probabilities underflows to zero. The solution is to work in log-space: log P(d|c) = Σᵢ log P(wᵢ|c). Addition in log-space corresponds to multiplication in probability space, and the logarithm of a small probability remains numerically stable.
Building an effective text classifier requires more than just implementing the Naive Bayes algorithm. The preprocessing and feature engineering decisions often have as much impact as the choice of model. Here we present a complete pipeline from raw text to predictions.
The Pipeline Stages:
As the implementation below illustrates, the pipeline moves from cleaning raw text (stripping HTML and URLs, normalizing unicode), through tokenization and normalization (lowercasing, punctuation and stopword removal, length filtering), to vocabulary construction and per-class counting, and finally to prediction via sums of log-probabilities.
```python
import re
import string
import numpy as np
from collections import Counter, defaultdict
from typing import List, Dict, Tuple, Optional


class TextPreprocessor:
    """
    Complete text preprocessing pipeline.

    Handles the messy reality of text data: HTML, unicode,
    punctuation, case normalization, and more.
    """

    # Common English stopwords
    STOPWORDS = {
        'the', 'a', 'an', 'is', 'are', 'was', 'were', 'be', 'been', 'being',
        'have', 'has', 'had', 'do', 'does', 'did', 'will', 'would', 'could',
        'should', 'may', 'might', 'must', 'shall', 'can', 'need', 'dare',
        'ought', 'used', 'to', 'of', 'in', 'for', 'on', 'with', 'at', 'by',
        'from', 'as', 'into', 'through', 'during', 'before', 'after', 'above',
        'below', 'between', 'under', 'again', 'further', 'then', 'once',
        'here', 'there', 'when', 'where', 'why', 'how', 'all', 'each', 'few',
        'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only',
        'own', 'same', 'so', 'than', 'too', 'very', 'just', 'and', 'but',
        'if', 'or', 'because', 'until', 'while', 'this', 'that', 'these',
        'those', 'am', 'it', 'its', 'itself', 'i', 'me', 'my', 'myself',
        'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself',
        'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself',
        'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who'
    }

    def __init__(self, lowercase: bool = True,
                 remove_stopwords: bool = True,
                 remove_punctuation: bool = True,
                 remove_numbers: bool = False,
                 min_word_length: int = 2):
        self.lowercase = lowercase
        self.remove_stopwords = remove_stopwords
        self.remove_punctuation = remove_punctuation
        self.remove_numbers = remove_numbers
        self.min_word_length = min_word_length

    def clean_html(self, text: str) -> str:
        """Remove HTML tags."""
        return re.sub(r'<[^>]+>', ' ', text)

    def clean_urls(self, text: str) -> str:
        """Remove URLs."""
        return re.sub(r'http\S+|www\S+', ' ', text)

    def normalize_unicode(self, text: str) -> str:
        """Normalize unicode characters."""
        # Replace common unicode characters with ASCII equivalents
        replacements = {
            '“': '"', '”': '"', '‘': "'", '’': "'",
            '—': '-', '–': '-', '…': '...'
        }
        for unicode_char, ascii_char in replacements.items():
            text = text.replace(unicode_char, ascii_char)
        return text

    def process(self, text: str) -> List[str]:
        """Full preprocessing pipeline."""
        # Clean
        text = self.clean_html(text)
        text = self.clean_urls(text)
        text = self.normalize_unicode(text)

        # Lowercase
        if self.lowercase:
            text = text.lower()

        # Tokenize (simple whitespace + punctuation splitting)
        if self.remove_punctuation:
            text = text.translate(str.maketrans('', '', string.punctuation))

        tokens = text.split()

        # Filter
        if self.remove_numbers:
            tokens = [t for t in tokens if not t.isdigit()]
        if self.remove_stopwords:
            tokens = [t for t in tokens if t not in self.STOPWORDS]
        tokens = [t for t in tokens if len(t) >= self.min_word_length]

        return tokens


class NaiveBayesTextClassifier:
    """
    Complete Naive Bayes text classifier with preprocessing.

    Supports both Multinomial and Bernoulli variants.
    """

    def __init__(self, model_type: str = 'multinomial',
                 smoothing: float = 1.0,
                 preprocessor: Optional[TextPreprocessor] = None):
        """
        Args:
            model_type: 'multinomial' or 'bernoulli'
            smoothing: Laplace smoothing parameter
            preprocessor: TextPreprocessor instance (creates default if None)
        """
        self.model_type = model_type
        self.smoothing = smoothing
        self.preprocessor = preprocessor or TextPreprocessor()
        self.classes_: List[str] = []
        self.class_log_prior_: Dict[str, float] = {}
        self.feature_log_prob_: Dict[str, Dict[str, float]] = {}
        self.vocabulary_: Dict[str, int] = {}

    def fit(self, documents: List[str], labels: List[str]) -> 'NaiveBayesTextClassifier':
        """Fit the model on labeled documents."""
        # Preprocess all documents
        tokenized_docs = [self.preprocessor.process(doc) for doc in documents]

        # Build vocabulary
        all_words = set()
        for tokens in tokenized_docs:
            all_words.update(tokens)
        self.vocabulary_ = {word: idx for idx, word in enumerate(sorted(all_words))}
        vocab_size = len(self.vocabulary_)

        # Count classes
        self.classes_ = list(set(labels))
        class_counts = Counter(labels)
        n_docs = len(labels)

        # Compute log priors
        self.class_log_prior_ = {
            c: np.log(count / n_docs) for c, count in class_counts.items()
        }

        # Compute feature probabilities per class
        self.feature_log_prob_ = {}

        for c in self.classes_:
            # Get documents in this class
            class_docs = [
                tokens for tokens, label in zip(tokenized_docs, labels)
                if label == c
            ]

            if self.model_type == 'multinomial':
                # Count total word occurrences
                word_counts = Counter()
                for tokens in class_docs:
                    word_counts.update(tokens)
                total_words = sum(word_counts.values())

                # P(w|c) = (count(w,c) + α) / (Σcount + α|V|)
                self.feature_log_prob_[c] = {
                    word: np.log(
                        (word_counts.get(word, 0) + self.smoothing) /
                        (total_words + self.smoothing * vocab_size)
                    )
                    for word in self.vocabulary_
                }
            else:  # bernoulli
                # Count documents containing each word
                doc_word_counts = Counter()
                for tokens in class_docs:
                    doc_word_counts.update(set(tokens))  # Unique per doc
                n_class_docs = len(class_docs)

                # P(w|c) = (count_docs_with_w + α) / (n_docs_in_class + 2α)
                self.feature_log_prob_[c] = {}
                for word in self.vocabulary_:
                    p_word = ((doc_word_counts.get(word, 0) + self.smoothing) /
                              (n_class_docs + 2 * self.smoothing))
                    self.feature_log_prob_[c][word] = (
                        np.log(p_word),      # log P(w=1|c)
                        np.log(1 - p_word)   # log P(w=0|c)
                    )

        return self

    def predict_log_proba(self, document: str) -> Dict[str, float]:
        """Compute log probabilities for each class."""
        tokens = self.preprocessor.process(document)
        log_probs = {}

        for c in self.classes_:
            log_prob = self.class_log_prior_[c]

            if self.model_type == 'multinomial':
                # Sum log probabilities of observed words
                for token in tokens:
                    if token in self.feature_log_prob_[c]:
                        log_prob += self.feature_log_prob_[c][token]
            else:  # bernoulli
                # Consider presence AND absence of vocabulary words
                observed_words = set(tokens)
                for word in self.vocabulary_:
                    log_p_present, log_p_absent = self.feature_log_prob_[c][word]
                    if word in observed_words:
                        log_prob += log_p_present
                    else:
                        log_prob += log_p_absent

            log_probs[c] = log_prob

        return log_probs

    def predict(self, document: str) -> str:
        """Predict the class of a document."""
        log_probs = self.predict_log_proba(document)
        return max(log_probs, key=log_probs.get)

    def predict_batch(self, documents: List[str]) -> List[str]:
        """Predict classes for multiple documents."""
        return [self.predict(doc) for doc in documents]


# Example: Spam Classification
print("=" * 60)
print("SPAM CLASSIFICATION EXAMPLE")
print("=" * 60)

training_data = [
    ("Get rich quick! Make money fast with this amazing offer", "spam"),
    ("Congratulations! You've won a free lottery prize click now", "spam"),
    ("LIMITED TIME: Buy now and get 90% discount on watches", "spam"),
    ("Your order has been shipped and will arrive Tuesday", "ham"),
    ("Meeting scheduled for 3pm tomorrow in conference room B", "ham"),
    ("Please review the attached quarterly report before Friday", "ham"),
    ("Free trial! No credit card required claim your prize", "spam"),
    ("Your password has been reset successfully", "ham"),
]

train_docs = [doc for doc, _ in training_data]
train_labels = [label for _, label in training_data]

# Train classifier
clf = NaiveBayesTextClassifier(model_type='multinomial', smoothing=1.0)
clf.fit(train_docs, train_labels)

# Test on new documents
test_docs = [
    "Claim your free prize now! Limited time offer",
    "Please find the meeting notes attached",
    "You've been selected as a lucky winner click here",
    "The project deadline has been extended to next week"
]

print("\nPredictions:")
for doc in test_docs:
    pred = clf.predict(doc)
    log_probs = clf.predict_log_proba(doc)
    print(f"  '{doc[:50]}...'")
    print(f"    -> {pred.upper()} (log-odds: {log_probs['spam'] - log_probs['ham']:.2f})")
```

We have established the foundational concepts that underpin probabilistic text classification. These concepts are essential for understanding the specific Naive Bayes variants we'll explore in the following pages.
What's Next:
With the foundations in place, we're ready to dive deep into the two dominant Naive Bayes variants for text: Multinomial Naive Bayes, which models word counts drawn from a per-class distribution over the vocabulary, and Bernoulli Naive Bayes, which models the presence or absence of each vocabulary word.
You now understand the fundamental principles of text classification and why Naive Bayes is particularly well-suited for this domain. In the next page, we'll derive the Multinomial Naive Bayes model from first principles and see how it models documents as samples from a categorical distribution over the vocabulary.