Imagine you have two urns, each filled with colored balls. The "spam" urn contains mostly red balls (words like "free", "winner", "urgent"), while the "ham" urn contains mostly blue balls (words like "meeting", "report", "schedule"). When you receive an email, it's as if someone secretly chose an urn, drew several balls (words) from it, and showed you the result. Your task: figure out which urn the balls came from.
This simple analogy captures the essence of Multinomial Naive Bayes—the most widely used variant for text classification. The model treats each document as a sequence of word draws from a class-specific multinomial distribution, and classification becomes the problem of inferring which distribution most likely generated the observed document.
By the end of this page, you will understand the probabilistic foundations of Multinomial Naive Bayes, derive the parameter estimation formulas from maximum likelihood, implement the complete algorithm from scratch, analyze its computational complexity, and understand the conditions under which it excels or fails.
Before diving into Naive Bayes, we must understand the multinomial distribution—the probability distribution that underpins the entire model.
Definition:
The multinomial distribution is the generalization of the binomial distribution to multiple outcomes. If we have $k$ possible outcomes with probabilities $\theta_1, \theta_2, \ldots, \theta_k$ (where $\sum_{i=1}^{k} \theta_i = 1$), and we conduct $n$ independent trials, the probability of observing exactly $x_1$ occurrences of outcome 1, $x_2$ occurrences of outcome 2, etc., is:
$$P(x_1, x_2, \ldots, x_k | n, \boldsymbol{\theta}) = \frac{n!}{x_1! x_2! \cdots x_k!} \prod_{i=1}^{k} \theta_i^{x_i}$$
where $\sum_{i=1}^{k} x_i = n$.
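To make the formula concrete, here is a minimal sketch that evaluates it directly; the outcome probabilities and counts are made-up numbers chosen only for illustration.

```python
from math import factorial, prod

def multinomial_pmf(x, theta):
    """P(x_1, ..., x_k | n, theta) for the multinomial distribution, with n = sum(x)."""
    n = sum(x)
    coeff = factorial(n) // prod(factorial(xi) for xi in x)   # n! / (x_1! ... x_k!)
    return coeff * prod(t ** xi for xi, t in zip(x, theta))   # times prod theta_i^{x_i}

# Made-up example: k = 3 outcomes, n = 4 trials
print(multinomial_pmf([2, 1, 1], [0.5, 0.3, 0.2]))  # 12 * 0.5^2 * 0.3 * 0.2 = 0.18
```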
The Document Generation Analogy:
In the context of text classification:
| Statistical Concept | Text Classification Interpretation | Example |
|---|---|---|
| Number of trials $n$ | Document length (total words) | An email with 150 words has $n = 150$ |
| Number of outcomes $k$ | Vocabulary size | $k = 50,000$ for a typical corpus |
| Outcome probabilities $\boldsymbol{\theta}$ | Word distribution for class $c$ | $P(\text{"free"} | \text{spam}) = 0.05$ |
| Observation counts $\mathbf{x}$ | Word frequencies in document | "free" appears 3 times → $x_{\text{free}} = 3$ |
| Multinomial coefficient | Number of ways to arrange words | Same for all classes, so cancels in ratio |
Why Multinomial Makes Sense for Documents:
The multinomial model captures a fundamental intuition about document generation: word choice is a random process with class-dependent probabilities. A spam email doesn't contain "free" by accident—it appears because spam messages systematically use certain words more frequently.
However, the multinomial model makes strong assumptions:
Exchangeability: The probability of a document depends only on word counts, not word order. "The cat sat on the mat" and "mat the sat cat on the" have identical probabilities.
Independence: Each word is drawn independently. The appearance of "machine" doesn't affect the probability of "learning" appearing next.
Fixed vocabulary: All possible words are known a priori, and their total probability sums to 1.
These assumptions are violated by natural language—yet the model works remarkably well for classification because we only need to correctly rank class probabilities, not estimate them accurately.
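A quick sanity check of the exchangeability assumption: both orderings of the sentence above collapse to the same count vector, so the multinomial model assigns them identical probability.

```python
from collections import Counter

a = Counter("The cat sat on the mat".lower().split())
b = Counter("mat the sat cat on the".lower().split())
print(a == b)  # True: identical counts, hence identical multinomial probability
```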
Now we derive the complete Multinomial Naive Bayes classifier, starting from Bayes' theorem and the multinomial assumption.
Step 1: Apply Bayes' Theorem
For a document $d$ represented as a vector of word counts $\mathbf{x} = (x_1, x_2, \ldots, x_{|V|})$:
$$P(c | \mathbf{x}) = \frac{P(\mathbf{x} | c) P(c)}{P(\mathbf{x})}$$
Since $P(\mathbf{x})$ is constant across classes, classification requires:
$$\hat{c} = \arg\max_{c} P(c) P(\mathbf{x} | c)$$
Step 2: Model the Likelihood Using Multinomial
Under the multinomial model:
$$P(\mathbf{x} | c) = \frac{n!}{\prod_i x_i!} \prod_{i=1}^{|V|} P(w_i | c)^{x_i}$$
where $n = \sum_i x_i$ is the document length.
The multinomial coefficient $\frac{n!}{\prod_i x_i!}$ is the same for all classes (it depends only on the document, not the class), so we can ignore it for classification:
$$\hat{c} = \arg\max_{c} P(c) \prod_{i=1}^{|V|} P(w_i | c)^{x_i}$$
Step 3: Convert to Log-Space
Multiplying many small probabilities causes numerical underflow. We take logarithms:
$$\hat{c} = \arg\max_{c} \left[ \log P(c) + \sum_{i=1}^{|V|} x_i \log P(w_i | c) \right]$$
This is the fundamental equation of Multinomial Naive Bayes classification.
To classify a document: (1) Start with the log-prior for each class, (2) For each word in the document, add its count times the log-probability of that word given the class, (3) Predict the class with the highest total. This is simply a weighted sum of word log-probabilities, where weights are word frequencies.
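A minimal sketch of this scoring rule, assuming hypothetical `log_prior` and `log_likelihood` dictionaries with made-up values (unseen words are simply skipped here; smoothing is handled in the full implementation below):

```python
import math
from collections import Counter

def score_classes(document: str, log_prior: dict, log_likelihood: dict) -> dict:
    """Return log P(c) + sum_w count(w, d) * log P(w|c) for every class c."""
    counts = Counter(document.lower().split())
    return {
        # words missing from the table contribute nothing in this simplified sketch
        c: log_prior[c] + sum(n * log_likelihood[c].get(w, 0.0) for w, n in counts.items())
        for c in log_prior
    }

# Hypothetical parameters for a two-word vocabulary
log_prior = {"spam": math.log(0.4), "ham": math.log(0.6)}
log_likelihood = {
    "spam": {"free": math.log(0.7), "meeting": math.log(0.3)},
    "ham":  {"free": math.log(0.2), "meeting": math.log(0.8)},
}
scores = score_classes("free free free meeting", log_prior, log_likelihood)
print(max(scores, key=scores.get))  # "spam" for this toy example
```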
Step 4: Geometric Interpretation
The classification decision can be understood geometrically. For binary classification with classes $c_1$ and $c_2$, the decision boundary is:
$$\log P(c_1) + \sum_i x_i \log P(w_i | c_1) = \log P(c_2) + \sum_i x_i \log P(w_i | c_2)$$
Rearranging:
$$\sum_i x_i \left[ \log P(w_i | c_1) - \log P(w_i | c_2) \right] = \log P(c_2) - \log P(c_1)$$
Defining $w_i^* = \log P(w_i | c_1) - \log P(w_i | c_2)$ and $b = \log P(c_2) - \log P(c_1)$:
$$\sum_i x_i w_i^* = b$$
This is a linear decision boundary in the space of word counts! Multinomial Naive Bayes is fundamentally a linear classifier, with weights determined by the log-odds ratios of word probabilities.
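Reusing the toy numbers from the sketch above, the snippet below builds the weights $w_i^*$ and bias $b$ explicitly and applies the linear rule; the resulting decision is the same as directly comparing the two log-posteriors.

```python
import math

# Toy two-word vocabulary with made-up class-conditional probabilities
p_spam_words = {"free": 0.7, "meeting": 0.3}
p_ham_words  = {"free": 0.2, "meeting": 0.8}
p_spam, p_ham = 0.4, 0.6

# Weights and bias of the equivalent linear classifier
w_star = {w: math.log(p_spam_words[w]) - math.log(p_ham_words[w]) for w in p_spam_words}
b = math.log(p_ham) - math.log(p_spam)

x = {"free": 3, "meeting": 1}  # word counts of a document
linear_score = sum(x[w] * w_star[w] for w in x) - b
print("spam" if linear_score > 0 else "ham")  # same decision as comparing log-posteriors
```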
The Multinomial Naive Bayes model has two sets of parameters to estimate from training data:
Estimating Class Priors:
The maximum likelihood estimate (MLE) for the prior probability of class $c$ is simply the fraction of training documents belonging to that class:
$$\hat{P}(c) = \frac{N_c}{N}$$
where $N_c$ is the number of documents in class $c$ and $N$ is the total number of training documents.
Estimating Word Likelihoods:
The MLE for the probability of word $w$ given class $c$ is the fraction of word occurrences in class $c$ that are word $w$:
$$\hat{P}(w | c) = \frac{\text{count}(w, c)}{\sum_{w' \in V} \text{count}(w', c)} = \frac{\text{count}(w, c)}{\text{total words in class } c}$$
where $\text{count}(w, c)$ is the number of times word $w$ appears across all documents of class $c$.
The Frequency Table:
| Word | Count in Spam | Count in Ham | $P(w|\text{spam})$ | $P(w|\text{ham})$ |
|---|---|---|---|---|
| free | 150 | 5 | 150/3000 = 0.050 | 5/4000 = 0.001 |
| meeting | 10 | 200 | 10/3000 = 0.003 | 200/4000 = 0.050 |
| winner | 120 | 8 | 120/3000 = 0.040 | 8/4000 = 0.002 |
| report | 15 | 180 | 15/3000 = 0.005 | 180/4000 = 0.045 |
| ... | ... | ... | ... | ... |
| Total | 3000 | 4000 | 1.000 | 1.000 |
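As a quick check, the unsmoothed MLE estimates in the table follow directly from the formula above, using the table's hypothetical spam counts:

```python
# Unsmoothed MLE estimates matching the (hypothetical) counts in the table above
spam_counts = {"free": 150, "meeting": 10, "winner": 120, "report": 15}
total_spam_tokens = 3000  # total word tokens across all spam documents

p_w_given_spam = {w: c / total_spam_tokens for w, c in spam_counts.items()}
print(p_w_given_spam["free"], p_w_given_spam["winner"])  # 0.05 0.04
```

The full from-scratch implementation below wraps these estimates, together with Laplace smoothing, into a reusable classifier.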
```python
import numpy as np
from collections import Counter, defaultdict
from typing import List, Dict, Tuple
import math


class MultinomialNaiveBayes:
    """
    Complete implementation of Multinomial Naive Bayes from scratch.

    This implementation prioritizes clarity and understanding over
    computational efficiency. Every step is explicitly documented.
    """

    def __init__(self, alpha: float = 1.0):
        """
        Initialize the classifier.

        Args:
            alpha: Laplace smoothing parameter.
                α = 1 is add-one smoothing (Laplace)
                α = 0 is no smoothing (pure MLE)
                α < 1 is Lidstone smoothing
        """
        self.alpha = alpha

        # Model parameters (learned during fit)
        self.classes_: List[str] = []
        self.log_priors_: Dict[str, float] = {}                  # log P(c)
        self.log_likelihoods_: Dict[str, Dict[str, float]] = {}  # log P(w|c)
        self.vocabulary_: set = set()
        self.vocab_size_: int = 0

        # Statistics for interpretability
        self.class_word_counts_: Dict[str, int] = {}

    def _tokenize(self, document: str) -> List[str]:
        """Simple whitespace tokenization with lowercasing."""
        return document.lower().split()

    def fit(self, documents: List[str], labels: List[str]) -> 'MultinomialNaiveBayes':
        """
        Fit the model using Maximum Likelihood Estimation with smoothing.

        The fitting process:
        1. Build vocabulary from all training documents
        2. Count class frequencies → compute log P(c)
        3. Count word-class co-occurrences → compute log P(w|c)

        Args:
            documents: List of text documents
            labels: List of class labels (same length as documents)

        Returns:
            self (fitted model)
        """
        n_docs = len(documents)

        # Step 1: Tokenize all documents and build vocabulary
        tokenized_docs = [self._tokenize(doc) for doc in documents]
        for tokens in tokenized_docs:
            self.vocabulary_.update(tokens)
        self.vocab_size_ = len(self.vocabulary_)

        # Step 2: Count class frequencies
        self.classes_ = sorted(set(labels))
        class_doc_counts = Counter(labels)

        # Compute log priors: log P(c) = log(N_c / N)
        self.log_priors_ = {
            c: math.log(class_doc_counts[c] / n_docs)
            for c in self.classes_
        }

        # Step 3: Count word occurrences per class
        # word_counts[class][word] = count of word in all docs of class
        word_counts = {c: defaultdict(int) for c in self.classes_}
        for tokens, label in zip(tokenized_docs, labels):
            for token in tokens:
                word_counts[label][token] += 1

        # Total words per class (for normalization)
        self.class_word_counts_ = {
            c: sum(word_counts[c].values()) for c in self.classes_
        }

        # Step 4: Compute log-likelihoods with Laplace smoothing
        # log P(w|c) = log[(count(w,c) + α) / (total_c + α|V|)]
        self.log_likelihoods_ = {}
        for c in self.classes_:
            total_words_c = self.class_word_counts_[c]
            denominator = total_words_c + self.alpha * self.vocab_size_
            self.log_likelihoods_[c] = {
                word: math.log((word_counts[c][word] + self.alpha) / denominator)
                for word in self.vocabulary_
            }

        return self

    def predict_log_proba(self, document: str) -> Dict[str, float]:
        """
        Compute log-probabilities for each class.

        For each class c:
            log P(c|d) ∝ log P(c) + Σ_w count(w,d) * log P(w|c)

        Args:
            document: Text document to classify

        Returns:
            Dictionary mapping class names to log-probabilities
        """
        tokens = self._tokenize(document)
        word_counts = Counter(tokens)

        log_probs = {}
        for c in self.classes_:
            # Start with log prior
            log_prob = self.log_priors_[c]

            # Add log-likelihood for each word (weighted by count)
            for word, count in word_counts.items():
                if word in self.log_likelihoods_[c]:
                    log_prob += count * self.log_likelihoods_[c][word]
                else:
                    # Unknown word: use smoothing estimate
                    # P(unknown|c) ≈ α / (total_c + α|V|)
                    log_unknown = math.log(
                        self.alpha /
                        (self.class_word_counts_[c] + self.alpha * self.vocab_size_)
                    )
                    log_prob += count * log_unknown

            log_probs[c] = log_prob

        return log_probs

    def predict_proba(self, document: str) -> Dict[str, float]:
        """
        Compute normalized probabilities for each class.

        Uses the log-sum-exp trick for numerical stability:
            P(c|d) = exp(log P(c|d) - log Σ_c' exp(log P(c'|d)))
        """
        log_probs = self.predict_log_proba(document)

        # Log-sum-exp for numerical stability
        max_log_prob = max(log_probs.values())
        log_sum = max_log_prob + math.log(
            sum(math.exp(lp - max_log_prob) for lp in log_probs.values())
        )

        return {c: math.exp(lp - log_sum) for c, lp in log_probs.items()}

    def predict(self, document: str) -> str:
        """Predict the most likely class for a document."""
        log_probs = self.predict_log_proba(document)
        return max(log_probs, key=log_probs.get)

    def get_feature_importance(self, class_name: str,
                               top_n: int = 10) -> List[Tuple[str, float]]:
        """
        Get the most important features (words) for a class.

        Returns words with highest P(w|c), which are most indicative
        of that class.
        """
        if class_name not in self.classes_:
            raise ValueError(f"Unknown class: {class_name}")

        word_probs = [
            (word, math.exp(log_prob))
            for word, log_prob in self.log_likelihoods_[class_name].items()
        ]
        return sorted(word_probs, key=lambda x: x[1], reverse=True)[:top_n]


# ============================================================
# COMPLETE EXAMPLE: Email Classification
# ============================================================

# Training data: emails labeled as spam or ham
training_emails = [
    # Spam examples
    ("FREE FREE FREE! You have won $1000 lottery prize! Claim now!", "spam"),
    ("Congratulations! You've been selected for a free vacation", "spam"),
    ("URGENT: Your account will be suspended unless you verify", "spam"),
    ("Make money fast! Work from home and earn $5000 daily", "spam"),
    ("Limited time offer! Get 90% discount on luxury watches now", "spam"),
    ("You are a winner! Click here to claim your prize money", "spam"),
    # Ham examples
    ("Meeting scheduled for tomorrow at 3pm in conference room B", "ham"),
    ("Please review the attached quarterly report before Friday", "ham"),
    ("Can we reschedule our lunch meeting to next week?", "ham"),
    ("The project deadline has been extended to December 15th", "ham"),
    ("Your order has been shipped and will arrive on Tuesday", "ham"),
    ("Here are the meeting notes from today's discussion", "ham"),
]

# Separate features and labels
docs = [email for email, _ in training_emails]
labels = [label for _, label in training_emails]

# Train the model
model = MultinomialNaiveBayes(alpha=1.0)
model.fit(docs, labels)

# Print model statistics
print("=" * 60)
print("MULTINOMIAL NAIVE BAYES MODEL SUMMARY")
print("=" * 60)
print(f"\nVocabulary size: {model.vocab_size_} words")
print(f"Classes: {model.classes_}")

print("\nClass Priors:")
for c in model.classes_:
    print(f"  P({c}) = exp({model.log_priors_[c]:.4f}) = {math.exp(model.log_priors_[c]):.4f}")

print("\nTop 5 Words per Class:")
for c in model.classes_:
    top_words = model.get_feature_importance(c, top_n=5)
    print(f"\n  {c.upper()}:")
    for word, prob in top_words:
        print(f"    {word:15s} P(w|{c}) = {prob:.4f}")

# Test on new emails
print("\n" + "=" * 60)
print("CLASSIFICATION RESULTS")
print("=" * 60)

test_emails = [
    "Claim your free prize now limited time offer winner",
    "The meeting has been rescheduled to Monday afternoon",
    "URGENT winner notification verify your account immediately",
    "Please find attached the updated project schedule"
]

for email in test_emails:
    prediction = model.predict(email)
    probs = model.predict_proba(email)
    print(f"\nEmail: '{email[:50]}...'")
    print(f"  Prediction: {prediction.upper()}")
    print(f"  P(spam|email) = {probs['spam']:.4f}")
    print(f"  P(ham|email) = {probs['ham']:.4f}")
```

A crucial insight is that Multinomial Naive Bayes can be rewritten in log-linear form, revealing its connection to logistic regression and other linear classifiers.
Log-Linear Representation:
The classification score for class $c$ is:
$$\text{score}(c, d) = \log P(c) + \sum_{w \in V} \text{count}(w, d) \cdot \log P(w | c)$$
This can be written as a dot product:
$$\text{score}(c, d) = b_c + \mathbf{w}_c^T \mathbf{x}$$
where:
- $b_c = \log P(c)$ is the class bias term,
- $\mathbf{w}_c$ is the weight vector with components $w_{c,i} = \log P(w_i | c)$,
- $\mathbf{x}$ is the vector of word counts for the document.
Interpretation of Weights:
The weight $w_{c,i} = \log P(w_i | c)$ captures how much the presence of word $w_i$ supports class $c$. But absolute values are less informative than relative weights (log-odds ratios):
Log-Odds Ratio Analysis:
For binary classification (e.g., spam vs. ham), the most informative quantity is the log-odds ratio:
$$\text{LOR}(w) = \log \frac{P(w | \text{spam})}{P(w | \text{ham})}$$
The magnitude of LOR indicates strength of evidence: $|\text{LOR}(w)| = 2$ means the word is $e^2 \approx 7.4$ times more likely in one class than the other.
```python
import numpy as np
import math
from typing import List, Tuple, Dict


def analyze_log_odds(model: 'MultinomialNaiveBayes',
                     class_pos: str = 'spam',
                     class_neg: str = 'ham') -> Dict[str, float]:
    """
    Compute log-odds ratios for all words.

    LOR(w) = log[P(w|pos) / P(w|neg)]

    Returns dictionary mapping words to their LOR.
    """
    log_odds = {}
    for word in model.vocabulary_:
        log_p_pos = model.log_likelihoods_[class_pos][word]
        log_p_neg = model.log_likelihoods_[class_neg][word]
        log_odds[word] = log_p_pos - log_p_neg
    return log_odds


def get_most_discriminative_words(log_odds: Dict[str, float],
                                  n: int = 10) -> Tuple[List, List]:
    """
    Get most discriminative words for each class.

    Returns:
        (words_for_spam, words_for_ham) - each is a list of (word, LOR)
    """
    sorted_words = sorted(log_odds.items(), key=lambda x: x[1], reverse=True)
    spam_words = sorted_words[:n]        # Highest LOR = most spam-indicative
    ham_words = sorted_words[-n:][::-1]  # Lowest LOR = most ham-indicative
    return spam_words, ham_words


def explain_prediction(model: 'MultinomialNaiveBayes',
                       document: str,
                       class_pos: str = 'spam',
                       class_neg: str = 'ham') -> None:
    """
    Explain a classification decision by showing word contributions.

    For each word, shows how much it pushes toward spam vs ham.
    """
    from collections import Counter

    tokens = document.lower().split()
    word_counts = Counter(tokens)
    log_odds = analyze_log_odds(model, class_pos, class_neg)

    print(f"\nAnalyzing: '{document[:60]}...'")
    print("-" * 60)

    # Prior contribution
    prior_contribution = model.log_priors_[class_pos] - model.log_priors_[class_neg]
    print(f"Prior (log-odds): {prior_contribution:+.4f}")

    # Word contributions
    contributions = []
    for word, count in word_counts.items():
        if word in log_odds:
            contribution = count * log_odds[word]
            contributions.append((word, count, log_odds[word], contribution))

    # Sort by absolute contribution
    contributions.sort(key=lambda x: abs(x[3]), reverse=True)

    print("\nWord contributions (sorted by impact):")
    print(f"{'Word':15s} {'Count':>6s} {'LOR':>8s} {'Impact':>10s}")
    print("-" * 45)

    total_word_contribution = 0
    for word, count, lor, impact in contributions[:10]:
        direction = "→spam" if impact > 0 else "→ham"
        print(f"{word:15s} {count:>6d} {lor:>+8.3f} {impact:>+10.3f} {direction}")
        total_word_contribution += impact

    if len(contributions) > 10:
        remaining = sum(c[3] for c in contributions[10:])
        print(f"{'(other words)':15s} {'':>6s} {'':>8s} {remaining:>+10.3f}")
        total_word_contribution += remaining

    total_log_odds = prior_contribution + total_word_contribution
    print("-" * 45)
    print(f"{'TOTAL':15s} {'':>6s} {'':>8s} {total_log_odds:>+10.3f}")

    # Convert to probability
    prob_spam = 1 / (1 + math.exp(-total_log_odds))
    print(f"\nP(spam|document) = {prob_spam:.4f}")
    print(f"Prediction: {'SPAM' if total_log_odds > 0 else 'HAM'}")


# Example analysis (continuing from previous code)
# Assuming 'model' is already trained...

print("=" * 60)
print("LOG-ODDS RATIO ANALYSIS")
print("=" * 60)

# Analyze word discriminability
log_odds = analyze_log_odds(model, 'spam', 'ham')
spam_words, ham_words = get_most_discriminative_words(log_odds, n=5)

print("\nMost Spam-Indicative Words:")
print(f"{'Word':15s} {'LOR':>10s} {'Odds Ratio':>12s}")
print("-" * 40)
for word, lor in spam_words:
    print(f"{word:15s} {lor:>+10.3f} {math.exp(lor):>12.2f}x")

print("\nMost Ham-Indicative Words:")
print(f"{'Word':15s} {'LOR':>10s} {'Odds Ratio':>12s}")
print("-" * 40)
for word, lor in ham_words:
    print(f"{word:15s} {lor:>+10.3f} {math.exp(lor):>12.2f}x")

# Explain specific predictions
print("\n" + "=" * 60)
print("PREDICTION EXPLANATIONS")
print("=" * 60)

explain_prediction(model, "Free prize winner claim your lottery money now")
explain_prediction(model, "Meeting tomorrow to discuss the project deadline")
```

Unlike deep learning models that operate as black boxes, Multinomial Naive Bayes is fully interpretable. You can examine $P(w|c)$ for every word, compute log-odds ratios, and trace exactly why the classifier made any particular decision. This interpretability is invaluable for debugging, explaining predictions to stakeholders, and identifying dataset biases.
One of Multinomial Naive Bayes's most compelling advantages is its computational efficiency. Both training and prediction are remarkably fast, scaling linearly with data size.
Training Complexity:
Let:
- $N$ = number of training documents
- $L$ = average document length (in tokens)
- $|V|$ = vocabulary size
- $|C|$ = number of classes
Training requires:
- One pass over all tokens to build the vocabulary and accumulate word-class counts: $O(N \cdot L)$
- Computing log-priors from the class document counts: $O(|C|)$
- Computing a smoothed log-likelihood for every (class, word) pair: $O(|C| \cdot |V|)$
Total training complexity: $O(N \cdot L + |C| \cdot |V|)$
Since $|C| \cdot |V|$ is typically much smaller than $N \cdot L$ (vocabulary grows sublinearly with corpus size), training is effectively $O(N \cdot L)$ — linear in the total number of word tokens in the training data.
Prediction Complexity:
For a document of length $L_d$:
- Tokenizing the document and counting its words: $O(L_d)$
- For each of the $|C|$ classes, summing $x_i \log P(w_i | c)$ over the document's words: $O(|C| \cdot L_d)$
Total prediction complexity: $O(|C| \cdot L_d)$ — linear in document length times number of classes.
| Algorithm | Training | Prediction | Memory |
|---|---|---|---|
| Multinomial NB | $O(N \cdot L)$ | $O(|C| \cdot L_d)$ | $O(|C| \cdot |V|)$ |
| Logistic Regression | $O(N \cdot L \cdot |V| \cdot I)$* | $O(|C| \cdot |V|)$ | $O(|C| \cdot |V|)$ |
| SVM (linear) | $O(N \cdot L \cdot |V|)$ | $O(|C| \cdot |V|)$ | $O(|C| \cdot |V|)$ |
| Random Forest | $O(N \cdot L \cdot |V| \cdot T \cdot D)$** | $O(T \cdot D)$ | $O(T \cdot 2^D)$ |
| Neural Network | $O(N \cdot L \cdot |V| \cdot H \cdot E)$*** | $O(|V| \cdot H)$ | $O(|V| \cdot H + H^2)$ |
*$I$ = iterations, **$T$ = trees, $D$ = depth, ***$H$ = hidden units, $E$ = epochs
Memory Efficiency:
Multinomial NB stores only:
- The log-priors: $|C|$ values
- The log-likelihood table: $|C| \cdot |V|$ values
- The vocabulary itself, for mapping words to table entries
For $|C| = 2$ and $|V| = 50,000$, this is only 100,000 floating-point numbers, or roughly 400 KB as 32-bit floats. Compare this to neural networks that may require hundreds of megabytes.
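A quick way to confirm that figure, assuming NumPy is available:

```python
import numpy as np

# |C| x |V| table of log-probabilities stored as 32-bit floats
n_classes, vocab_size = 2, 50_000
table = np.zeros((n_classes, vocab_size), dtype=np.float32)
print(table.nbytes)  # 400_000 bytes, i.e. about 400 KB (float64 would double this)
```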
Incremental Learning:
A remarkable property of Multinomial NB is its support for incremental updates. Since the model is based entirely on word counts, new documents can be incorporated without retraining from scratch:
- Increment the document count $N_c$ for the new document's class
- Add the document's word counts to $\text{count}(w, c)$ and to the class's total word count
- Recompute the affected log-priors and log-likelihoods, which is cheap because the counts are already aggregated
This makes Multinomial NB ideal for streaming applications where models must be updated continuously.
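One possible way to structure such updates is sketched below; the class and method names are illustrative rather than part of the implementation above. The same idea is exposed by scikit-learn's `MultinomialNB.partial_fit`.

```python
from collections import Counter, defaultdict
import math

class StreamingNBCounts:
    """Sketch of incremental Naive Bayes updates: store raw counts,
    derive smoothed log-probabilities on demand."""

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha
        self.doc_counts = Counter()              # N_c per class
        self.word_counts = defaultdict(Counter)  # count(w, c)
        self.vocabulary = set()

    def update(self, document: str, label: str) -> None:
        """Fold one new labeled document into the counts (no full retraining)."""
        tokens = document.lower().split()
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokens)
        self.vocabulary.update(tokens)

    def log_likelihood(self, word: str, label: str) -> float:
        """Laplace-smoothed log P(word | label) from the current counts."""
        total = sum(self.word_counts[label].values())
        denom = total + self.alpha * len(self.vocabulary)
        return math.log((self.word_counts[label][word] + self.alpha) / denom)

# Usage: counts can be folded in one document at a time as data streams in
nb = StreamingNBCounts()
nb.update("free prize winner", "spam")
nb.update("meeting notes attached", "ham")
print(nb.log_likelihood("free", "spam"))
```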
In practice, Multinomial NB can train on millions of documents in seconds. The 20 Newsgroups dataset (18,846 documents, ~100K vocabulary) trains in under 1 second on commodity hardware. Compare this to BERT fine-tuning, which requires hours on GPU hardware for similar-sized datasets.
A subtle but important issue with Multinomial Naive Bayes is its sensitivity to document length. The log-likelihood of a document grows (or shrinks) with its length, which can bias predictions in unexpected ways.
The Length Problem:
Consider two documents with identical class-conditional word distributions:
- Document A: a short message of a few dozen words
- Document B: a much longer message with the same proportions of each word
Even if both documents have the same proportion of spam-indicative words, Document B will have more extreme log-likelihoods simply because it has more words. This can cause problems when classifying documents of varying lengths.
Mathematical Analysis:
For a document $d$ of length $n$ from class $c$:
$$\log P(d | c) = \sum_{w} \text{count}(w, d) \cdot \log P(w | c)$$
Since $\sum_w \text{count}(w, d) = n$, and most words have $P(w|c) < 1$ (so $\log P(w|c) < 0$), longer documents have more negative log-likelihoods.
Normalization Strategies:
Common remedies rescale the count vector before scoring: length normalization (divide each count by the document length), L2 normalization (divide by the Euclidean norm of the count vector), and sublinear term frequency (replace each count $x$ with $1 + \log x$). The code below compares raw and length-normalized scoring.
```python
import numpy as np
from collections import Counter
import math


def demonstrate_length_effect():
    """
    Show how document length affects Multinomial NB predictions.
    """
    # Simplified vocabulary and probabilities
    vocab = ['free', 'money', 'meeting', 'report']

    # Log probabilities (simplified)
    log_p_spam = {'free': -2.0, 'money': -2.5, 'meeting': -5.0, 'report': -5.0}
    log_p_ham = {'free': -5.0, 'money': -5.0, 'meeting': -2.0, 'report': -2.5}

    # Two documents with same PROPORTIONS but different lengths
    short_doc = {'free': 1, 'money': 1, 'meeting': 1, 'report': 1}     # 4 words
    long_doc = {'free': 10, 'money': 10, 'meeting': 10, 'report': 10}  # 40 words

    def compute_log_likelihood(doc, log_probs):
        return sum(count * log_probs[word] for word, count in doc.items())

    # Compute likelihoods
    short_spam = compute_log_likelihood(short_doc, log_p_spam)
    short_ham = compute_log_likelihood(short_doc, log_p_ham)
    long_spam = compute_log_likelihood(long_doc, log_p_spam)
    long_ham = compute_log_likelihood(long_doc, log_p_ham)

    print("Without normalization:")
    print(f"  Short doc: spam={short_spam:.2f}, ham={short_ham:.2f}, diff={short_spam-short_ham:+.2f}")
    print(f"  Long doc:  spam={long_spam:.2f}, ham={long_ham:.2f}, diff={long_spam-long_ham:+.2f}")

    # With length normalization
    short_len = sum(short_doc.values())
    long_len = sum(long_doc.values())

    print("\nWith length normalization:")
    print(f"  Short doc: spam={short_spam/short_len:.2f}, ham={short_ham/short_len:.2f}")
    print(f"  Long doc:  spam={long_spam/long_len:.2f}, ham={long_ham/long_len:.2f}")
    print("  (Now the per-word scores are identical!)")


class LengthNormalizedMultinomialNB(MultinomialNaiveBayes):
    """
    Multinomial Naive Bayes with length normalization.

    Subclasses the from-scratch MultinomialNaiveBayes defined earlier so that
    fitting (vocabulary, priors, smoothed likelihoods) is inherited; only the
    scoring of a new document changes.
    """

    def __init__(self, alpha: float = 1.0, normalize: str = 'none'):
        """
        Args:
            alpha: Smoothing parameter
            normalize: 'none', 'length', 'l2', or 'sublinear'
        """
        super().__init__(alpha=alpha)
        self.normalize = normalize

    def _normalize_counts(self, word_counts: dict) -> dict:
        """Apply normalization to word counts."""
        if self.normalize == 'none':
            return word_counts
        elif self.normalize == 'length':
            # Divide by document length (TF normalization)
            total = sum(word_counts.values())
            if total == 0:
                return word_counts
            return {w: c / total for w, c in word_counts.items()}
        elif self.normalize == 'l2':
            # L2 normalization
            norm = math.sqrt(sum(c ** 2 for c in word_counts.values()))
            if norm == 0:
                return word_counts
            return {w: c / norm for w, c in word_counts.items()}
        elif self.normalize == 'sublinear':
            # Sublinear TF: 1 + log(count) for count > 0
            return {w: (1 + math.log(c)) if c > 0 else 0
                    for w, c in word_counts.items()}
        return word_counts

    def predict_log_proba(self, document: str) -> dict:
        """Predict with normalization applied to the document's counts."""
        tokens = document.lower().split()
        word_counts = Counter(tokens)

        # Apply normalization
        normalized_counts = self._normalize_counts(word_counts)

        log_probs = {}
        for c in self.classes_:
            log_prob = self.log_priors_[c]
            for word, count in normalized_counts.items():
                # Unknown words are skipped here for brevity
                if word in self.log_likelihoods_[c]:
                    log_prob += count * self.log_likelihoods_[c][word]
            log_probs[c] = log_prob
        return log_probs


# Demonstrate the effect
demonstrate_length_effect()
```

Length normalization is most important when: (1) training and test documents have very different lengths, (2) the classification task involves comparing documents of varying sizes, or (3) longer documents tend to belong to specific classes (creating a spurious correlation). If your training and test documents have similar length distributions, normalization may have little effect.
Deploying Multinomial Naive Bayes in production requires attention to several practical issues beyond the core algorithm.
Multinomial NB is ideal when: (1) word frequency matters (longer documents with more mentions = stronger evidence), (2) documents are reasonably long (100+ words), (3) vocabulary is large (1000+ words), (4) training data is limited (< 10,000 examples), (5) interpretability is important, or (6) fast training/prediction is required. For very short texts (tweets, queries), consider Bernoulli NB instead.
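For production use, a common route is scikit-learn's built-in implementation rather than the from-scratch class above. A minimal sketch, assuming scikit-learn is installed (the documents and labels below are placeholders):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder training data
docs = ["free prize winner claim now", "meeting notes attached",
        "claim your free money", "quarterly report review friday"]
labels = ["spam", "ham", "spam", "ham"]

clf = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
clf.fit(docs, labels)

print(clf.predict(["free winner claim now"]))          # expected to lean toward 'spam'
print(clf.classes_)                                    # column order for predict_proba
print(clf.predict_proba(["please review the report"]))
```

Here `CountVectorizer` produces the bag-of-words counts the model expects; swapping in `TfidfVectorizer(sublinear_tf=True)` is one way to apply the length-related normalizations discussed earlier.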
We have conducted a thorough examination of Multinomial Naive Bayes, from its probabilistic foundations to practical implementation details.
What's Next:
In the next page, we explore Bernoulli Naive Bayes—an alternative model that treats documents as binary vectors indicating word presence or absence. This model often outperforms Multinomial NB on short texts and provides complementary insights into document classification.
You now have a deep understanding of Multinomial Naive Bayes—its mathematical foundations, implementation, and practical considerations. You can implement the algorithm from scratch, analyze its decisions through log-odds ratios, and deploy it effectively for text classification tasks.