Imagine you have two urns, each filled with colored balls. The "spam" urn contains mostly red balls (words like "free", "winner", "urgent"), while the "ham" urn contains mostly blue balls (words like "meeting", "report", "schedule"). When you receive an email, it's as if someone secretly chose an urn, drew several balls (words) from it, and showed you the result. Your task: figure out which urn the balls came from.
This simple analogy captures the essence of Multinomial Naive Bayes—the most widely used variant for text classification. The model treats each document as a sequence of word draws from a class-specific multinomial distribution, and classification becomes the problem of inferring which distribution most likely generated the observed document.
By the end of this page, you will understand the probabilistic foundations of Multinomial Naive Bayes, derive the parameter estimation formulas from maximum likelihood, implement the complete algorithm from scratch, analyze its computational complexity, and understand the conditions under which it excels or fails.
Before diving into Naive Bayes, we must understand the multinomial distribution—the probability distribution that underpins the entire model.
Definition:
The multinomial distribution is the generalization of the binomial distribution to multiple outcomes. If we have $k$ possible outcomes with probabilities $\theta_1, \theta_2, \ldots, \theta_k$ (where $\sum_{i=1}^{k} \theta_i = 1$), and we conduct $n$ independent trials, the probability of observing exactly $x_1$ occurrences of outcome 1, $x_2$ occurrences of outcome 2, etc., is:
$$P(x_1, x_2, \ldots, x_k | n, \boldsymbol{\theta}) = \frac{n!}{x_1! x_2! \cdots x_k!} \prod_{i=1}^{k} \theta_i^{x_i}$$
where $\sum_{i=1}^{k} x_i = n$.
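To make the formula concrete, here is a minimal sketch that evaluates it directly; the outcome probabilities and counts are made-up numbers chosen only for illustration.

```python
from math import factorial, prod

def multinomial_pmf(x, theta):
    """P(x_1, ..., x_k | n, theta) for the multinomial distribution, with n = sum(x)."""
    n = sum(x)
    coeff = factorial(n) // prod(factorial(xi) for xi in x)   # n! / (x_1! ... x_k!)
    return coeff * prod(t ** xi for xi, t in zip(x, theta))   # times prod theta_i^{x_i}

# Made-up example: k = 3 outcomes, n = 4 trials
print(multinomial_pmf([2, 1, 1], [0.5, 0.3, 0.2]))  # 12 * 0.5^2 * 0.3 * 0.2 = 0.18
```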
The Document Generation Analogy:
In the context of text classification:
| Statistical Concept | Text Classification Interpretation | Example |
|---|---|---|
| Number of trials $n$ | Document length (total words) | An email with 150 words has $n = 150$ |
| Number of outcomes $k$ | Vocabulary size | $k = 50,000$ for a typical corpus |
| Outcome probabilities $\boldsymbol{\theta}$ | Word distribution for class $c$ | $P(\text{"free"} | \text{spam}) = 0.05$ |
| Observation counts $\mathbf{x}$ | Word frequencies in document | "free" appears 3 times → $x_{\text{free}} = 3$ |
| Multinomial coefficient | Number of ways to arrange words | Same for all classes, so cancels in ratio |
Why Multinomial Makes Sense for Documents:
The multinomial model captures a fundamental intuition about document generation: word choice is a random process with class-dependent probabilities. A spam email doesn't contain "free" by accident—it appears because spam messages systematically use certain words more frequently.
However, the multinomial model makes strong assumptions:
Exchangeability: The probability of a document depends only on word counts, not word order. "The cat sat on the mat" and "mat the sat cat on the" have identical probabilities.
Independence: Each word is drawn independently. The appearance of "machine" doesn't affect the probability of "learning" appearing next.
Fixed vocabulary: All possible words are known a priori, and their total probability sums to 1.
These assumptions are violated by natural language—yet the model works remarkably well for classification because we only need to correctly rank class probabilities, not estimate them accurately.
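A quick sanity check of the exchangeability assumption: both orderings of the sentence above collapse to the same count vector, so the multinomial model assigns them identical probability.

```python
from collections import Counter

a = Counter("The cat sat on the mat".lower().split())
b = Counter("mat the sat cat on the".lower().split())
print(a == b)  # True: identical counts, hence identical multinomial probability
```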
Now we derive the complete Multinomial Naive Bayes classifier, starting from Bayes' theorem and the multinomial assumption.
Step 1: Apply Bayes' Theorem
For a document $d$ represented as a vector of word counts $\mathbf{x} = (x_1, x_2, \ldots, x_{|V|})$:
$$P(c | \mathbf{x}) = \frac{P(\mathbf{x} | c) P(c)}{P(\mathbf{x})}$$
Since $P(\mathbf{x})$ is constant across classes, classification requires:
$$\hat{c} = \arg\max_{c} P(c) P(\mathbf{x} | c)$$
Step 2: Model the Likelihood Using Multinomial
Under the multinomial model:
$$P(\mathbf{x} | c) = \frac{n!}{\prod_i x_i!} \prod_{i=1}^{|V|} P(w_i | c)^{x_i}$$
where $n = \sum_i x_i$ is the document length.
The multinomial coefficient $\frac{n!}{\prod_i x_i!}$ is the same for all classes (it depends only on the document, not the class), so we can ignore it for classification:
$$\hat{c} = \arg\max_{c} P(c) \prod_{i=1}^{|V|} P(w_i | c)^{x_i}$$
Step 3: Convert to Log-Space
Multiplying many small probabilities causes numerical underflow. We take logarithms:
$$\hat{c} = \arg\max_{c} \left[ \log P(c) + \sum_{i=1}^{|V|} x_i \log P(w_i | c) \right]$$
This is the fundamental equation of Multinomial Naive Bayes classification.
To classify a document: (1) Start with the log-prior for each class, (2) For each word in the document, add its count times the log-probability of that word given the class, (3) Predict the class with the highest total. This is simply a weighted sum of word log-probabilities, where weights are word frequencies.
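A minimal sketch of this scoring rule, assuming hypothetical `log_prior` and `log_likelihood` dictionaries with made-up values (unseen words are simply skipped here; smoothing is handled in the full implementation below):

```python
import math
from collections import Counter

def score_classes(document: str, log_prior: dict, log_likelihood: dict) -> dict:
    """Return log P(c) + sum_w count(w, d) * log P(w|c) for every class c."""
    counts = Counter(document.lower().split())
    return {
        # words missing from the table contribute nothing in this simplified sketch
        c: log_prior[c] + sum(n * log_likelihood[c].get(w, 0.0) for w, n in counts.items())
        for c in log_prior
    }

# Hypothetical parameters for a two-word vocabulary
log_prior = {"spam": math.log(0.4), "ham": math.log(0.6)}
log_likelihood = {
    "spam": {"free": math.log(0.7), "meeting": math.log(0.3)},
    "ham":  {"free": math.log(0.2), "meeting": math.log(0.8)},
}
scores = score_classes("free free free meeting", log_prior, log_likelihood)
print(max(scores, key=scores.get))  # "spam" for this toy example
```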
Step 4: Geometric Interpretation
The classification decision can be understood geometrically. For binary classification with classes $c_1$ and $c_2$, the decision boundary is:
$$\log P(c_1) + \sum_i x_i \log P(w_i | c_1) = \log P(c_2) + \sum_i x_i \log P(w_i | c_2)$$
Rearranging:
$$\sum_i x_i \left[ \log P(w_i | c_1) - \log P(w_i | c_2) \right] = \log P(c_2) - \log P(c_1)$$
Defining $w_i^* = \log P(w_i | c_1) - \log P(w_i | c_2)$ and $b = \log P(c_2) - \log P(c_1)$:
$$\sum_i x_i w_i^* = b$$
This is a linear decision boundary in the space of word counts! Multinomial Naive Bayes is fundamentally a linear classifier, with weights determined by the log-odds ratios of word probabilities.
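Reusing the toy numbers from the sketch above, the snippet below builds the weights $w_i^*$ and bias $b$ explicitly and applies the linear rule; the resulting decision is the same as directly comparing the two log-posteriors.

```python
import math

# Toy two-word vocabulary with made-up class-conditional probabilities
p_spam_words = {"free": 0.7, "meeting": 0.3}
p_ham_words  = {"free": 0.2, "meeting": 0.8}
p_spam, p_ham = 0.4, 0.6

# Weights and bias of the equivalent linear classifier
w_star = {w: math.log(p_spam_words[w]) - math.log(p_ham_words[w]) for w in p_spam_words}
b = math.log(p_ham) - math.log(p_spam)

x = {"free": 3, "meeting": 1}  # word counts of a document
linear_score = sum(x[w] * w_star[w] for w in x) - b
print("spam" if linear_score > 0 else "ham")  # same decision as comparing log-posteriors
```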
The Multinomial Naive Bayes model has two sets of parameters to estimate from training data:
Estimating Class Priors:
The maximum likelihood estimate (MLE) for the prior probability of class $c$ is simply the fraction of training documents belonging to that class:
$$\hat{P}(c) = \frac{N_c}{N}$$
where $N_c$ is the number of documents in class $c$ and $N$ is the total number of training documents.
Estimating Word Likelihoods:
The MLE for the probability of word $w$ given class $c$ is the fraction of word occurrences in class $c$ that are word $w$:
$$\hat{P}(w | c) = \frac{\text{count}(w, c)}{\sum_{w' \in V} \text{count}(w', c)} = \frac{\text{count}(w, c)}{\text{total words in class } c}$$
where $\text{count}(w, c)$ is the number of times word $w$ appears across all documents of class $c$.
The Frequency Table:
| Word | Count in Spam | Count in Ham | $P(w|\text{spam})$ | $P(w|\text{ham})$ |
|---|---|---|---|---|
| free | 150 | 5 | 150/3000 = 0.050 | 5/4000 = 0.001 |
| meeting | 10 | 200 | 10/3000 = 0.003 | 200/4000 = 0.050 |
| winner | 120 | 8 | 120/3000 = 0.040 | 8/4000 = 0.002 |
| report | 15 | 180 | 15/3000 = 0.005 | 180/4000 = 0.045 |
| ... | ... | ... | ... | ... |
| Total | 3000 | 4000 | 1.000 | 1.000 |
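As a quick check, the unsmoothed MLE estimates in the table follow directly from the formula above, using the table's hypothetical spam counts:

```python
# Unsmoothed MLE estimates matching the (hypothetical) counts in the table above
spam_counts = {"free": 150, "meeting": 10, "winner": 120, "report": 15}
total_spam_tokens = 3000  # total word tokens across all spam documents

p_w_given_spam = {w: c / total_spam_tokens for w, c in spam_counts.items()}
print(p_w_given_spam["free"], p_w_given_spam["winner"])  # 0.05 0.04
```

The full from-scratch implementation below wraps these estimates, together with Laplace smoothing, into a reusable classifier.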
```python
import numpy as np
from collections import Counter, defaultdict
from typing import List, Dict, Tuple
import math


class MultinomialNaiveBayes:
    """
    Complete implementation of Multinomial Naive Bayes from scratch.

    This implementation prioritizes clarity and understanding over
    computational efficiency. Every step is explicitly documented.
    """

    def __init__(self, alpha: float = 1.0):
        """
        Initialize the classifier.

        Args:
            alpha: Laplace smoothing parameter.
                α = 1 is add-one smoothing (Laplace)
                α = 0 is no smoothing (pure MLE)
                α < 1 is Lidstone smoothing
        """
        self.alpha = alpha

        # Model parameters (learned during fit)
        self.classes_: List[str] = []
        self.log_priors_: Dict[str, float] = {}                  # log P(c)
        self.log_likelihoods_: Dict[str, Dict[str, float]] = {}  # log P(w|c)
        self.vocabulary_: set = set()
        self.vocab_size_: int = 0

        # Statistics for interpretability
        self.class_word_counts_: Dict[str, int] = {}

    def _tokenize(self, document: str) -> List[str]:
        """Simple whitespace tokenization with lowercasing."""
        return document.lower().split()

    def fit(self, documents: List[str], labels: List[str]) -> 'MultinomialNaiveBayes':
        """
        Fit the model using Maximum Likelihood Estimation with smoothing.

        The fitting process:
        1. Build vocabulary from all training documents
        2. Count class frequencies → compute log P(c)
        3. Count word-class co-occurrences → compute log P(w|c)

        Args:
            documents: List of text documents
            labels: List of class labels (same length as documents)

        Returns:
            self (fitted model)
        """
        n_docs = len(documents)

        # Step 1: Tokenize all documents and build vocabulary
        tokenized_docs = [self._tokenize(doc) for doc in documents]
        for tokens in tokenized_docs:
            self.vocabulary_.update(tokens)
        self.vocab_size_ = len(self.vocabulary_)

        # Step 2: Count class frequencies
        self.classes_ = sorted(set(labels))
        class_doc_counts = Counter(labels)

        # Compute log priors: log P(c) = log(N_c / N)
        self.log_priors_ = {
            c: math.log(class_doc_counts[c] / n_docs)
            for c in self.classes_
        }

        # Step 3: Count word occurrences per class
        # word_counts[class][word] = count of word in all docs of class
        word_counts = {c: defaultdict(int) for c in self.classes_}
        for tokens, label in zip(tokenized_docs, labels):
            for token in tokens:
                word_counts[label][token] += 1

        # Total words per class (for normalization)
        self.class_word_counts_ = {
            c: sum(word_counts[c].values()) for c in self.classes_
        }

        # Step 4: Compute log-likelihoods with Laplace smoothing
        # log P(w|c) = log[(count(w,c) + α) / (total_c + α|V|)]
        self.log_likelihoods_ = {}
        for c in self.classes_:
            total_words_c = self.class_word_counts_[c]
            denominator = total_words_c + self.alpha * self.vocab_size_
            self.log_likelihoods_[c] = {
                word: math.log((word_counts[c][word] + self.alpha) / denominator)
                for word in self.vocabulary_
            }

        return self

    def predict_log_proba(self, document: str) -> Dict[str, float]:
        """
        Compute log-probabilities for each class.

        For each class c:
            log P(c|d) ∝ log P(c) + Σ_w count(w,d) * log P(w|c)

        Args:
            document: Text document to classify

        Returns:
            Dictionary mapping class names to log-probabilities
        """
        tokens = self._tokenize(document)
        word_counts = Counter(tokens)

        log_probs = {}
        for c in self.classes_:
            # Start with log prior
            log_prob = self.log_priors_[c]

            # Add log-likelihood for each word (weighted by count)
            for word, count in word_counts.items():
                if word in self.log_likelihoods_[c]:
                    log_prob += count * self.log_likelihoods_[c][word]
                else:
                    # Unknown word: use smoothing estimate
                    # P(unknown|c) ≈ α / (total_c + α|V|)
                    log_unknown = math.log(
                        self.alpha /
                        (self.class_word_counts_[c] + self.alpha * self.vocab_size_)
                    )
                    log_prob += count * log_unknown

            log_probs[c] = log_prob

        return log_probs

    def predict_proba(self, document: str) -> Dict[str, float]:
        """
        Compute normalized probabilities for each class.

        Uses the log-sum-exp trick for numerical stability:
            P(c|d) = exp(log P(c|d) - log Σ_c' exp(log P(c'|d)))
        """
        log_probs = self.predict_log_proba(document)

        # Log-sum-exp for numerical stability
        max_log_prob = max(log_probs.values())
        log_sum = max_log_prob + math.log(
            sum(math.exp(lp - max_log_prob) for lp in log_probs.values())
        )

        return {c: math.exp(lp - log_sum) for c, lp in log_probs.items()}

    def predict(self, document: str) -> str:
        """Predict the most likely class for a document."""
        log_probs = self.predict_log_proba(document)
        return max(log_probs, key=log_probs.get)

    def get_feature_importance(self, class_name: str,
                               top_n: int = 10) -> List[Tuple[str, float]]:
        """
        Get the most important features (words) for a class.

        Returns words with highest P(w|c), which are most indicative
        of that class.
        """
        if class_name not in self.classes_:
            raise ValueError(f"Unknown class: {class_name}")

        word_probs = [
            (word, math.exp(log_prob))
            for word, log_prob in self.log_likelihoods_[class_name].items()
        ]
        return sorted(word_probs, key=lambda x: x[1], reverse=True)[:top_n]


# ============================================================
# COMPLETE EXAMPLE: Email Classification
# ============================================================

# Training data: emails labeled as spam or ham
training_emails = [
    # Spam examples
    ("FREE FREE FREE! You have won $1000 lottery prize! Claim now!", "spam"),
    ("Congratulations! You've been selected for a free vacation", "spam"),
    ("URGENT: Your account will be suspended unless you verify", "spam"),
    ("Make money fast! Work from home and earn $5000 daily", "spam"),
    ("Limited time offer! Get 90% discount on luxury watches now", "spam"),
    ("You are a winner! Click here to claim your prize money", "spam"),
    # Ham examples
    ("Meeting scheduled for tomorrow at 3pm in conference room B", "ham"),
    ("Please review the attached quarterly report before Friday", "ham"),
    ("Can we reschedule our lunch meeting to next week?", "ham"),
    ("The project deadline has been extended to December 15th", "ham"),
    ("Your order has been shipped and will arrive on Tuesday", "ham"),
    ("Here are the meeting notes from today's discussion", "ham"),
]

# Separate features and labels
docs = [email for email, _ in training_emails]
labels = [label for _, label in training_emails]

# Train the model
model = MultinomialNaiveBayes(alpha=1.0)
model.fit(docs, labels)

# Print model statistics
print("=" * 60)
print("MULTINOMIAL NAIVE BAYES MODEL SUMMARY")
print("=" * 60)
print(f"\nVocabulary size: {model.vocab_size_} words")
print(f"Classes: {model.classes_}")

print("\nClass Priors:")
for c in model.classes_:
    print(f"  P({c}) = exp({model.log_priors_[c]:.4f}) = {math.exp(model.log_priors_[c]):.4f}")

print("\nTop 5 Words per Class:")
for c in model.classes_:
    top_words = model.get_feature_importance(c, top_n=5)
    print(f"\n  {c.upper()}:")
    for word, prob in top_words:
        print(f"    {word:15s} P(w|{c}) = {prob:.4f}")

# Test on new emails
print("\n" + "=" * 60)
print("CLASSIFICATION RESULTS")
print("=" * 60)

test_emails = [
    "Claim your free prize now limited time offer winner",
    "The meeting has been rescheduled to Monday afternoon",
    "URGENT winner notification verify your account immediately",
    "Please find attached the updated project schedule"
]

for email in test_emails:
    prediction = model.predict(email)
    probs = model.predict_proba(email)
    print(f"\nEmail: '{email[:50]}...'")
    print(f"  Prediction: {prediction.upper()}")
    print(f"  P(spam|email) = {probs['spam']:.4f}")
    print(f"  P(ham|email) = {probs['ham']:.4f}")
```

A crucial insight is that Multinomial Naive Bayes can be rewritten in log-linear form, revealing its connection to logistic regression and other linear classifiers.
Log-Linear Representation:
The classification score for class $c$ is:
$$\text{score}(c, d) = \log P(c) + \sum_{w \in V} \text{count}(w, d) \cdot \log P(w | c)$$
This can be written as a dot product:
$$\text{score}(c, d) = b_c + \mathbf{w}_c^T \mathbf{x}$$
where:
- $b_c = \log P(c)$ is the class bias term,
- $\mathbf{w}_c$ is the weight vector with components $w_{c,i} = \log P(w_i | c)$,
- $\mathbf{x}$ is the vector of word counts for the document.
Interpretation of Weights:
The weight $w_{c,i} = \log P(w_i | c)$ captures how much the presence of word $w_i$ supports class $c$. But absolute values are less informative than relative weights (log-odds ratios):
Log-Odds Ratio Analysis:
For binary classification (e.g., spam vs. ham), the most informative quantity is the log-odds ratio:
$$\text{LOR}(w) = \log \frac{P(w | \text{spam})}{P(w | \text{ham})}$$
The magnitude of LOR indicates strength of evidence: $|\text{LOR}(w)| = 2$ means the word is $e^2 \approx 7.4$ times more likely in one class than the other.
```python
import numpy as np
import math
from typing import List, Tuple, Dict


def analyze_log_odds(model: 'MultinomialNaiveBayes',
                     class_pos: str = 'spam',
                     class_neg: str = 'ham') -> Dict[str, float]:
    """
    Compute log-odds ratios for all words.

    LOR(w) = log[P(w|pos) / P(w|neg)]

    Returns dictionary mapping words to their LOR.
    """
    log_odds = {}
    for word in model.vocabulary_:
        log_p_pos = model.log_likelihoods_[class_pos][word]
        log_p_neg = model.log_likelihoods_[class_neg][word]
        log_odds[word] = log_p_pos - log_p_neg
    return log_odds


def get_most_discriminative_words(log_odds: Dict[str, float],
                                  n: int = 10) -> Tuple[List, List]:
    """
    Get most discriminative words for each class.

    Returns:
        (words_for_spam, words_for_ham) - each is a list of (word, LOR)
    """
    sorted_words = sorted(log_odds.items(), key=lambda x: x[1], reverse=True)
    spam_words = sorted_words[:n]        # Highest LOR = most spam-indicative
    ham_words = sorted_words[-n:][::-1]  # Lowest LOR = most ham-indicative
    return spam_words, ham_words


def explain_prediction(model: 'MultinomialNaiveBayes',
                       document: str,
                       class_pos: str = 'spam',
                       class_neg: str = 'ham') -> None:
    """
    Explain a classification decision by showing word contributions.

    For each word, shows how much it pushes toward spam vs ham.
    """
    from collections import Counter

    tokens = document.lower().split()
    word_counts = Counter(tokens)
    log_odds = analyze_log_odds(model, class_pos, class_neg)

    print(f"\nAnalyzing: '{document[:60]}...'")
    print("-" * 60)

    # Prior contribution
    prior_contribution = model.log_priors_[class_pos] - model.log_priors_[class_neg]
    print(f"Prior (log-odds): {prior_contribution:+.4f}")

    # Word contributions
    contributions = []
    for word, count in word_counts.items():
        if word in log_odds:
            contribution = count * log_odds[word]
            contributions.append((word, count, log_odds[word], contribution))

    # Sort by absolute contribution
    contributions.sort(key=lambda x: abs(x[3]), reverse=True)

    print("\nWord contributions (sorted by impact):")
    print(f"{'Word':15s} {'Count':>6s} {'LOR':>8s} {'Impact':>10s}")
    print("-" * 45)

    total_word_contribution = 0
    for word, count, lor, impact in contributions[:10]:
        direction = "→spam" if impact > 0 else "→ham"
        print(f"{word:15s} {count:>6d} {lor:>+8.3f} {impact:>+10.3f} {direction}")
        total_word_contribution += impact

    if len(contributions) > 10:
        remaining = sum(c[3] for c in contributions[10:])
        print(f"{'(other words)':15s} {'':>6s} {'':>8s} {remaining:>+10.3f}")
        total_word_contribution += remaining

    total_log_odds = prior_contribution + total_word_contribution
    print("-" * 45)
    print(f"{'TOTAL':15s} {'':>6s} {'':>8s} {total_log_odds:>+10.3f}")

    # Convert to probability
    prob_spam = 1 / (1 + math.exp(-total_log_odds))
    print(f"\nP(spam|document) = {prob_spam:.4f}")
    print(f"Prediction: {'SPAM' if total_log_odds > 0 else 'HAM'}")


# Example analysis (continuing from previous code)
# Assuming 'model' is already trained...

print("=" * 60)
print("LOG-ODDS RATIO ANALYSIS")
print("=" * 60)

# Analyze word discriminability
log_odds = analyze_log_odds(model, 'spam', 'ham')
spam_words, ham_words = get_most_discriminative_words(log_odds, n=5)

print("\nMost Spam-Indicative Words:")
print(f"{'Word':15s} {'LOR':>10s} {'Odds Ratio':>12s}")
print("-" * 40)
for word, lor in spam_words:
    print(f"{word:15s} {lor:>+10.3f} {math.exp(lor):>12.2f}x")

print("\nMost Ham-Indicative Words:")
print(f"{'Word':15s} {'LOR':>10s} {'Odds Ratio':>12s}")
print("-" * 40)
for word, lor in ham_words:
    print(f"{word:15s} {lor:>+10.3f} {math.exp(lor):>12.2f}x")

# Explain specific predictions
print("\n" + "=" * 60)
print("PREDICTION EXPLANATIONS")
print("=" * 60)

explain_prediction(model, "Free prize winner claim your lottery money now")
explain_prediction(model, "Meeting tomorrow to discuss the project deadline")
```

Unlike deep learning models that operate as black boxes, Multinomial Naive Bayes is fully interpretable. You can examine $P(w|c)$ for every word, compute log-odds ratios, and trace exactly why the classifier made any particular decision. This interpretability is invaluable for debugging, explaining predictions to stakeholders, and identifying dataset biases.
One of Multinomial Naive Bayes's most compelling advantages is its computational efficiency. Both training and prediction are remarkably fast, scaling linearly with data size.
Training Complexity:
Let:
- $N$ = number of training documents
- $L$ = average document length (in tokens)
- $|V|$ = vocabulary size
- $|C|$ = number of classes
Training requires:
- One pass over all tokens to build the vocabulary and accumulate word-class counts: $O(N \cdot L)$
- Computing log-priors from the class document counts: $O(|C|)$
- Computing a smoothed log-likelihood for every (class, word) pair: $O(|C| \cdot |V|)$
Total training complexity: $O(N \cdot L + |C| \cdot |V|)$
Since $|C| \cdot |V|$ is typically much smaller than $N \cdot L$ (vocabulary grows sublinearly with corpus size), training is effectively $O(N \cdot L)$ — linear in the total number of word tokens in the training data.
Prediction Complexity:
For a document of length $L_d$:
- Tokenizing the document and counting its words: $O(L_d)$
- For each of the $|C|$ classes, summing $x_i \log P(w_i | c)$ over the document's words: $O(|C| \cdot L_d)$
Total prediction complexity: $O(|C| \cdot L_d)$ — linear in document length times number of classes.
| Algorithm | Training | Prediction | Memory |
|---|---|---|---|
| Multinomial NB | $O(N \cdot L)$ | $O(|C| \cdot L_d)$ | $O(|C| \cdot |V|)$ |
| Logistic Regression | $O(N \cdot L \cdot |V| \cdot I)$* | $O(|C| \cdot |V|)$ | $O(|C| \cdot |V|)$ |
| SVM (linear) | $O(N \cdot L \cdot |V|)$ | $O(|C| \cdot |V|)$ | $O(|C| \cdot |V|)$ |
| Random Forest | $O(N \cdot L \cdot |V| \cdot T \cdot D)$** | $O(T \cdot D)$ | $O(T \cdot 2^D)$ |
| Neural Network | $O(N \cdot L \cdot |V| \cdot H \cdot E)$*** | $O(|V| \cdot H)$ | $O(|V| \cdot H + H^2)$ |
*$I$ = iterations, **$T$ = trees, $D$ = depth, ***$H$ = hidden units, $E$ = epochs
Memory Efficiency:
Multinomial NB stores only:
- The log-priors: $|C|$ values
- The log-likelihood table: $|C| \cdot |V|$ values
- The vocabulary itself, for mapping words to table entries
For $|C| = 2$ and $|V| = 50,000$, this is only 100,000 floating-point numbers, or roughly 400 KB as 32-bit floats. Compare this to neural networks that may require hundreds of megabytes.
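A quick way to confirm that figure, assuming NumPy is available:

```python
import numpy as np

# |C| x |V| table of log-probabilities stored as 32-bit floats
n_classes, vocab_size = 2, 50_000
table = np.zeros((n_classes, vocab_size), dtype=np.float32)
print(table.nbytes)  # 400_000 bytes, i.e. about 400 KB (float64 would double this)
```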
Incremental Learning:
A remarkable property of Multinomial NB is its support for incremental updates. Since the model is based entirely on word counts, new documents can be incorporated without retraining from scratch:
- Increment the document count $N_c$ for the new document's class
- Add the document's word counts to $\text{count}(w, c)$ and to the class's total word count
- Recompute the affected log-priors and log-likelihoods, which is cheap because the counts are already aggregated
This makes Multinomial NB ideal for streaming applications where models must be updated continuously.
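One possible way to structure such updates is sketched below; the class and method names are illustrative rather than part of the implementation above. The same idea is exposed by scikit-learn's `MultinomialNB.partial_fit`.

```python
from collections import Counter, defaultdict
import math

class StreamingNBCounts:
    """Sketch of incremental Naive Bayes updates: store raw counts,
    derive smoothed log-probabilities on demand."""

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha
        self.doc_counts = Counter()              # N_c per class
        self.word_counts = defaultdict(Counter)  # count(w, c)
        self.vocabulary = set()

    def update(self, document: str, label: str) -> None:
        """Fold one new labeled document into the counts (no full retraining)."""
        tokens = document.lower().split()
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokens)
        self.vocabulary.update(tokens)

    def log_likelihood(self, word: str, label: str) -> float:
        """Laplace-smoothed log P(word | label) from the current counts."""
        total = sum(self.word_counts[label].values())
        denom = total + self.alpha * len(self.vocabulary)
        return math.log((self.word_counts[label][word] + self.alpha) / denom)

# Usage: counts can be folded in one document at a time as data streams in
nb = StreamingNBCounts()
nb.update("free prize winner", "spam")
nb.update("meeting notes attached", "ham")
print(nb.log_likelihood("free", "spam"))
```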
In practice, Multinomial NB can train on millions of documents in seconds. The 20 Newsgroups dataset (18,846 documents, ~100K vocabulary) trains in under 1 second on commodity hardware. Compare this to BERT fine-tuning, which requires hours on GPU hardware for similar-sized datasets.
A subtle but important issue with Multinomial Naive Bayes is its sensitivity to document length. The log-likelihood of a document grows (or shrinks) with its length, which can bias predictions in unexpected ways.
The Length Problem:
Consider two documents with identical class-conditional word distributions:
- Document A: a short message of a few dozen words
- Document B: a much longer message with the same proportions of each word
Even if both documents have the same proportion of spam-indicative words, Document B will have more extreme log-likelihoods simply because it has more words. This can cause problems when classifying documents of varying lengths.
Mathematical Analysis:
For a document $d$ of length $n$ from class $c$:
$$\log P(d | c) = \sum_{w} \text{count}(w, d) \cdot \log P(w | c)$$
Since $\sum_w \text{count}(w, d) = n$, and most words have $P(w|c) < 1$ (so $\log P(w|c) < 0$), longer documents have more negative log-likelihoods.
Normalization Strategies:
Common remedies rescale the count vector before scoring: length normalization (divide each count by the document length), L2 normalization (divide by the Euclidean norm of the count vector), and sublinear term frequency (replace each count $x$ with $1 + \log x$). The code below compares raw and length-normalized scoring.
```python
import numpy as np
from collections import Counter
import math


def demonstrate_length_effect():
    """
    Show how document length affects Multinomial NB predictions.
    """
    # Simplified vocabulary and probabilities
    vocab = ['free', 'money', 'meeting', 'report']

    # Log probabilities (simplified)
    log_p_spam = {'free': -2.0, 'money': -2.5, 'meeting': -5.0, 'report': -5.0}
    log_p_ham = {'free': -5.0, 'money': -5.0, 'meeting': -2.0, 'report': -2.5}

    # Two documents with same PROPORTIONS but different lengths
    short_doc = {'free': 1, 'money': 1, 'meeting': 1, 'report': 1}     # 4 words
    long_doc = {'free': 10, 'money': 10, 'meeting': 10, 'report': 10}  # 40 words

    def compute_log_likelihood(doc, log_probs):
        return sum(count * log_probs[word] for word, count in doc.items())

    # Compute likelihoods
    short_spam = compute_log_likelihood(short_doc, log_p_spam)
    short_ham = compute_log_likelihood(short_doc, log_p_ham)
    long_spam = compute_log_likelihood(long_doc, log_p_spam)
    long_ham = compute_log_likelihood(long_doc, log_p_ham)

    print("Without normalization:")
    print(f"  Short doc: spam={short_spam:.2f}, ham={short_ham:.2f}, diff={short_spam-short_ham:+.2f}")
    print(f"  Long doc:  spam={long_spam:.2f}, ham={long_ham:.2f}, diff={long_spam-long_ham:+.2f}")

    # With length normalization
    short_len = sum(short_doc.values())
    long_len = sum(long_doc.values())

    print("\nWith length normalization:")
    print(f"  Short doc: spam={short_spam/short_len:.2f}, ham={short_ham/short_len:.2f}")
    print(f"  Long doc:  spam={long_spam/long_len:.2f}, ham={long_ham/long_len:.2f}")
    print("  (Now the per-word scores are identical!)")


class LengthNormalizedMultinomialNB(MultinomialNaiveBayes):
    """
    Multinomial Naive Bayes with length normalization.

    Subclasses the from-scratch MultinomialNaiveBayes defined earlier so that
    fitting (vocabulary, priors, smoothed likelihoods) is inherited; only the
    scoring of a new document changes.
    """

    def __init__(self, alpha: float = 1.0, normalize: str = 'none'):
        """
        Args:
            alpha: Smoothing parameter
            normalize: 'none', 'length', 'l2', or 'sublinear'
        """
        super().__init__(alpha=alpha)
        self.normalize = normalize

    def _normalize_counts(self, word_counts: dict) -> dict:
        """Apply normalization to word counts."""
        if self.normalize == 'none':
            return word_counts
        elif self.normalize == 'length':
            # Divide by document length (TF normalization)
            total = sum(word_counts.values())
            if total == 0:
                return word_counts
            return {w: c / total for w, c in word_counts.items()}
        elif self.normalize == 'l2':
            # L2 normalization
            norm = math.sqrt(sum(c ** 2 for c in word_counts.values()))
            if norm == 0:
                return word_counts
            return {w: c / norm for w, c in word_counts.items()}
        elif self.normalize == 'sublinear':
            # Sublinear TF: 1 + log(count) for count > 0
            return {w: (1 + math.log(c)) if c > 0 else 0
                    for w, c in word_counts.items()}
        return word_counts

    def predict_log_proba(self, document: str) -> dict:
        """Predict with normalization applied to the document's counts."""
        tokens = document.lower().split()
        word_counts = Counter(tokens)

        # Apply normalization
        normalized_counts = self._normalize_counts(word_counts)

        log_probs = {}
        for c in self.classes_:
            log_prob = self.log_priors_[c]
            for word, count in normalized_counts.items():
                # Unknown words are skipped here for brevity
                if word in self.log_likelihoods_[c]:
                    log_prob += count * self.log_likelihoods_[c][word]
            log_probs[c] = log_prob
        return log_probs


# Demonstrate the effect
demonstrate_length_effect()
```

Length normalization is most important when: (1) training and test documents have very different lengths, (2) the classification task involves comparing documents of varying sizes, or (3) longer documents tend to belong to specific classes (creating a spurious correlation). If your training and test documents have similar length distributions, normalization may have little effect.
Deploying Multinomial Naive Bayes in production requires attention to several practical issues beyond the core algorithm.
Multinomial NB is ideal when: (1) word frequency matters (longer documents with more mentions = stronger evidence), (2) documents are reasonably long (100+ words), (3) vocabulary is large (1000+ words), (4) training data is limited (< 10,000 examples), (5) interpretability is important, or (6) fast training/prediction is required. For very short texts (tweets, queries), consider Bernoulli NB instead.
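For production use, a common route is scikit-learn's built-in implementation rather than the from-scratch class above. A minimal sketch, assuming scikit-learn is installed (the documents and labels below are placeholders):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder training data
docs = ["free prize winner claim now", "meeting notes attached",
        "claim your free money", "quarterly report review friday"]
labels = ["spam", "ham", "spam", "ham"]

clf = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
clf.fit(docs, labels)

print(clf.predict(["free winner claim now"]))          # expected to lean toward 'spam'
print(clf.classes_)                                    # column order for predict_proba
print(clf.predict_proba(["please review the report"]))
```

Here `CountVectorizer` produces the bag-of-words counts the model expects; swapping in `TfidfVectorizer(sublinear_tf=True)` is one way to apply the length-related normalizations discussed earlier.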
We have conducted a thorough examination of Multinomial Naive Bayes, from its probabilistic foundations to practical implementation details.
What's Next:
In the next page, we explore Bernoulli Naive Bayes—an alternative model that treats documents as binary vectors indicating word presence or absence. This model often outperforms Multinomial NB on short texts and provides complementary insights into document classification.
You now have a deep understanding of Multinomial Naive Bayes—its mathematical foundations, implementation, and practical considerations. You can implement the algorithm from scratch, analyze its decisions through log-odds ratios, and deploy it effectively for text classification tasks.