Consider a different way of thinking about documents: instead of counting how many times each word appears, we simply ask "Does this word appear at all?" A document becomes a checklist—for each word in the vocabulary, we check "yes" or "no."
This binary perspective forms the foundation of Bernoulli Naive Bayes, an alternative to the multinomial model that often excels on short texts. While the multinomial model asks "How many times does each word appear?", the Bernoulli model asks "Which words are present and which are absent?"
The critical distinction is that Bernoulli NB explicitly models word absence. If the word "free" doesn't appear in an email, the multinomial model simply ignores this fact. But the Bernoulli model treats absence as evidence—if spam emails typically contain "free," then not seeing "free" is weak evidence against spam.
By the end of this page, you will understand how Bernoulli NB differs fundamentally from Multinomial NB, derive the complete Bernoulli model from probabilistic principles, implement the algorithm with explicit handling of word absence, analyze when Bernoulli NB outperforms Multinomial NB, and apply the model effectively to short text classification tasks.
The Bernoulli document model represents each document as a binary vector indicating which vocabulary words are present.
Formal Definition:
Given a fixed vocabulary $V = \{w_1, w_2, \ldots, w_{|V|}\}$, a document $d$ is represented as a binary vector $\mathbf{x} = (x_1, x_2, \ldots, x_{|V|})$ where:
$$x_i = \begin{cases} 1 & \text{if word } w_i \text{ appears in document } d \\ 0 & \text{otherwise} \end{cases}$$
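For example, here is a minimal sketch (with a made-up five-word vocabulary) of how the same document yields a count vector for the multinomial model and a binary presence vector for the Bernoulli model:

```python
# Toy illustration (hypothetical vocabulary) of count vs. binary features.
from collections import Counter

vocabulary = ["free", "prize", "meeting", "report", "win"]  # assumed toy vocabulary
document = "free free prize win"

counts = Counter(document.lower().split())

count_vector = [counts[w] for w in vocabulary]                 # Multinomial features
binary_vector = [1 if w in counts else 0 for w in vocabulary]  # Bernoulli features

print(count_vector)   # [2, 1, 0, 0, 1]
print(binary_vector)  # [1, 1, 0, 0, 1]
```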
The Generative Story:
The Bernoulli model imagines documents being generated by the following process:

1. Choose a class $c$ with probability $P(c)$.
2. For each vocabulary word $w_i$, $i = 1, \ldots, |V|$, independently flip a biased coin: include $w_i$ in the document with probability $P(w_i = 1 | c)$, and leave it out otherwise.
Note the key difference from the multinomial model: every vocabulary word has a "vote," whether present or absent. The document is generated by $|V|$ independent binary decisions, not by sampling a sequence of words.
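A small simulation makes this story concrete; the class prior and per-word probabilities below are made-up values for illustration only:

```python
# Sketch of the Bernoulli generative story with assumed (made-up) parameters:
# pick a class, then flip one biased coin per vocabulary word.
import random

random.seed(0)

class_priors = {"spam": 0.3, "ham": 0.7}                  # assumed P(c)
p_present = {                                             # assumed P(w=1|c)
    "spam": {"free": 0.8, "prize": 0.7, "meeting": 0.1},
    "ham":  {"free": 0.05, "prize": 0.02, "meeting": 0.7},
}

def generate_document():
    c = random.choices(list(class_priors), weights=class_priors.values())[0]
    # One independent Bernoulli draw per vocabulary word
    x = {w: int(random.random() < p) for w, p in p_present[c].items()}
    return c, x

for _ in range(3):
    print(generate_document())
```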
| Aspect | Multinomial Model | Bernoulli Model |
|---|---|---|
| Feature representation | Word counts: $\mathbf{x} \in \mathbb{N}^{|V|}$ | Binary presence: $\mathbf{x} \in \{0,1\}^{|V|}$ |
| What's modeled | Word frequencies in documents | Word presence/absence patterns |
| Document length | Implicitly encoded in feature sum | Not directly captured |
| Treats absence as evidence? | No—absent words contribute nothing | Yes—absent words affect likelihood |
| Number of features used | Only non-zero (present words) | All vocabulary words |
| Underlying distribution | Multinomial over vocabulary | Product of Bernoullis |
| Best for | Longer documents, frequency matters | Short texts, binary patterns |
Consider spam detection: spam emails often contain 'winner', 'prize', 'free'. If an email doesn't contain these words, Multinomial NB simply doesn't add their contribution. But Bernoulli NB actively considers: 'winner is absent (evidence against spam), prize is absent (evidence against spam), free is absent (evidence against spam).' This cumulative absence signal can be powerful for short messages where word frequency is less informative.
We derive the complete Bernoulli Naive Bayes classifier, highlighting how it differs from the multinomial derivation.
Step 1: Apply Bayes' Theorem
As before, for classification:
$$\hat{c} = \arg\max_{c} P(c) P(\mathbf{x} | c)$$
Step 2: Model Using Product of Bernoullis
Under the naive Bayes assumption, features are conditionally independent:
$$P(\mathbf{x} | c) = \prod_{i=1}^{|V|} P(x_i | c)$$
For binary features with Bernoulli distribution:
$$P(x_i | c) = \begin{cases} P(w_i = 1 | c) & \text{if } x_i = 1 \\ 1 - P(w_i = 1 | c) & \text{if } x_i = 0 \end{cases}$$
This can be written compactly as:
$$P(x_i | c) = P(w_i = 1 | c)^{x_i} \cdot (1 - P(w_i = 1 | c))^{1 - x_i}$$
So the full likelihood is:
$$P(\mathbf{x} | c) = \prod_{i=1}^{|V|} P(w_i = 1 | c)^{x_i} (1 - P(w_i = 1 | c))^{1 - x_i}$$
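As a quick numeric check of this product form, assume a three-word vocabulary with made-up probabilities for one class and a document in which only "free" is present:

```python
# Minimal numeric check of the product-of-Bernoullis likelihood,
# using assumed per-word probabilities for a single class.
p = {"free": 0.8, "prize": 0.7, "meeting": 0.1}   # assumed P(w=1|c)
x = {"free": 1, "prize": 0, "meeting": 0}          # only "free" is present

likelihood = 1.0
for w, pw in p.items():
    likelihood *= pw if x[w] == 1 else (1 - pw)

print(likelihood)  # 0.8 * 0.3 * 0.9 = 0.216
```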
Step 3: Convert to Log-Space
Taking logarithms:
$$\log P(\mathbf{x} | c) = \sum_{i=1}^{|V|} \left[ x_i \log P(w_i = 1 | c) + (1 - x_i) \log(1 - P(w_i = 1 | c)) \right]$$
Let's define $p_i^c = P(w_i = 1 | c)$, the probability that a document of class $c$ contains word $w_i$, and $q_i^c = 1 - p_i^c$, the probability that it does not. Then:
$$\log P(\mathbf{x} | c) = \sum_{i=1}^{|V|} \left[ x_i \log p_i^c + (1 - x_i) \log q_i^c \right]$$
Step 4: Separate Present and Absent Words
We can split this sum:
$$\log P(\mathbf{x} | c) = \sum_{i: x_i = 1} \log p_i^c + \sum_{i: x_i = 0} \log q_i^c$$
Or equivalently, rewriting the second sum:
$$\log P(\mathbf{x} | c) = \sum_{i: x_i = 1} \log \frac{p_i^c}{q_i^c} + \sum_{i=1}^{|V|} \log q_i^c$$
The second term $\sum_{i=1}^{|V|} \log q_i^c$ is a constant for each class (depends only on class, not document). It can be absorbed into the prior:
$$\log P(c | \mathbf{x}) \propto \log P(c) + \sum_{i=1}^{|V|} \log q_i^c + \sum_{i: x_i = 1} \log \frac{p_i^c}{q_i^c}$$
The term $\sum_{i=1}^{|V|} \log q_i^c$ captures the 'baseline' likelihood of generating any document from class $c$, assuming all words are absent. Each present word then adds a log-odds adjustment $\log(p_i^c/q_i^c)$. This structure reveals why Bernoulli NB models absence: the baseline already accounts for all words being absent, and present words perturb this baseline.
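This rearrangement is easy to verify numerically; the short sketch below (with assumed per-word probabilities) computes the log-likelihood both ways and gets the same value:

```python
# Verify that "baseline + log-odds of present words" equals the explicit sum.
import math

p = {"free": 0.8, "prize": 0.7, "meeting": 0.1}   # assumed P(w=1|c)
present = {"free"}                                 # words present in the document

# Explicit form: one term per vocabulary word, present or absent
explicit = sum(
    math.log(p[w]) if w in present else math.log(1 - p[w]) for w in p
)

# Rearranged form: per-class baseline plus log-odds of present words only
baseline = sum(math.log(1 - p[w]) for w in p)
rearranged = baseline + sum(math.log(p[w] / (1 - p[w])) for w in present)

print(explicit, rearranged)  # identical up to floating-point error
```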
```python
import numpy as np
from collections import Counter, defaultdict
from typing import List, Dict, Tuple, Set
import math


class BernoulliNaiveBayes:
    """
    Complete implementation of Bernoulli Naive Bayes from scratch.

    This model differs from Multinomial NB in two key ways:
    1. Features are binary (word present/absent), not counts
    2. Absent words contribute to the likelihood (not ignored)
    """

    def __init__(self, alpha: float = 1.0, binarize: bool = True):
        """
        Initialize the classifier.

        Args:
            alpha: Laplace smoothing parameter for Beta-Bernoulli conjugacy
                   P(w=1|c) = (n_wc + α) / (N_c + 2α)
            binarize: If True, convert counts to binary (0/1)
        """
        self.alpha = alpha
        self.binarize = binarize

        # Model parameters
        self.classes_: List[str] = []
        self.log_priors_: Dict[str, float] = {}  # log P(c)
        self.feature_log_probs_: Dict[str, Dict[str, Tuple[float, float]]] = {}
        # feature_log_probs_[c][w] = (log P(w=1|c), log P(w=0|c))
        self.vocabulary_: Set[str] = set()

        # For the constant term (all-absent baseline)
        self.log_absence_baselines_: Dict[str, float] = {}

    def _tokenize(self, document: str) -> Set[str]:
        """Tokenize and return unique words (binary representation)."""
        tokens = document.lower().split()
        if self.binarize:
            return set(tokens)  # Unique words only
        return set(tokens)

    def fit(self, documents: List[str], labels: List[str]) -> 'BernoulliNaiveBayes':
        """
        Fit the Bernoulli NB model.

        For each word w and class c, we estimate:
            P(w=1|c) = (count of docs in c containing w + α) / (count of docs in c + 2α)

        The 2α in the denominator comes from the Beta(α, α) prior
        on the Bernoulli parameter.
        """
        n_docs = len(documents)

        # Step 1: Build vocabulary from all documents
        tokenized_docs = [self._tokenize(doc) for doc in documents]
        for tokens in tokenized_docs:
            self.vocabulary_.update(tokens)

        # Step 2: Compute class priors
        self.classes_ = sorted(set(labels))
        class_doc_counts = Counter(labels)
        self.log_priors_ = {
            c: math.log(class_doc_counts[c] / n_docs) for c in self.classes_
        }

        # Step 3: Count documents containing each word, per class
        # doc_counts[c][w] = number of documents in class c containing word w
        doc_counts = {c: defaultdict(int) for c in self.classes_}
        for tokens, label in zip(tokenized_docs, labels):
            for token in tokens:  # tokens is already a set (unique words)
                doc_counts[label][token] += 1

        # Step 4: Compute feature log probabilities with smoothing
        # P(w=1|c) = (doc_count(w,c) + α) / (N_c + 2α)
        self.feature_log_probs_ = {}
        self.log_absence_baselines_ = {}

        for c in self.classes_:
            N_c = class_doc_counts[c]
            denominator = N_c + 2 * self.alpha
            class_probs = {}
            log_absence_sum = 0.0

            for word in self.vocabulary_:
                # Laplace-smoothed probability
                p_present = (doc_counts[c].get(word, 0) + self.alpha) / denominator
                p_absent = 1 - p_present

                # Store log probabilities for both states
                class_probs[word] = (
                    math.log(p_present),  # log P(w=1|c)
                    math.log(p_absent)    # log P(w=0|c)
                )

                # Accumulate absence baseline
                log_absence_sum += math.log(p_absent)

            self.feature_log_probs_[c] = class_probs
            self.log_absence_baselines_[c] = log_absence_sum

        return self

    def predict_log_proba(self, document: str) -> Dict[str, float]:
        """
        Compute log-probabilities for each class.

        Uses the efficient formulation:
            log P(x|c) = baseline_c + Σ_{w present} log(p_w / q_w)
        where baseline_c = Σ_{all w} log(q_w)
        """
        present_words = self._tokenize(document)
        log_probs = {}

        for c in self.classes_:
            # Start with prior + absence baseline
            log_prob = self.log_priors_[c] + self.log_absence_baselines_[c]

            # Add log-odds for each present word
            for word in present_words:
                if word in self.feature_log_probs_[c]:
                    log_p_present, log_p_absent = self.feature_log_probs_[c][word]
                    # log(p/q) = log(p) - log(q)
                    log_odds = log_p_present - log_p_absent
                    log_prob += log_odds
                # Unknown words: contribution is 0 (log-odds of uniform = 0)

            log_probs[c] = log_prob

        return log_probs

    def predict_log_proba_explicit(self, document: str) -> Dict[str, float]:
        """
        Alternative: explicit computation considering ALL vocabulary words.

        This is slower but clearer for understanding.
        Demonstrates that we consider EVERY word, present or not.
        """
        present_words = self._tokenize(document)
        log_probs = {}

        for c in self.classes_:
            log_prob = self.log_priors_[c]

            # Iterate over ENTIRE vocabulary
            for word in self.vocabulary_:
                log_p_present, log_p_absent = self.feature_log_probs_[c][word]
                if word in present_words:
                    log_prob += log_p_present
                else:
                    log_prob += log_p_absent  # This is the key difference!

            log_probs[c] = log_prob

        return log_probs

    def predict(self, document: str) -> str:
        """Predict the most likely class."""
        log_probs = self.predict_log_proba(document)
        return max(log_probs, key=log_probs.get)

    def predict_proba(self, document: str) -> Dict[str, float]:
        """Compute normalized probabilities."""
        log_probs = self.predict_log_proba(document)
        max_log = max(log_probs.values())
        exp_probs = {c: math.exp(lp - max_log) for c, lp in log_probs.items()}
        total = sum(exp_probs.values())
        return {c: p / total for c, p in exp_probs.items()}


# ============================================================
# DEMONSTRATE BERNOULLI VS MULTINOMIAL DIFFERENCE
# ============================================================

print("=" * 60)
print("BERNOULLI NAIVE BAYES: EXPLICIT ABSENCE MODELING")
print("=" * 60)

# Training data with short messages (where Bernoulli often excels)
training_messages = [
    # Spam (short promotional messages)
    ("free prize win now", "spam"),
    ("winner free money", "spam"),
    ("claim prize today", "spam"),
    ("free offer limited", "spam"),
    # Ham (short normal messages)
    ("meeting tomorrow morning", "ham"),
    ("report ready review", "ham"),
    ("lunch today noon", "ham"),
    ("project update status", "ham"),
]

docs = [msg for msg, _ in training_messages]
labels = [label for _, label in training_messages]

# Train Bernoulli NB
model = BernoulliNaiveBayes(alpha=1.0)
model.fit(docs, labels)

print(f"\nVocabulary size: {len(model.vocabulary_)} words")
print(f"Vocabulary: {sorted(model.vocabulary_)}")

# Show parameters for a few interesting words
print("\nFeature Probabilities:")
print(f"{'Word':15s} {'P(w=1|spam)':>12s} {'P(w=1|ham)':>12s} {'Log-Odds':>10s}")
print("-" * 55)

for word in sorted(model.vocabulary_):
    log_p_spam, log_q_spam = model.feature_log_probs_['spam'][word]
    log_p_ham, log_q_ham = model.feature_log_probs_['ham'][word]
    p_spam = math.exp(log_p_spam)
    p_ham = math.exp(log_p_ham)
    log_odds = log_p_spam - log_q_spam - (log_p_ham - log_q_ham)
    print(f"{word:15s} {p_spam:>12.4f} {p_ham:>12.4f} {log_odds:>+10.3f}")

# Demonstrate absence effect
print("\n" + "=" * 60)
print("DEMONSTRATING THE EFFECT OF WORD ABSENCE")
print("=" * 60)

test_cases = [
    "free prize win",      # Contains spam words
    "meeting report",      # Contains ham words
    "hello world test",    # Neutral words (not in vocab)
]

for msg in test_cases:
    log_probs = model.predict_log_proba(msg)
    probs = model.predict_proba(msg)
    print(f"\nMessage: '{msg}'")
    print(f"  Present words: {model._tokenize(msg) & model.vocabulary_}")
    print(f"  Absent words: {model.vocabulary_ - model._tokenize(msg)}")
    print(f"  P(spam) = {probs['spam']:.4f}, P(ham) = {probs['ham']:.4f}")
    print(f"  Prediction: {model.predict(msg).upper()}")
```

The parameter estimation for Bernoulli NB differs subtly but importantly from the multinomial case.
What We're Estimating:
For each word $w$ and class $c$, we estimate $P(w = 1 | c)$ — the probability that a document from class $c$ contains word $w$.
Maximum Likelihood Estimate:
$$\hat{P}(w = 1 | c) = \frac{\text{number of documents in class } c \text{ containing word } w}{\text{number of documents in class } c} = \frac{n_{wc}}{N_c}$$
Laplace-Smoothed Estimate:
To avoid zero probabilities, we add smoothing:
$$\hat{P}(w = 1 | c) = \frac{n_{wc} + \alpha}{N_c + 2\alpha}$$
Note the $2\alpha$ in the denominator (not $|V| \cdot \alpha$ as in the multinomial case). This is because each word is a binary random variable with two outcomes (present/absent), so we add $\alpha$ pseudo-counts for each outcome.
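A quick arithmetic check of the formula, using assumed counts (say 3 of 4 spam documents contain "free"):

```python
# Worked example of the smoothed estimate with assumed counts.
alpha = 1.0
n_wc = 3    # spam documents containing "free" (assumed)
N_c = 4     # total spam documents (assumed)

mle = n_wc / N_c                                   # 0.75, can reach 0 or 1 exactly
smoothed = (n_wc + alpha) / (N_c + 2 * alpha)      # (3+1)/(4+2) ≈ 0.667

# A word never seen in spam still gets a nonzero presence probability:
unseen = (0 + alpha) / (N_c + 2 * alpha)           # 1/6 ≈ 0.167
print(mle, smoothed, unseen)
```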
| Aspect | Multinomial NB | Bernoulli NB |
|---|---|---|
| Count being used | Total occurrences of word in class | Number of documents containing word |
| Denominator | Total words in class | Number of documents in class |
| Smoothing denominator | $+ \alpha |V|$ (one α per vocabulary word) | $+ 2\alpha$ (one α per outcome: present/absent) |
| Interpretation | Fraction of class words that are word $w$ | Probability document contains word $w$ |
| Range | Must sum to 1 over vocabulary | Each P(w|c) is estimated independently and lies in [0, 1] |
The Laplace smoothing for Bernoulli NB corresponds to a Beta(α, α) prior on each word's presence probability. With α = 1, this is the uniform prior Beta(1, 1). The posterior mean estimate becomes (n + α) / (N + 2α), which is the Laplace-smoothed estimate. This Bayesian interpretation explains why the denominator has 2α: we're adding α pseudo-observations for 'present' and α for 'absent'.
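As a small numeric check of this Bayesian reading (same assumed counts as above, plain arithmetic, no libraries needed):

```python
# The Beta(α, α) posterior after observing n_wc "present" out of N_c documents
# is Beta(α + n_wc, α + N_c - n_wc); its mean equals the Laplace-smoothed estimate.
alpha = 1.0
n_wc, N_c = 3, 4                                   # assumed counts

posterior_a = alpha + n_wc                         # "present" observations + prior
posterior_b = alpha + (N_c - n_wc)                 # "absent" observations + prior
posterior_mean = posterior_a / (posterior_a + posterior_b)

laplace = (n_wc + alpha) / (N_c + 2 * alpha)
print(posterior_mean, laplace)                     # both 0.6666...
```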
```python
import math
from collections import defaultdict, Counter


def detailed_bernoulli_estimation(documents, labels, alpha=1.0):
    """
    Step-by-step parameter estimation for Bernoulli NB.

    This function explicitly shows each step of the estimation process.
    """
    # Step 1: Tokenize documents (get unique words per document)
    tokenized = [set(doc.lower().split()) for doc in documents]

    # Step 2: Build vocabulary
    vocabulary = set()
    for tokens in tokenized:
        vocabulary.update(tokens)
    vocabulary = sorted(vocabulary)

    print("STEP 1 & 2: Tokenization and Vocabulary")
    print(f"  Vocabulary ({len(vocabulary)} words): {vocabulary}")

    # Step 3: Count class frequencies
    classes = sorted(set(labels))
    class_counts = Counter(labels)

    print("\nSTEP 3: Class Counts")
    for c, count in class_counts.items():
        print(f"  {c}: {count} documents")

    # Step 4: Count documents containing each word, per class
    doc_word_counts = {c: defaultdict(int) for c in classes}
    for tokens, label in zip(tokenized, labels):
        for word in tokens:
            doc_word_counts[label][word] += 1

    print("\nSTEP 4: Document-Word Counts")
    print(f"{'Word':15s}", end="")
    for c in classes:
        print(f"{c + ' docs':>12s}", end="")
    print()
    print("-" * 40)
    for word in vocabulary:
        print(f"{word:15s}", end="")
        for c in classes:
            print(f"{doc_word_counts[c][word]:>12d}", end="")
        print()

    # Step 5: Compute smoothed probabilities
    print("\nSTEP 5: Smoothed Probability Estimates")
    print("Formula: P(w=1|c) = (n_wc + α) / (N_c + 2α)")
    print(f"Smoothing α = {alpha}")
    print()
    print(f"{'Word':15s}", end="")
    for c in classes:
        print(f"{'P(w|' + c + ')':>12s}", end="")
    print(f"{'Log-Odds':>12s}")
    print("-" * 55)

    feature_probs = {c: {} for c in classes}
    for word in vocabulary:
        print(f"{word:15s}", end="")
        for c in classes:
            N_c = class_counts[c]
            n_wc = doc_word_counts[c][word]
            p_present = (n_wc + alpha) / (N_c + 2 * alpha)
            feature_probs[c][word] = p_present
            print(f"{p_present:>12.4f}", end="")
        # Log-odds ratio for binary classification
        if len(classes) == 2:
            c1, c2 = classes
            p1 = feature_probs[c1][word]
            p2 = feature_probs[c2][word]
            # Log-odds of word being present pointing to c1 vs c2
            log_odds = math.log(p1 / (1 - p1)) - math.log(p2 / (1 - p2))
            print(f"{log_odds:>+12.3f}")
        else:
            print()

    # Step 6: Compute absence baselines
    print("\nSTEP 6: Absence Baselines")
    print("  baseline_c = Σ log(1 - P(w=1|c)) for all words")
    for c in classes:
        baseline = sum(
            math.log(1 - feature_probs[c][word]) for word in vocabulary
        )
        print(f"  {c}: {baseline:.4f}")

    return feature_probs, class_counts


# Example with clear data
training_data = [
    ("free money prize", "spam"),
    ("free winner claim", "spam"),
    ("money free now", "spam"),
    ("meeting report project", "ham"),
    ("status update meeting", "ham"),
    ("report project review", "ham"),
]

docs = [d for d, _ in training_data]
labels = [l for _, l in training_data]

print("=" * 60)
print("DETAILED PARAMETER ESTIMATION")
print("=" * 60)
print(f"\nTraining Data ({len(docs)} documents):")
for doc, label in training_data:
    print(f"  [{label}] '{doc}'")

print("\n" + "=" * 60)
detailed_bernoulli_estimation(docs, labels, alpha=1.0)
```

The most distinctive feature of Bernoulli NB is its explicit modeling of word absence. Let's explore this in depth with concrete examples.
The Mathematical Mechanism:
For a vocabulary word $w$ and class $c$, every document receives a contribution from $w$: $\log P(w = 1 | c)$ if $w$ is present, and $\log(1 - P(w = 1 | c))$ if it is absent.
Consider a word "free" that appears in 80% of spam but only 5% of ham, i.e. $P(\text{free} = 1 | \text{spam}) = 0.80$ and $P(\text{free} = 1 | \text{ham}) = 0.05$:
When "free" is present: the spam-vs-ham log-odds contribution is $\log(0.80 / 0.05) = \log 16 \approx +2.77$, strong evidence for spam.
When "free" is absent: the contribution is $\log(0.20 / 0.95) \approx -1.56$, moderate evidence against spam.
The absence of a class-typical word provides evidence against that class. If spam emails usually contain 'free', then NOT seeing 'free' makes spam less likely. This is intuitive but often overlooked in Multinomial NB, where absence simply contributes nothing. Bernoulli NB explicitly captures this anti-evidence.
```python
import math
from typing import Dict, Set


def analyze_absence_effect(vocabulary: Set[str],
                           feature_probs: Dict[str, Dict[str, float]],
                           classes: tuple = ('spam', 'ham')):
    """
    Analyze how word presence and absence affect classification.

    Shows the log-odds contribution of each word in both states.
    """
    c1, c2 = classes

    print("WORD PRESENCE/ABSENCE ANALYSIS")
    print("=" * 70)
    print(f"Comparing {c1} vs {c2}")
    print()
    print(f"{'Word':15s} {'P(w|'+c1+')':>10s} {'P(w|'+c2+')':>10s} "
          f"{'Present→':>12s} {'Absent→':>12s}")
    print("-" * 70)

    contributions = []
    for word in sorted(vocabulary):
        p1 = feature_probs[c1][word]
        p2 = feature_probs[c2][word]

        # Log-odds when present: log[P(w=1|c1)/P(w=1|c2)]
        present_log_odds = math.log(p1 / p2) if p2 > 0 else float('inf')

        # Log-odds when absent: log[P(w=0|c1)/P(w=0|c2)] = log[(1-p1)/(1-p2)]
        absent_log_odds = math.log((1 - p1) / (1 - p2)) if (1 - p2) > 0 else float('-inf')

        # Direction indicators
        present_dir = c1 if present_log_odds > 0 else c2
        absent_dir = c1 if absent_log_odds > 0 else c2

        print(f"{word:15s} {p1:>10.4f} {p2:>10.4f} "
              f"{present_log_odds:>+10.3f}→{present_dir[:1]} "
              f"{absent_log_odds:>+10.3f}→{absent_dir[:1]}")

        contributions.append({
            'word': word,
            'present_effect': present_log_odds,
            'absent_effect': absent_log_odds,
            'swing': present_log_odds - absent_log_odds  # Total discriminative power
        })

    # Find most discriminative words
    contributions.sort(key=lambda x: abs(x['swing']), reverse=True)

    print("\n" + "=" * 70)
    print("MOST DISCRIMINATIVE WORDS (by total swing)")
    print("-" * 70)
    for item in contributions[:5]:
        print(f"{item['word']:15s}: present→{item['present_effect']:+.3f}, "
              f"absent→{item['absent_effect']:+.3f}, swing={item['swing']:+.3f}")

    return contributions


def classify_with_breakdown(document: str,
                            vocabulary: Set[str],
                            feature_probs: Dict[str, Dict[str, float]],
                            class_priors: Dict[str, float],
                            classes: tuple = ('spam', 'ham')):
    """
    Classify a document with detailed breakdown of contributions.

    Shows exactly how each word (present and absent) affects
    the final decision.
    """
    c1, c2 = classes
    present_words = set(document.lower().split()) & vocabulary
    absent_words = vocabulary - present_words

    print(f"\nDocument: '{document}'")
    print("-" * 60)

    # Prior contribution
    prior_diff = math.log(class_priors[c1] / class_priors[c2])
    print(f"Prior log-odds ({c1}/{c2}): {prior_diff:+.4f}")

    # Present word contributions
    print(f"\nPresent words ({len(present_words)}):")
    present_total = 0
    for word in sorted(present_words):
        p1 = feature_probs[c1][word]
        p2 = feature_probs[c2][word]
        log_odds = math.log(p1 / p2)
        present_total += log_odds
        direction = c1 if log_odds > 0 else c2
        print(f"  {word:15s}: log({p1:.3f}/{p2:.3f}) = {log_odds:+.4f} → {direction}")
    print(f"  {'SUBTOTAL':15s}: {present_total:+.4f}")

    # Absent word contributions
    print(f"\nAbsent words ({len(absent_words)}):")
    absent_total = 0
    absent_details = []
    for word in sorted(absent_words):
        q1 = 1 - feature_probs[c1][word]
        q2 = 1 - feature_probs[c2][word]
        log_odds = math.log(q1 / q2)
        absent_total += log_odds
        absent_details.append((word, log_odds))

    # Show most impactful absent words
    absent_details.sort(key=lambda x: abs(x[1]), reverse=True)
    for word, log_odds in absent_details[:5]:
        direction = c1 if log_odds > 0 else c2
        print(f"  {word:15s}: {log_odds:+.4f} → {direction}")
    if len(absent_details) > 5:
        remaining = sum(lo for _, lo in absent_details[5:])
        print(f"  {'(other absent)':15s}: {remaining:+.4f}")
    print(f"  {'SUBTOTAL':15s}: {absent_total:+.4f}")

    # Total
    total = prior_diff + present_total + absent_total
    print(f"\nTOTAL log-odds: {total:+.4f}")

    # Convert to probability
    prob_c1 = 1 / (1 + math.exp(-total))
    print(f"P({c1}|doc) = {prob_c1:.4f}")
    print(f"Prediction: {c1.upper() if total > 0 else c2.upper()}")


# Example analysis
vocabulary = {'free', 'money', 'prize', 'winner',
              'meeting', 'report', 'project', 'status'}

feature_probs = {
    'spam': {'free': 0.75, 'money': 0.60, 'prize': 0.70, 'winner': 0.65,
             'meeting': 0.10, 'report': 0.08, 'project': 0.12, 'status': 0.05},
    'ham': {'free': 0.05, 'money': 0.15, 'prize': 0.02, 'winner': 0.03,
            'meeting': 0.70, 'report': 0.65, 'project': 0.60, 'status': 0.55}
}

class_priors = {'spam': 0.3, 'ham': 0.7}

# Analyze each word's presence/absence effect
analyze_absence_effect(vocabulary, feature_probs)

# Classify with detailed breakdown
classify_with_breakdown("free prize meeting", vocabulary, feature_probs, class_priors)
classify_with_breakdown("report project status", vocabulary, feature_probs, class_priors)
```

Understanding when to choose Bernoulli NB over Multinomial NB is crucial for practitioners. The choice depends on the nature of your documents and the classification task.
Bernoulli NB tends to excel on short, sparse texts where what matters is which words appear, not how often they appear. The table below summarizes when to prefer each model:
| Characteristic | Choose Multinomial | Choose Bernoulli |
|---|---|---|
| Document length | Long (100+ words) | Short (< 50 words) |
| Word repetition | Meaningful (frequency matters) | Rare or noise |
| Feature count | Many non-zero features | Sparse: few words per doc |
| Classification signal | Frequency-based | Pattern-based |
| Example domains | News articles, research papers, books | Tweets, queries, SMS, tags |
Ask yourself: 'If a word appears 5 times instead of once, does it tell me more about the class?' If yes, use Multinomial. If no (or it's mostly noise), use Bernoulli. For tweets, one 'free' is as spammy as five. For research papers, multiple mentions of 'quantum' more strongly indicate physics.
```python
import math
from collections import Counter
from typing import List, Tuple


# Simplified implementations for comparison
class SimpleMultinomialNB:
    """Minimal Multinomial NB for comparison."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.log_priors = {}
        self.log_word_probs = {}
        self.vocab = set()

    def fit(self, docs: List[str], labels: List[str]):
        tokenized = [doc.lower().split() for doc in docs]
        for tokens in tokenized:
            self.vocab.update(tokens)

        classes = set(labels)
        class_counts = Counter(labels)
        n_docs = len(docs)
        self.log_priors = {c: math.log(class_counts[c] / n_docs) for c in classes}

        # Word counts per class
        word_counts = {c: Counter() for c in classes}
        for tokens, label in zip(tokenized, labels):
            word_counts[label].update(tokens)

        # Log probabilities
        self.log_word_probs = {}
        for c in classes:
            total = sum(word_counts[c].values())
            self.log_word_probs[c] = {
                w: math.log((word_counts[c][w] + self.alpha) /
                            (total + self.alpha * len(self.vocab)))
                for w in self.vocab
            }

    def predict(self, doc: str) -> Tuple[str, float]:
        tokens = doc.lower().split()
        counts = Counter(tokens)
        scores = {}
        for c in self.log_priors:
            score = self.log_priors[c]
            for word, count in counts.items():
                if word in self.log_word_probs[c]:
                    score += count * self.log_word_probs[c][word]
            scores[c] = score
        best = max(scores, key=scores.get)
        return best, scores[best] - scores[min(scores, key=scores.get)]


class SimpleBernoulliNB:
    """Minimal Bernoulli NB for comparison."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.log_priors = {}
        self.log_probs = {}  # (log p, log q) tuples
        self.vocab = set()

    def fit(self, docs: List[str], labels: List[str]):
        tokenized = [set(doc.lower().split()) for doc in docs]
        for tokens in tokenized:
            self.vocab.update(tokens)

        classes = set(labels)
        class_counts = Counter(labels)
        n_docs = len(docs)
        self.log_priors = {c: math.log(class_counts[c] / n_docs) for c in classes}

        # Document counts per class
        doc_word_counts = {c: Counter() for c in classes}
        for tokens, label in zip(tokenized, labels):
            for token in tokens:
                doc_word_counts[label][token] += 1

        # Probability tuples
        self.log_probs = {}
        for c in classes:
            N_c = class_counts[c]
            probs = {}
            for w in self.vocab:
                p = (doc_word_counts[c][w] + self.alpha) / (N_c + 2 * self.alpha)
                probs[w] = (math.log(p), math.log(1 - p))
            self.log_probs[c] = probs

    def predict(self, doc: str) -> Tuple[str, float]:
        present = set(doc.lower().split()) & self.vocab
        scores = {}
        for c in self.log_priors:
            score = self.log_priors[c]
            for w in self.vocab:
                log_p, log_q = self.log_probs[c][w]
                if w in present:
                    score += log_p
                else:
                    score += log_q
            scores[c] = score
        best = max(scores, key=scores.get)
        return best, scores[best] - scores[min(scores, key=scores.get)]


# Comparison experiment
def compare_models(train_docs, train_labels, test_docs, test_labels):
    """Compare Multinomial and Bernoulli NB on the same data."""
    mnb = SimpleMultinomialNB(alpha=1.0)
    bnb = SimpleBernoulliNB(alpha=1.0)
    mnb.fit(train_docs, train_labels)
    bnb.fit(train_docs, train_labels)

    print("\nMODEL COMPARISON RESULTS")
    print("=" * 70)
    print(f"{'Document':40s} {'True':>6s} {'MNB':>6s} {'BNB':>6s}")
    print("-" * 70)

    mnb_correct = 0
    bnb_correct = 0
    for doc, true_label in zip(test_docs, test_labels):
        mnb_pred, _ = mnb.predict(doc)
        bnb_pred, _ = bnb.predict(doc)
        mnb_mark = "✓" if mnb_pred == true_label else "✗"
        bnb_mark = "✓" if bnb_pred == true_label else "✗"
        mnb_correct += (mnb_pred == true_label)
        bnb_correct += (bnb_pred == true_label)
        print(f"'{doc[:38]:38s}' {true_label:>6s} {mnb_pred:>4s}{mnb_mark:>2s} "
              f"{bnb_pred:>4s}{bnb_mark:>2s}")

    print("-" * 70)
    print(f"Accuracy: Multinomial = {mnb_correct}/{len(test_docs)}, "
          f"Bernoulli = {bnb_correct}/{len(test_docs)}")


# Test case: SHORT documents (Bernoulli should do well)
short_train = [
    ("free prize", "spam"),
    ("win money", "spam"),
    ("free offer", "spam"),
    ("meeting today", "ham"),
    ("report ready", "ham"),
    ("project update", "ham"),
]
short_test = [
    ("free money", "spam"),
    ("prize win", "spam"),
    ("meeting report", "ham"),
    ("update today", "ham"),
]

print("\n" + "=" * 70)
print("TEST 1: SHORT DOCUMENTS")
print("=" * 70)
compare_models(
    [d for d, _ in short_train], [l for _, l in short_train],
    [d for d, _ in short_test], [l for _, l in short_test]
)

# Test case: REPEATED words (Multinomial should do well)
repeat_train = [
    ("free free free money money prize", "spam"),
    ("free free winner winner winner claim", "spam"),
    ("meeting meeting discussion report report", "ham"),
    ("project project update status status", "ham"),
]
repeat_test = [
    ("free free free free", "spam"),     # Heavy repetition
    ("meeting meeting meeting", "ham"),
]

print("\n" + "=" * 70)
print("TEST 2: DOCUMENTS WITH REPEATED WORDS")
print("=" * 70)
compare_models(
    [d for d, _ in repeat_train], [l for _, l in repeat_train],
    [d for d, _ in repeat_test], [l for _, l in repeat_test]
)
```

Bernoulli NB has distinct computational characteristics compared to Multinomial NB, with tradeoffs that affect practical deployment.
Training Complexity:
Both models have similar training complexity: $O(N \cdot L)$, where $N$ is the number of documents and $L$ is the average document length. The key difference is what gets counted: Multinomial NB tallies the total number of occurrences of each word within a class, while Bernoulli NB counts the number of documents in the class that contain each word.
Prediction Complexity:
Here the models diverge significantly. Multinomial NB scores a document in $O(|C| \cdot L_d)$ time, touching only the $L_d$ words that actually appear in the document, while a naive Bernoulli implementation evaluates every vocabulary word for every class, costing $O(|C| \cdot |V|)$ per document.
With vocabulary sizes of 50,000+ words, Bernoulli NB's naive implementation iterates over far more features than Multinomial NB for typical documents (which might have only 100-200 unique words).
Efficient Bernoulli Implementation:
As we derived earlier, Bernoulli NB can be computed efficiently using:
$$\log P(\mathbf{x} | c) = \underbrace{\sum_{i=1}^{|V|} \log(1 - p_i^c)}_{\text{precomputed baseline}} + \sum_{i: x_i = 1} \log \frac{p_i^c}{1 - p_i^c}$$
The baseline is computed once during training, and prediction only iterates over present words: $O(|C| \cdot L_d)$ — same as Multinomial NB!
| Operation | Multinomial NB | Bernoulli NB (Naive) | Bernoulli NB (Optimized) |
|---|---|---|---|
| Training | $O(N \cdot L)$ | $O(N \cdot L)$ | $O(N \cdot L + |C| \cdot |V|)$* |
| Prediction (per doc) | $O(|C| \cdot L_d)$ | $O(|C| \cdot |V|)$ | $O(|C| \cdot L_d)$ |
| Memory | $O(|C| \cdot |V|)$ | $O(|C| \cdot |V|)$ | $O(|C| \cdot |V| + |C|)$** |
*Includes baseline precomputation, **Plus baseline storage (one value per class)
Space Optimization:
Both models store $O(|C| \cdot |V|)$ probability values. For Bernoulli NB, an additional optimization is storing log-odds ratios $\log(p/(1-p))$ directly, eliminating the need to store both $\log p$ and $\log(1-p)$.
Naive implementations of Bernoulli NB that iterate over the entire vocabulary at prediction time can be 100x slower than Multinomial NB for typical documents. Always use the optimized formulation with precomputed baselines. The efficiency gain is dramatic: predicting on a 100-word document with 50,000 vocabulary goes from 50,000 operations to 100 operations per class.
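The sketch below is one way to measure this on synthetic data; it assumes the `BernoulliNaiveBayes` class from the earlier listing is in scope and compares its `predict_log_proba_explicit` and `predict_log_proba` methods. The exact speedup depends on vocabulary size and hardware.

```python
# Rough timing sketch (assumes the BernoulliNaiveBayes class defined above is in scope).
# Compares the explicit all-vocabulary prediction with the optimized
# baseline + present-word formulation on a synthetic corpus.
import random
import string
import time

random.seed(0)
vocab_words = ["".join(random.choices(string.ascii_lowercase, k=6)) for _ in range(5000)]
train_docs = [" ".join(random.sample(vocab_words, 50)) for _ in range(200)]
train_labels = [random.choice(["spam", "ham"]) for _ in range(200)]

model = BernoulliNaiveBayes(alpha=1.0)
model.fit(train_docs, train_labels)

test_doc = " ".join(random.sample(vocab_words, 100))

start = time.perf_counter()
for _ in range(100):
    model.predict_log_proba_explicit(test_doc)   # iterates over all |V| words
explicit_time = time.perf_counter() - start

start = time.perf_counter()
for _ in range(100):
    model.predict_log_proba(test_doc)            # iterates over present words only
optimized_time = time.perf_counter() - start

print(f"explicit: {explicit_time:.3f}s, optimized: {optimized_time:.3f}s")
```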
We have thoroughly examined Bernoulli Naive Bayes, understanding its unique properties and how it complements the multinomial model.
What's Next:
In the next page, we dive deep into Laplace Smoothing—the critical technique that prevents zero probabilities for unseen words. We'll explore its mathematical foundations, variants like Lidstone smoothing, and practical guidance for choosing the smoothing parameter α.
You now understand Bernoulli Naive Bayes in depth—its generative model, explicit handling of word absence, efficient implementation, and situations where it outperforms the more common multinomial variant. You can confidently choose between the two models based on document characteristics and implement either from scratch.