Consider a different way of thinking about documents: instead of counting how many times each word appears, we simply ask "Does this word appear at all?" A document becomes a checklist—for each word in the vocabulary, we check "yes" or "no."
This binary perspective forms the foundation of Bernoulli Naive Bayes, an alternative to the multinomial model that often excels on short texts. While the multinomial model asks "How many times does each word appear?", the Bernoulli model asks "Which words are present and which are absent?"
The critical distinction is that Bernoulli NB explicitly models word absence. If the word "free" doesn't appear in an email, the multinomial model simply ignores this fact. But the Bernoulli model treats absence as evidence—if spam emails typically contain "free," then not seeing "free" is weak evidence against spam.
By the end of this page, you will understand how Bernoulli NB differs fundamentally from Multinomial NB, derive the complete Bernoulli model from probabilistic principles, implement the algorithm with explicit handling of word absence, analyze when Bernoulli NB outperforms Multinomial NB, and apply the model effectively to short text classification tasks.
The Bernoulli document model represents each document as a binary vector indicating which vocabulary words are present.
Formal Definition:
Given a fixed vocabulary $V = \{w_1, w_2, \ldots, w_{|V|}\}$, a document $d$ is represented as a binary vector $\mathbf{x} = (x_1, x_2, \ldots, x_{|V|})$ where:
$$x_i = \begin{cases} 1 & \text{if word } w_i \text{ appears in document } d \\ 0 & \text{otherwise} \end{cases}$$
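For example, here is a minimal sketch (with a made-up five-word vocabulary) of how the same document yields a count vector for the multinomial model and a binary presence vector for the Bernoulli model:

```python
# Toy illustration (hypothetical vocabulary) of count vs. binary features.
from collections import Counter

vocabulary = ["free", "prize", "meeting", "report", "win"]  # assumed toy vocabulary
document = "free free prize win"

counts = Counter(document.lower().split())

count_vector = [counts[w] for w in vocabulary]                 # Multinomial features
binary_vector = [1 if w in counts else 0 for w in vocabulary]  # Bernoulli features

print(count_vector)   # [2, 1, 0, 0, 1]
print(binary_vector)  # [1, 1, 0, 0, 1]
```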
The Generative Story:
The Bernoulli model imagines documents being generated by the following process:

1. Choose a class $c$ with probability $P(c)$.
2. For each vocabulary word $w_i$, $i = 1, \ldots, |V|$, independently flip a biased coin: include $w_i$ in the document with probability $P(w_i = 1 | c)$, and leave it out otherwise.
Note the key difference from the multinomial model: every vocabulary word has a "vote," whether present or absent. The document is generated by $|V|$ independent binary decisions, not by sampling a sequence of words.
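A small simulation makes this story concrete; the class prior and per-word probabilities below are made-up values for illustration only:

```python
# Sketch of the Bernoulli generative story with assumed (made-up) parameters:
# pick a class, then flip one biased coin per vocabulary word.
import random

random.seed(0)

class_priors = {"spam": 0.3, "ham": 0.7}                  # assumed P(c)
p_present = {                                             # assumed P(w=1|c)
    "spam": {"free": 0.8, "prize": 0.7, "meeting": 0.1},
    "ham":  {"free": 0.05, "prize": 0.02, "meeting": 0.7},
}

def generate_document():
    c = random.choices(list(class_priors), weights=class_priors.values())[0]
    # One independent Bernoulli draw per vocabulary word
    x = {w: int(random.random() < p) for w, p in p_present[c].items()}
    return c, x

for _ in range(3):
    print(generate_document())
```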
| Aspect | Multinomial Model | Bernoulli Model |
|---|---|---|
| Feature representation | Word counts: $\mathbf{x} \in \mathbb{N}^{|V|}$ | Binary presence: $\mathbf{x} \in \{0,1\}^{|V|}$ |
| What's modeled | Word frequencies in documents | Word presence/absence patterns |
| Document length | Implicitly encoded in feature sum | Not directly captured |
| Treats absence as evidence? | No—absent words contribute nothing | Yes—absent words affect likelihood |
| Number of features used | Only non-zero (present words) | All vocabulary words |
| Underlying distribution | Multinomial over vocabulary | Product of Bernoullis |
| Best for | Longer documents, frequency matters | Short texts, binary patterns |
Consider spam detection: spam emails often contain 'winner', 'prize', 'free'. If an email doesn't contain these words, Multinomial NB simply doesn't add their contribution. But Bernoulli NB actively considers: 'winner is absent (evidence against spam), prize is absent (evidence against spam), free is absent (evidence against spam).' This cumulative absence signal can be powerful for short messages where word frequency is less informative.
We derive the complete Bernoulli Naive Bayes classifier, highlighting how it differs from the multinomial derivation.
Step 1: Apply Bayes' Theorem
As before, for classification:
$$\hat{c} = \arg\max_{c} P(c) P(\mathbf{x} | c)$$
Step 2: Model Using Product of Bernoullis
Under the naive Bayes assumption, features are conditionally independent:
$$P(\mathbf{x} | c) = \prod_{i=1}^{|V|} P(x_i | c)$$
For binary features with Bernoulli distribution:
$$P(x_i | c) = \begin{cases} P(w_i = 1 | c) & \text{if } x_i = 1 \\ 1 - P(w_i = 1 | c) & \text{if } x_i = 0 \end{cases}$$
This can be written compactly as:
$$P(x_i | c) = P(w_i = 1 | c)^{x_i} \cdot (1 - P(w_i = 1 | c))^{1 - x_i}$$
So the full likelihood is:
$$P(\mathbf{x} | c) = \prod_{i=1}^{|V|} P(w_i = 1 | c)^{x_i} (1 - P(w_i = 1 | c))^{1 - x_i}$$
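As a quick numeric check of this product form, assume a three-word vocabulary with made-up probabilities for one class and a document in which only "free" is present:

```python
# Minimal numeric check of the product-of-Bernoullis likelihood,
# using assumed per-word probabilities for a single class.
p = {"free": 0.8, "prize": 0.7, "meeting": 0.1}   # assumed P(w=1|c)
x = {"free": 1, "prize": 0, "meeting": 0}          # only "free" is present

likelihood = 1.0
for w, pw in p.items():
    likelihood *= pw if x[w] == 1 else (1 - pw)

print(likelihood)  # 0.8 * 0.3 * 0.9 = 0.216
```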
Step 3: Convert to Log-Space
Taking logarithms:
$$\log P(\mathbf{x} | c) = \sum_{i=1}^{|V|} \left[ x_i \log P(w_i = 1 | c) + (1 - x_i) \log(1 - P(w_i = 1 | c)) \right]$$
Let's define $p_i^c = P(w_i = 1 | c)$, the probability that a document of class $c$ contains word $w_i$, and $q_i^c = 1 - p_i^c$, the probability that it does not. Then:
$$\log P(\mathbf{x} | c) = \sum_{i=1}^{|V|} \left[ x_i \log p_i^c + (1 - x_i) \log q_i^c \right]$$
Step 4: Separate Present and Absent Words
We can split this sum:
$$\log P(\mathbf{x} | c) = \sum_{i: x_i = 1} \log p_i^c + \sum_{i: x_i = 0} \log q_i^c$$
Or equivalently, rewriting the second sum:
$$\log P(\mathbf{x} | c) = \sum_{i: x_i = 1} \log \frac{p_i^c}{q_i^c} + \sum_{i=1}^{|V|} \log q_i^c$$
The second term $\sum_{i=1}^{|V|} \log q_i^c$ is a constant for each class (depends only on class, not document). It can be absorbed into the prior:
$$\log P(c | \mathbf{x}) \propto \log P(c) + \sum_{i=1}^{|V|} \log q_i^c + \sum_{i: x_i = 1} \log \frac{p_i^c}{q_i^c}$$
The term $\sum_{i=1}^{|V|} \log q_i^c$ captures the 'baseline' likelihood of generating any document from class $c$, assuming all words are absent. Each present word then adds a log-odds adjustment $\log(p_i^c/q_i^c)$. This structure reveals why Bernoulli NB models absence: the baseline already accounts for all words being absent, and present words perturb this baseline.
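This rearrangement is easy to verify numerically; the short sketch below (with assumed per-word probabilities) computes the log-likelihood both ways and gets the same value:

```python
# Verify that "baseline + log-odds of present words" equals the explicit sum.
import math

p = {"free": 0.8, "prize": 0.7, "meeting": 0.1}   # assumed P(w=1|c)
present = {"free"}                                 # words present in the document

# Explicit form: one term per vocabulary word, present or absent
explicit = sum(
    math.log(p[w]) if w in present else math.log(1 - p[w]) for w in p
)

# Rearranged form: per-class baseline plus log-odds of present words only
baseline = sum(math.log(1 - p[w]) for w in p)
rearranged = baseline + sum(math.log(p[w] / (1 - p[w])) for w in present)

print(explicit, rearranged)  # identical up to floating-point error
```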
```python
import numpy as np
from collections import Counter, defaultdict
from typing import List, Dict, Tuple, Set
import math


class BernoulliNaiveBayes:
    """
    Complete implementation of Bernoulli Naive Bayes from scratch.

    This model differs from Multinomial NB in two key ways:
    1. Features are binary (word present/absent), not counts
    2. Absent words contribute to the likelihood (not ignored)
    """

    def __init__(self, alpha: float = 1.0, binarize: bool = True):
        """
        Initialize the classifier.

        Args:
            alpha: Laplace smoothing parameter for Beta-Bernoulli conjugacy
                   P(w=1|c) = (n_wc + α) / (N_c + 2α)
            binarize: If True, convert counts to binary (0/1)
        """
        self.alpha = alpha
        self.binarize = binarize

        # Model parameters
        self.classes_: List[str] = []
        self.log_priors_: Dict[str, float] = {}  # log P(c)
        self.feature_log_probs_: Dict[str, Dict[str, Tuple[float, float]]] = {}
        # feature_log_probs_[c][w] = (log P(w=1|c), log P(w=0|c))
        self.vocabulary_: Set[str] = set()

        # For the constant term (all-absent baseline)
        self.log_absence_baselines_: Dict[str, float] = {}

    def _tokenize(self, document: str) -> Set[str]:
        """Tokenize and return unique words (binary representation)."""
        tokens = document.lower().split()
        if self.binarize:
            return set(tokens)  # Unique words only
        return set(tokens)

    def fit(self, documents: List[str], labels: List[str]) -> 'BernoulliNaiveBayes':
        """
        Fit the Bernoulli NB model.

        For each word w and class c, we estimate:
            P(w=1|c) = (count of docs in c containing w + α) / (count of docs in c + 2α)

        The 2α in the denominator comes from the Beta(α, α) prior
        on the Bernoulli parameter.
        """
        n_docs = len(documents)

        # Step 1: Build vocabulary from all documents
        tokenized_docs = [self._tokenize(doc) for doc in documents]
        for tokens in tokenized_docs:
            self.vocabulary_.update(tokens)

        # Step 2: Compute class priors
        self.classes_ = sorted(set(labels))
        class_doc_counts = Counter(labels)
        self.log_priors_ = {
            c: math.log(class_doc_counts[c] / n_docs) for c in self.classes_
        }

        # Step 3: Count documents containing each word, per class
        # doc_counts[c][w] = number of documents in class c containing word w
        doc_counts = {c: defaultdict(int) for c in self.classes_}
        for tokens, label in zip(tokenized_docs, labels):
            for token in tokens:  # tokens is already a set (unique words)
                doc_counts[label][token] += 1

        # Step 4: Compute feature log probabilities with smoothing
        # P(w=1|c) = (doc_count(w,c) + α) / (N_c + 2α)
        self.feature_log_probs_ = {}
        self.log_absence_baselines_ = {}

        for c in self.classes_:
            N_c = class_doc_counts[c]
            denominator = N_c + 2 * self.alpha
            class_probs = {}
            log_absence_sum = 0.0

            for word in self.vocabulary_:
                # Laplace-smoothed probability
                p_present = (doc_counts[c].get(word, 0) + self.alpha) / denominator
                p_absent = 1 - p_present

                # Store log probabilities for both states
                class_probs[word] = (
                    math.log(p_present),  # log P(w=1|c)
                    math.log(p_absent)    # log P(w=0|c)
                )

                # Accumulate absence baseline
                log_absence_sum += math.log(p_absent)

            self.feature_log_probs_[c] = class_probs
            self.log_absence_baselines_[c] = log_absence_sum

        return self

    def predict_log_proba(self, document: str) -> Dict[str, float]:
        """
        Compute log-probabilities for each class.

        Uses the efficient formulation:
            log P(x|c) = baseline_c + Σ_{w present} log(p_w / q_w)
        where baseline_c = Σ_{all w} log(q_w)
        """
        present_words = self._tokenize(document)
        log_probs = {}

        for c in self.classes_:
            # Start with prior + absence baseline
            log_prob = self.log_priors_[c] + self.log_absence_baselines_[c]

            # Add log-odds for each present word
            for word in present_words:
                if word in self.feature_log_probs_[c]:
                    log_p_present, log_p_absent = self.feature_log_probs_[c][word]
                    # log(p/q) = log(p) - log(q)
                    log_odds = log_p_present - log_p_absent
                    log_prob += log_odds
                # Unknown words: contribution is 0 (log-odds of uniform = 0)

            log_probs[c] = log_prob

        return log_probs

    def predict_log_proba_explicit(self, document: str) -> Dict[str, float]:
        """
        Alternative: explicit computation considering ALL vocabulary words.

        This is slower but clearer for understanding.
        Demonstrates that we consider EVERY word, present or not.
        """
        present_words = self._tokenize(document)
        log_probs = {}

        for c in self.classes_:
            log_prob = self.log_priors_[c]

            # Iterate over ENTIRE vocabulary
            for word in self.vocabulary_:
                log_p_present, log_p_absent = self.feature_log_probs_[c][word]
                if word in present_words:
                    log_prob += log_p_present
                else:
                    log_prob += log_p_absent  # This is the key difference!

            log_probs[c] = log_prob

        return log_probs

    def predict(self, document: str) -> str:
        """Predict the most likely class."""
        log_probs = self.predict_log_proba(document)
        return max(log_probs, key=log_probs.get)

    def predict_proba(self, document: str) -> Dict[str, float]:
        """Compute normalized probabilities."""
        log_probs = self.predict_log_proba(document)
        max_log = max(log_probs.values())
        exp_probs = {c: math.exp(lp - max_log) for c, lp in log_probs.items()}
        total = sum(exp_probs.values())
        return {c: p / total for c, p in exp_probs.items()}


# ============================================================
# DEMONSTRATE BERNOULLI VS MULTINOMIAL DIFFERENCE
# ============================================================

print("=" * 60)
print("BERNOULLI NAIVE BAYES: EXPLICIT ABSENCE MODELING")
print("=" * 60)

# Training data with short messages (where Bernoulli often excels)
training_messages = [
    # Spam (short promotional messages)
    ("free prize win now", "spam"),
    ("winner free money", "spam"),
    ("claim prize today", "spam"),
    ("free offer limited", "spam"),
    # Ham (short normal messages)
    ("meeting tomorrow morning", "ham"),
    ("report ready review", "ham"),
    ("lunch today noon", "ham"),
    ("project update status", "ham"),
]

docs = [msg for msg, _ in training_messages]
labels = [label for _, label in training_messages]

# Train Bernoulli NB
model = BernoulliNaiveBayes(alpha=1.0)
model.fit(docs, labels)

print(f"\nVocabulary size: {len(model.vocabulary_)} words")
print(f"Vocabulary: {sorted(model.vocabulary_)}")

# Show parameters for a few interesting words
print("\nFeature Probabilities:")
print(f"{'Word':15s} {'P(w=1|spam)':>12s} {'P(w=1|ham)':>12s} {'Log-Odds':>10s}")
print("-" * 55)

for word in sorted(model.vocabulary_):
    log_p_spam, log_q_spam = model.feature_log_probs_['spam'][word]
    log_p_ham, log_q_ham = model.feature_log_probs_['ham'][word]
    p_spam = math.exp(log_p_spam)
    p_ham = math.exp(log_p_ham)
    log_odds = log_p_spam - log_q_spam - (log_p_ham - log_q_ham)
    print(f"{word:15s} {p_spam:>12.4f} {p_ham:>12.4f} {log_odds:>+10.3f}")

# Demonstrate absence effect
print("\n" + "=" * 60)
print("DEMONSTRATING THE EFFECT OF WORD ABSENCE")
print("=" * 60)

test_cases = [
    "free prize win",      # Contains spam words
    "meeting report",      # Contains ham words
    "hello world test",    # Neutral words (not in vocab)
]

for msg in test_cases:
    log_probs = model.predict_log_proba(msg)
    probs = model.predict_proba(msg)
    print(f"\nMessage: '{msg}'")
    print(f"  Present words: {model._tokenize(msg) & model.vocabulary_}")
    print(f"  Absent words: {model.vocabulary_ - model._tokenize(msg)}")
    print(f"  P(spam) = {probs['spam']:.4f}, P(ham) = {probs['ham']:.4f}")
    print(f"  Prediction: {model.predict(msg).upper()}")
```

The parameter estimation for Bernoulli NB differs subtly but importantly from the multinomial case.
What We're Estimating:
For each word $w$ and class $c$, we estimate $P(w = 1 | c)$ — the probability that a document from class $c$ contains word $w$.
Maximum Likelihood Estimate:
$$\hat{P}(w = 1 | c) = \frac{\text{number of documents in class } c \text{ containing word } w}{\text{number of documents in class } c} = \frac{n_{wc}}{N_c}$$
Laplace-Smoothed Estimate:
To avoid zero probabilities, we add smoothing:
$$\hat{P}(w = 1 | c) = \frac{n_{wc} + \alpha}{N_c + 2\alpha}$$
Note the $2\alpha$ in the denominator (not $|V| \cdot \alpha$ as in the multinomial case). This is because each word is a binary random variable with two outcomes (present/absent), so we add $\alpha$ pseudo-counts for each outcome.
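A quick arithmetic check of the formula, using assumed counts (say 3 of 4 spam documents contain "free"):

```python
# Worked example of the smoothed estimate with assumed counts.
alpha = 1.0
n_wc = 3    # spam documents containing "free" (assumed)
N_c = 4     # total spam documents (assumed)

mle = n_wc / N_c                                   # 0.75, can reach 0 or 1 exactly
smoothed = (n_wc + alpha) / (N_c + 2 * alpha)      # (3+1)/(4+2) ≈ 0.667

# A word never seen in spam still gets a nonzero presence probability:
unseen = (0 + alpha) / (N_c + 2 * alpha)           # 1/6 ≈ 0.167
print(mle, smoothed, unseen)
```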
| Aspect | Multinomial NB | Bernoulli NB |
|---|---|---|
| Count being used | Total occurrences of word in class | Number of documents containing word |
| Denominator | Total words in class | Number of documents in class |
| Smoothing denominator | $+ \alpha |V|$ (one α per vocabulary word) | $+ 2\alpha$ (one α per outcome: present/absent) |
| Interpretation | Fraction of class words that are word $w$ | Probability document contains word $w$ |
| Range | Must sum to 1 over vocabulary | Each P(w|c) is estimated independently and lies in [0, 1] |
The Laplace smoothing for Bernoulli NB corresponds to a Beta(α, α) prior on each word's presence probability. With α = 1, this is the uniform prior Beta(1, 1). The posterior mean estimate becomes (n + α) / (N + 2α), which is the Laplace-smoothed estimate. This Bayesian interpretation explains why the denominator has 2α: we're adding α pseudo-observations for 'present' and α for 'absent'.
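As a small numeric check of this Bayesian reading (same assumed counts as above, plain arithmetic, no libraries needed):

```python
# The Beta(α, α) posterior after observing n_wc "present" out of N_c documents
# is Beta(α + n_wc, α + N_c - n_wc); its mean equals the Laplace-smoothed estimate.
alpha = 1.0
n_wc, N_c = 3, 4                                   # assumed counts

posterior_a = alpha + n_wc                         # "present" observations + prior
posterior_b = alpha + (N_c - n_wc)                 # "absent" observations + prior
posterior_mean = posterior_a / (posterior_a + posterior_b)

laplace = (n_wc + alpha) / (N_c + 2 * alpha)
print(posterior_mean, laplace)                     # both 0.6666...
```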
```python
import math
from collections import defaultdict, Counter


def detailed_bernoulli_estimation(documents, labels, alpha=1.0):
    """
    Step-by-step parameter estimation for Bernoulli NB.

    This function explicitly shows each step of the estimation process.
    """
    # Step 1: Tokenize documents (get unique words per document)
    tokenized = [set(doc.lower().split()) for doc in documents]

    # Step 2: Build vocabulary
    vocabulary = set()
    for tokens in tokenized:
        vocabulary.update(tokens)
    vocabulary = sorted(vocabulary)

    print("STEP 1 & 2: Tokenization and Vocabulary")
    print(f"  Vocabulary ({len(vocabulary)} words): {vocabulary}")

    # Step 3: Count class frequencies
    classes = sorted(set(labels))
    class_counts = Counter(labels)

    print("\nSTEP 3: Class Counts")
    for c, count in class_counts.items():
        print(f"  {c}: {count} documents")

    # Step 4: Count documents containing each word, per class
    doc_word_counts = {c: defaultdict(int) for c in classes}
    for tokens, label in zip(tokenized, labels):
        for word in tokens:
            doc_word_counts[label][word] += 1

    print("\nSTEP 4: Document-Word Counts")
    print(f"{'Word':15s}", end="")
    for c in classes:
        print(f"{c + ' docs':>12s}", end="")
    print()
    print("-" * 40)
    for word in vocabulary:
        print(f"{word:15s}", end="")
        for c in classes:
            print(f"{doc_word_counts[c][word]:>12d}", end="")
        print()

    # Step 5: Compute smoothed probabilities
    print("\nSTEP 5: Smoothed Probability Estimates")
    print("Formula: P(w=1|c) = (n_wc + α) / (N_c + 2α)")
    print(f"Smoothing α = {alpha}")
    print()
    print(f"{'Word':15s}", end="")
    for c in classes:
        print(f"{'P(w|' + c + ')':>12s}", end="")
    print(f"{'Log-Odds':>12s}")
    print("-" * 55)

    feature_probs = {c: {} for c in classes}
    for word in vocabulary:
        print(f"{word:15s}", end="")
        for c in classes:
            N_c = class_counts[c]
            n_wc = doc_word_counts[c][word]
            p_present = (n_wc + alpha) / (N_c + 2 * alpha)
            feature_probs[c][word] = p_present
            print(f"{p_present:>12.4f}", end="")
        # Log-odds ratio for binary classification
        if len(classes) == 2:
            c1, c2 = classes
            p1 = feature_probs[c1][word]
            p2 = feature_probs[c2][word]
            # Log-odds of word being present pointing to c1 vs c2
            log_odds = math.log(p1 / (1 - p1)) - math.log(p2 / (1 - p2))
            print(f"{log_odds:>+12.3f}")
        else:
            print()

    # Step 6: Compute absence baselines
    print("\nSTEP 6: Absence Baselines")
    print("  baseline_c = Σ log(1 - P(w=1|c)) for all words")
    for c in classes:
        baseline = sum(
            math.log(1 - feature_probs[c][word]) for word in vocabulary
        )
        print(f"  {c}: {baseline:.4f}")

    return feature_probs, class_counts


# Example with clear data
training_data = [
    ("free money prize", "spam"),
    ("free winner claim", "spam"),
    ("money free now", "spam"),
    ("meeting report project", "ham"),
    ("status update meeting", "ham"),
    ("report project review", "ham"),
]

docs = [d for d, _ in training_data]
labels = [l for _, l in training_data]

print("=" * 60)
print("DETAILED PARAMETER ESTIMATION")
print("=" * 60)
print(f"\nTraining Data ({len(docs)} documents):")
for doc, label in training_data:
    print(f"  [{label}] '{doc}'")

print("\n" + "=" * 60)
detailed_bernoulli_estimation(docs, labels, alpha=1.0)
```

The most distinctive feature of Bernoulli NB is its explicit modeling of word absence. Let's explore this in depth with concrete examples.
The Mathematical Mechanism:
For a vocabulary word $w$ and class $c$, every document receives a contribution from $w$: $\log P(w = 1 | c)$ if $w$ is present, and $\log(1 - P(w = 1 | c))$ if it is absent.
Consider a word "free" that appears in 80% of spam but only 5% of ham, i.e. $P(\text{free} = 1 | \text{spam}) = 0.80$ and $P(\text{free} = 1 | \text{ham}) = 0.05$:
When "free" is present: the spam-vs-ham log-odds contribution is $\log(0.80 / 0.05) = \log 16 \approx +2.77$, strong evidence for spam.
When "free" is absent: the contribution is $\log(0.20 / 0.95) \approx -1.56$, moderate evidence against spam.
The absence of a class-typical word provides evidence against that class. If spam emails usually contain 'free', then NOT seeing 'free' makes spam less likely. This is intuitive but often overlooked in Multinomial NB, where absence simply contributes nothing. Bernoulli NB explicitly captures this anti-evidence.
```python
import math
from typing import Dict, Set


def analyze_absence_effect(vocabulary: Set[str],
                           feature_probs: Dict[str, Dict[str, float]],
                           classes: tuple = ('spam', 'ham')):
    """
    Analyze how word presence and absence affect classification.

    Shows the log-odds contribution of each word in both states.
    """
    c1, c2 = classes

    print("WORD PRESENCE/ABSENCE ANALYSIS")
    print("=" * 70)
    print(f"Comparing {c1} vs {c2}")
    print()
    print(f"{'Word':15s} {'P(w|'+c1+')':>10s} {'P(w|'+c2+')':>10s} "
          f"{'Present→':>12s} {'Absent→':>12s}")
    print("-" * 70)

    contributions = []
    for word in sorted(vocabulary):
        p1 = feature_probs[c1][word]
        p2 = feature_probs[c2][word]

        # Log-odds when present: log[P(w=1|c1)/P(w=1|c2)]
        present_log_odds = math.log(p1 / p2) if p2 > 0 else float('inf')

        # Log-odds when absent: log[P(w=0|c1)/P(w=0|c2)] = log[(1-p1)/(1-p2)]
        absent_log_odds = math.log((1 - p1) / (1 - p2)) if (1 - p2) > 0 else float('-inf')

        # Direction indicators
        present_dir = c1 if present_log_odds > 0 else c2
        absent_dir = c1 if absent_log_odds > 0 else c2

        print(f"{word:15s} {p1:>10.4f} {p2:>10.4f} "
              f"{present_log_odds:>+10.3f}→{present_dir[:1]} "
              f"{absent_log_odds:>+10.3f}→{absent_dir[:1]}")

        contributions.append({
            'word': word,
            'present_effect': present_log_odds,
            'absent_effect': absent_log_odds,
            'swing': present_log_odds - absent_log_odds  # Total discriminative power
        })

    # Find most discriminative words
    contributions.sort(key=lambda x: abs(x['swing']), reverse=True)

    print("\n" + "=" * 70)
    print("MOST DISCRIMINATIVE WORDS (by total swing)")
    print("-" * 70)
    for item in contributions[:5]:
        print(f"{item['word']:15s}: present→{item['present_effect']:+.3f}, "
              f"absent→{item['absent_effect']:+.3f}, swing={item['swing']:+.3f}")

    return contributions


def classify_with_breakdown(document: str,
                            vocabulary: Set[str],
                            feature_probs: Dict[str, Dict[str, float]],
                            class_priors: Dict[str, float],
                            classes: tuple = ('spam', 'ham')):
    """
    Classify a document with detailed breakdown of contributions.

    Shows exactly how each word (present and absent) affects
    the final decision.
    """
    c1, c2 = classes
    present_words = set(document.lower().split()) & vocabulary
    absent_words = vocabulary - present_words

    print(f"\nDocument: '{document}'")
    print("-" * 60)

    # Prior contribution
    prior_diff = math.log(class_priors[c1] / class_priors[c2])
    print(f"Prior log-odds ({c1}/{c2}): {prior_diff:+.4f}")

    # Present word contributions
    print(f"\nPresent words ({len(present_words)}):")
    present_total = 0
    for word in sorted(present_words):
        p1 = feature_probs[c1][word]
        p2 = feature_probs[c2][word]
        log_odds = math.log(p1 / p2)
        present_total += log_odds
        direction = c1 if log_odds > 0 else c2
        print(f"  {word:15s}: log({p1:.3f}/{p2:.3f}) = {log_odds:+.4f} → {direction}")
    print(f"  {'SUBTOTAL':15s}: {present_total:+.4f}")

    # Absent word contributions
    print(f"\nAbsent words ({len(absent_words)}):")
    absent_total = 0
    absent_details = []
    for word in sorted(absent_words):
        q1 = 1 - feature_probs[c1][word]
        q2 = 1 - feature_probs[c2][word]
        log_odds = math.log(q1 / q2)
        absent_total += log_odds
        absent_details.append((word, log_odds))

    # Show most impactful absent words
    absent_details.sort(key=lambda x: abs(x[1]), reverse=True)
    for word, log_odds in absent_details[:5]:
        direction = c1 if log_odds > 0 else c2
        print(f"  {word:15s}: {log_odds:+.4f} → {direction}")
    if len(absent_details) > 5:
        remaining = sum(lo for _, lo in absent_details[5:])
        print(f"  {'(other absent)':15s}: {remaining:+.4f}")
    print(f"  {'SUBTOTAL':15s}: {absent_total:+.4f}")

    # Total
    total = prior_diff + present_total + absent_total
    print(f"\nTOTAL log-odds: {total:+.4f}")

    # Convert to probability
    prob_c1 = 1 / (1 + math.exp(-total))
    print(f"P({c1}|doc) = {prob_c1:.4f}")
    print(f"Prediction: {c1.upper() if total > 0 else c2.upper()}")


# Example analysis
vocabulary = {'free', 'money', 'prize', 'winner',
              'meeting', 'report', 'project', 'status'}

feature_probs = {
    'spam': {'free': 0.75, 'money': 0.60, 'prize': 0.70, 'winner': 0.65,
             'meeting': 0.10, 'report': 0.08, 'project': 0.12, 'status': 0.05},
    'ham': {'free': 0.05, 'money': 0.15, 'prize': 0.02, 'winner': 0.03,
            'meeting': 0.70, 'report': 0.65, 'project': 0.60, 'status': 0.55}
}

class_priors = {'spam': 0.3, 'ham': 0.7}

# Analyze each word's presence/absence effect
analyze_absence_effect(vocabulary, feature_probs)

# Classify with detailed breakdown
classify_with_breakdown("free prize meeting", vocabulary, feature_probs, class_priors)
classify_with_breakdown("report project status", vocabulary, feature_probs, class_priors)
```

Understanding when to choose Bernoulli NB over Multinomial NB is crucial for practitioners. The choice depends on the nature of your documents and the classification task.
Bernoulli NB tends to excel on short, sparse texts where what matters is which words appear, not how often they appear. The table below summarizes when to prefer each model:
| Characteristic | Choose Multinomial | Choose Bernoulli |
|---|---|---|
| Document length | Long (100+ words) | Short (< 50 words) |
| Word repetition | Meaningful (frequency matters) | Rare or noise |
| Feature count | Many non-zero features | Sparse: few words per doc |
| Classification signal | Frequency-based | Pattern-based |
| Example domains | News articles, research papers, books | Tweets, queries, SMS, tags |
Ask yourself: 'If a word appears 5 times instead of once, does it tell me more about the class?' If yes, use Multinomial. If no (or it's mostly noise), use Bernoulli. For tweets, one 'free' is as spammy as five. For research papers, multiple mentions of 'quantum' more strongly indicate physics.
```python
import math
from collections import Counter
from typing import List, Tuple


# Simplified implementations for comparison
class SimpleMultinomialNB:
    """Minimal Multinomial NB for comparison."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.log_priors = {}
        self.log_word_probs = {}
        self.vocab = set()

    def fit(self, docs: List[str], labels: List[str]):
        tokenized = [doc.lower().split() for doc in docs]
        for tokens in tokenized:
            self.vocab.update(tokens)

        classes = set(labels)
        class_counts = Counter(labels)
        n_docs = len(docs)
        self.log_priors = {c: math.log(class_counts[c] / n_docs) for c in classes}

        # Word counts per class
        word_counts = {c: Counter() for c in classes}
        for tokens, label in zip(tokenized, labels):
            word_counts[label].update(tokens)

        # Log probabilities
        self.log_word_probs = {}
        for c in classes:
            total = sum(word_counts[c].values())
            self.log_word_probs[c] = {
                w: math.log((word_counts[c][w] + self.alpha) /
                            (total + self.alpha * len(self.vocab)))
                for w in self.vocab
            }

    def predict(self, doc: str) -> Tuple[str, float]:
        tokens = doc.lower().split()
        counts = Counter(tokens)
        scores = {}
        for c in self.log_priors:
            score = self.log_priors[c]
            for word, count in counts.items():
                if word in self.log_word_probs[c]:
                    score += count * self.log_word_probs[c][word]
            scores[c] = score
        best = max(scores, key=scores.get)
        return best, scores[best] - scores[min(scores, key=scores.get)]


class SimpleBernoulliNB:
    """Minimal Bernoulli NB for comparison."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.log_priors = {}
        self.log_probs = {}  # (log p, log q) tuples
        self.vocab = set()

    def fit(self, docs: List[str], labels: List[str]):
        tokenized = [set(doc.lower().split()) for doc in docs]
        for tokens in tokenized:
            self.vocab.update(tokens)

        classes = set(labels)
        class_counts = Counter(labels)
        n_docs = len(docs)
        self.log_priors = {c: math.log(class_counts[c] / n_docs) for c in classes}

        # Document counts per class
        doc_word_counts = {c: Counter() for c in classes}
        for tokens, label in zip(tokenized, labels):
            for token in tokens:
                doc_word_counts[label][token] += 1

        # Probability tuples
        self.log_probs = {}
        for c in classes:
            N_c = class_counts[c]
            probs = {}
            for w in self.vocab:
                p = (doc_word_counts[c][w] + self.alpha) / (N_c + 2 * self.alpha)
                probs[w] = (math.log(p), math.log(1 - p))
            self.log_probs[c] = probs

    def predict(self, doc: str) -> Tuple[str, float]:
        present = set(doc.lower().split()) & self.vocab
        scores = {}
        for c in self.log_priors:
            score = self.log_priors[c]
            for w in self.vocab:
                log_p, log_q = self.log_probs[c][w]
                if w in present:
                    score += log_p
                else:
                    score += log_q
            scores[c] = score
        best = max(scores, key=scores.get)
        return best, scores[best] - scores[min(scores, key=scores.get)]


# Comparison experiment
def compare_models(train_docs, train_labels, test_docs, test_labels):
    """Compare Multinomial and Bernoulli NB on the same data."""
    mnb = SimpleMultinomialNB(alpha=1.0)
    bnb = SimpleBernoulliNB(alpha=1.0)
    mnb.fit(train_docs, train_labels)
    bnb.fit(train_docs, train_labels)

    print("\nMODEL COMPARISON RESULTS")
    print("=" * 70)
    print(f"{'Document':40s} {'True':>6s} {'MNB':>6s} {'BNB':>6s}")
    print("-" * 70)

    mnb_correct = 0
    bnb_correct = 0
    for doc, true_label in zip(test_docs, test_labels):
        mnb_pred, _ = mnb.predict(doc)
        bnb_pred, _ = bnb.predict(doc)
        mnb_mark = "✓" if mnb_pred == true_label else "✗"
        bnb_mark = "✓" if bnb_pred == true_label else "✗"
        mnb_correct += (mnb_pred == true_label)
        bnb_correct += (bnb_pred == true_label)
        print(f"'{doc[:38]:38s}' {true_label:>6s} {mnb_pred:>4s}{mnb_mark:>2s} "
              f"{bnb_pred:>4s}{bnb_mark:>2s}")

    print("-" * 70)
    print(f"Accuracy: Multinomial = {mnb_correct}/{len(test_docs)}, "
          f"Bernoulli = {bnb_correct}/{len(test_docs)}")


# Test case: SHORT documents (Bernoulli should do well)
short_train = [
    ("free prize", "spam"),
    ("win money", "spam"),
    ("free offer", "spam"),
    ("meeting today", "ham"),
    ("report ready", "ham"),
    ("project update", "ham"),
]
short_test = [
    ("free money", "spam"),
    ("prize win", "spam"),
    ("meeting report", "ham"),
    ("update today", "ham"),
]

print("\n" + "=" * 70)
print("TEST 1: SHORT DOCUMENTS")
print("=" * 70)
compare_models(
    [d for d, _ in short_train], [l for _, l in short_train],
    [d for d, _ in short_test], [l for _, l in short_test]
)

# Test case: REPEATED words (Multinomial should do well)
repeat_train = [
    ("free free free money money prize", "spam"),
    ("free free winner winner winner claim", "spam"),
    ("meeting meeting discussion report report", "ham"),
    ("project project update status status", "ham"),
]
repeat_test = [
    ("free free free free", "spam"),     # Heavy repetition
    ("meeting meeting meeting", "ham"),
]

print("\n" + "=" * 70)
print("TEST 2: DOCUMENTS WITH REPEATED WORDS")
print("=" * 70)
compare_models(
    [d for d, _ in repeat_train], [l for _, l in repeat_train],
    [d for d, _ in repeat_test], [l for _, l in repeat_test]
)
```

Bernoulli NB has distinct computational characteristics compared to Multinomial NB, with tradeoffs that affect practical deployment.
Training Complexity:
Both models have similar training complexity: $O(N \cdot L)$, where $N$ is the number of documents and $L$ is the average document length. The key difference is what gets counted: Multinomial NB tallies the total number of occurrences of each word within a class, while Bernoulli NB counts the number of documents in the class that contain each word.
Prediction Complexity:
Here the models diverge significantly. Multinomial NB scores a document in $O(|C| \cdot L_d)$ time, touching only the $L_d$ words that actually appear in the document, while a naive Bernoulli implementation evaluates every vocabulary word for every class, costing $O(|C| \cdot |V|)$ per document.
With vocabulary sizes of 50,000+ words, Bernoulli NB's naive implementation iterates over far more features than Multinomial NB for typical documents (which might have only 100-200 unique words).
Efficient Bernoulli Implementation:
As we derived earlier, Bernoulli NB can be computed efficiently using:
$$\log P(\mathbf{x} | c) = \underbrace{\sum_{i=1}^{|V|} \log(1 - p_i^c)}_{\text{precomputed baseline}} + \sum_{i: x_i = 1} \log \frac{p_i^c}{1 - p_i^c}$$
The baseline is computed once during training, and prediction only iterates over present words: $O(|C| \cdot L_d)$ — same as Multinomial NB!
| Operation | Multinomial NB | Bernoulli NB (Naive) | Bernoulli NB (Optimized) |
|---|---|---|---|
| Training | $O(N \cdot L)$ | $O(N \cdot L)$ | $O(N \cdot L + |C| \cdot |V|)$* |
| Prediction (per doc) | $O(|C| \cdot L_d)$ | $O(|C| \cdot |V|)$ | $O(|C| \cdot L_d)$ |
| Memory | $O(|C| \cdot |V|)$ | $O(|C| \cdot |V|)$ | $O(|C| \cdot |V| + |C|)$** |
*Includes baseline precomputation, **Plus baseline storage (one value per class)
Space Optimization:
Both models store $O(|C| \cdot |V|)$ probability values. For Bernoulli NB, an additional optimization is storing log-odds ratios $\log(p/(1-p))$ directly, eliminating the need to store both $\log p$ and $\log(1-p)$.
Naive implementations of Bernoulli NB that iterate over the entire vocabulary at prediction time can be 100x slower than Multinomial NB for typical documents. Always use the optimized formulation with precomputed baselines. The efficiency gain is dramatic: predicting on a 100-word document with 50,000 vocabulary goes from 50,000 operations to 100 operations per class.
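The sketch below is one way to measure this on synthetic data; it assumes the `BernoulliNaiveBayes` class from the earlier listing is in scope and compares its `predict_log_proba_explicit` and `predict_log_proba` methods. The exact speedup depends on vocabulary size and hardware.

```python
# Rough timing sketch (assumes the BernoulliNaiveBayes class defined above is in scope).
# Compares the explicit all-vocabulary prediction with the optimized
# baseline + present-word formulation on a synthetic corpus.
import random
import string
import time

random.seed(0)
vocab_words = ["".join(random.choices(string.ascii_lowercase, k=6)) for _ in range(5000)]
train_docs = [" ".join(random.sample(vocab_words, 50)) for _ in range(200)]
train_labels = [random.choice(["spam", "ham"]) for _ in range(200)]

model = BernoulliNaiveBayes(alpha=1.0)
model.fit(train_docs, train_labels)

test_doc = " ".join(random.sample(vocab_words, 100))

start = time.perf_counter()
for _ in range(100):
    model.predict_log_proba_explicit(test_doc)   # iterates over all |V| words
explicit_time = time.perf_counter() - start

start = time.perf_counter()
for _ in range(100):
    model.predict_log_proba(test_doc)            # iterates over present words only
optimized_time = time.perf_counter() - start

print(f"explicit: {explicit_time:.3f}s, optimized: {optimized_time:.3f}s")
```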
We have thoroughly examined Bernoulli Naive Bayes, understanding its unique properties and how it complements the multinomial model.
What's Next:
In the next page, we dive deep into Laplace Smoothing—the critical technique that prevents zero probabilities for unseen words. We'll explore its mathematical foundations, variants like Lidstone smoothing, and practical guidance for choosing the smoothing parameter α.
You now understand Bernoulli Naive Bayes in depth—its generative model, explicit handling of word absence, efficient implementation, and situations where it outperforms the more common multinomial variant. You can confidently choose between the two models based on document characteristics and implement either from scratch.