You've now mastered both Multinomial and Bernoulli Naive Bayes in depth. The natural question arises: When should you use each one?
The answer is not always obvious. Both models share the same foundational algorithm—applying Bayes' theorem with the naive independence assumption—yet they make fundamentally different assumptions about how documents are generated. These differences lead to distinct behaviors that make each model better suited for different types of text classification tasks.
This page synthesizes everything we've learned into actionable decision frameworks, backed by theoretical analysis and empirical evidence. By the end, you'll be able to confidently select the right variant for any text classification problem.
By the end of this page, you will:
- understand the fundamental differences between Multinomial and Bernoulli NB in terms of assumptions, feature representation, and decision boundaries;
- have clear criteria for selecting between them;
- see empirical comparisons across multiple domains;
- understand hybrid approaches and when to combine both models;
- have practical implementation guidance for production systems.
Let's systematically compare every aspect of the two models.
| Aspect | Multinomial Naive Bayes | Bernoulli Naive Bayes |
|---|---|---|
| Generative Story | Documents generated by drawing n words from a categorical distribution over vocabulary | Documents generated by flipping |V| coins to decide which words are present |
| Feature Input | Word frequency counts: x ∈ ℕ^|V| | Binary presence/absence: x ∈ {0,1}^|V| |
| What Counts Matter | How many times each word appears | Whether each word appears at all |
| Handles Absence | No—absent words contribute nothing to likelihood | Yes—absence explicitly modeled as evidence |
| Document Length | Implicitly captured (sum of counts) | Not directly captured |
| Parameter Estimation | P(w|c) = word_count / total_words_in_class | P(w=1|c) = num_docs_with_word / num_docs_in_class |
| Smoothing Denominator | total + α|V| | N_class + 2α |
| Log-Probabilities Summed Over | Only present words (sparse computation) | All vocabulary words (or use the optimized baseline form) |
| Decision Boundary | Linear in word frequencies | Linear in binary presence indicators |
| Best For | Longer documents where frequency matters | Shorter documents where presence patterns matter |
The Core Difference in One Sentence:
Multinomial NB answers: "Given the word frequencies observed, which class most likely generated this document?"
Bernoulli NB answers: "Given which words are present and absent, which class most likely generated this document?"
This seemingly subtle distinction has profound implications for classifier behavior.
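To make the contrast concrete, here is the class-conditional likelihood each model assigns to a document $d$, writing $x_w$ for the count of word $w$ and $b_w \in \{0, 1\}$ for its presence indicator (notation introduced here for illustration):

$$P_{\text{mult}}(d \mid c) \propto \prod_{w \in V} P(w \mid c)^{x_w} \qquad\qquad P_{\text{bern}}(d \mid c) = \prod_{w \in V} P(w \mid c)^{b_w}\,\bigl(1 - P(w \mid c)\bigr)^{1 - b_w}$$

Only words with $x_w > 0$ contribute factors to the multinomial product, whereas every vocabulary word, present or absent, contributes a factor to the Bernoulli product.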
Understanding why each model performs better in certain conditions requires examining their mathematical properties.
Multinomial NB: The Frequency Signal
Multinomial NB is optimal when word frequency carries discriminative information beyond simple presence. Consider these scenarios:
Repeated keywords indicate topic intensity: A document mentioning "quantum" ten times is more strongly about physics than one mentioning it once.
Long documents benefit from accumulation: With many word positions, the law of large numbers helps—the sample word distribution approaches the true class distribution.
Vocabulary coverage is dense: Most documents use a substantial fraction of class-relevant vocabulary.
Bernoulli NB: The Pattern Signal
Bernoulli NB excels when the pattern of words—which are present, which are absent—is more informative than frequencies:
Short documents: With few words, frequency information is sparse and noisy. A word appearing twice vs. once in a 10-word text is noise.
Binary feature patterns: Some domains have signature word patterns. Security vulnerability reports might always contain certain technical terms.
Negative evidence matters: When certain words being absent is discriminative. Spam often lacks professional terms like "quarterly" or "compliance".
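In log-odds terms, a word $w$ that is absent from a document shifts the Bernoulli score between two classes (say spam vs. ham, following the example above) by

$$\log \frac{1 - P(w \mid \text{spam})}{1 - P(w \mid \text{ham})},$$

which is nonzero whenever the classes differ in how often the word appears; Multinomial NB adds no corresponding term for absent words.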
In their influential 1998 paper 'A Comparison of Event Models for Naive Bayes Text Classification', McCallum and Nigam demonstrated that Multinomial NB typically outperforms Bernoulli NB on longer documents, while Bernoulli NB can be competitive on shorter texts. They also found that the multinomial event model benefits more from larger vocabulary sizes, whereas the Bernoulli model performs best when the vocabulary is kept small.
Mathematical Insight: Information Content
Let's analyze the information content of each representation for a word $w$ appearing $k$ times in a document of length $n$:
Multinomial representation: Records the exact count $k$
Bernoulli representation: Records only that $k \geq 1$ (word is present)
When $k$ is consistently associated with class (e.g., high $k$ → class A, low $k$ → class B), Multinomial NB captures this. When only presence/absence matters, Bernoulli is sufficient and more parsimoniously represents the data.
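A quick numeric sketch of this difference, using made-up per-word probabilities rather than values estimated from any dataset: the multinomial log-likelihood ratio for a word grows linearly with its count $k$, while the Bernoulli contribution is the same for every $k \geq 1$.

```python
import math

# Illustrative (made-up) per-class parameters for a single word w
p_mult_a, p_mult_b = 0.05, 0.01   # Multinomial P(w|c) for classes A and B
p_bern_a, p_bern_b = 0.60, 0.20   # Bernoulli P(w present|c) for classes A and B

# Bernoulli log-likelihood ratio depends only on presence, not on the count
llr_bern = math.log(p_bern_a / p_bern_b) - math.log((1 - p_bern_a) / (1 - p_bern_b))

for k in [1, 2, 5, 10]:
    # Multinomial log-likelihood ratio scales linearly with the count k
    llr_mult = k * (math.log(p_mult_a) - math.log(p_mult_b))
    print(f"count k={k:2d}  multinomial LLR={llr_mult:6.2f}  bernoulli LLR={llr_bern:5.2f}")
```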
Theory is valuable, but empirical evidence on real datasets provides the clearest guidance. Here we compare performance across representative text classification tasks.
| Dataset / Domain | Avg. Doc Length | Multinomial NB | Bernoulli NB | Winner |
|---|---|---|---|---|
| 20 Newsgroups (topic classification) | ~400 words | 85-88% | 78-82% | Multinomial (+6-7%) |
| IMDB Reviews (sentiment) | ~230 words | 83-85% | 80-83% | Multinomial (+3%) |
| SMS Spam (spam detection) | ~15 words | 95-97% | 96-98% | Bernoulli (+1%) |
| Twitter Sentiment (sentiment) | ~12 words | 70-73% | 72-76% | Bernoulli (+2-3%) |
| Academic Papers (topic) | ~5000 words | 90-93% | 82-86% | Multinomial (+8%) |
| Product Titles (category) | ~5 words | 68-72% | 70-75% | Bernoulli (+3%) |
Note: Actual performance varies with preprocessing, vocabulary size, training data size, and class balance. These ranges represent typical results from the literature and practitioner experience.
Key Pattern Observed:
Multinomial NB's advantage grows with document length, while Bernoulli NB is competitive or slightly better on very short texts (roughly 5-20 words). The exact crossover point varies by domain. The fundamental rule: try both and validate on held-out data.
The simulation below makes the length effect concrete: both classifiers are trained on moderate-length synthetic documents and then evaluated on test documents of varying length.

```python
import math
import random
from collections import Counter
from typing import List


class MultinomialNB:
    """Multinomial Naive Bayes for comparison."""

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha
        self.log_priors = {}
        self.log_probs = {}
        self.vocab = set()

    def fit(self, docs: List[str], labels: List[str]):
        tokenized = [doc.lower().split() for doc in docs]
        for tokens in tokenized:
            self.vocab.update(tokens)
        self.classes = sorted(set(labels))
        class_counts = Counter(labels)
        n_docs = len(docs)
        self.log_priors = {c: math.log(class_counts[c] / n_docs) for c in self.classes}

        word_counts = {c: Counter() for c in self.classes}
        for tokens, label in zip(tokenized, labels):
            word_counts[label].update(tokens)

        self.log_probs = {}
        for c in self.classes:
            total = sum(word_counts[c].values())
            self.log_probs[c] = {
                w: math.log((word_counts[c][w] + self.alpha) / (total + self.alpha * len(self.vocab)))
                for w in self.vocab
            }
            self.log_probs[c]['__unseen__'] = math.log(
                self.alpha / (total + self.alpha * len(self.vocab))
            )
        return self

    def predict(self, doc: str) -> str:
        token_counts = Counter(doc.lower().split())
        scores = {}
        for c in self.classes:
            score = self.log_priors[c]
            for token, count in token_counts.items():
                if token in self.log_probs[c]:
                    score += count * self.log_probs[c][token]
                else:
                    score += count * self.log_probs[c]['__unseen__']
            scores[c] = score
        return max(scores, key=scores.get)


class BernoulliNB:
    """Bernoulli Naive Bayes for comparison."""

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha
        self.log_priors = {}
        self.log_probs = {}
        self.vocab = set()
        self.baselines = {}

    def fit(self, docs: List[str], labels: List[str]):
        tokenized = [set(doc.lower().split()) for doc in docs]
        for tokens in tokenized:
            self.vocab.update(tokens)
        self.classes = sorted(set(labels))
        class_counts = Counter(labels)
        n_docs = len(docs)
        self.log_priors = {c: math.log(class_counts[c] / n_docs) for c in self.classes}

        doc_word_counts = {c: Counter() for c in self.classes}
        for tokens, label in zip(tokenized, labels):
            for token in tokens:
                doc_word_counts[label][token] += 1

        self.log_probs = {}
        self.baselines = {}
        for c in self.classes:
            N_c = class_counts[c]
            probs = {}
            baseline = 0.0  # sum of log(1 - p) over the vocabulary: the "all words absent" score
            for word in self.vocab:
                p = (doc_word_counts[c][word] + self.alpha) / (N_c + 2 * self.alpha)
                probs[word] = (math.log(p), math.log(1 - p))
                baseline += math.log(1 - p)
            self.log_probs[c] = probs
            self.baselines[c] = baseline
        return self

    def predict(self, doc: str) -> str:
        present = set(doc.lower().split()) & self.vocab
        scores = {}
        for c in self.classes:
            score = self.log_priors[c] + self.baselines[c]
            for word in present:
                log_p, log_q = self.log_probs[c][word]
                score += log_p - log_q  # swap the "absent" factor for the "present" factor
            scores[c] = score
        return max(scores, key=scores.get)


def compare_on_document_length():
    """
    Compare models across different document lengths.
    This simulation demonstrates the length effect empirically.
    """
    # Define class-specific word distributions
    class_a_words = ['science', 'research', 'experiment', 'data', 'analysis',
                     'theory', 'hypothesis', 'method', 'result', 'study']
    class_b_words = ['business', 'market', 'profit', 'sales', 'company',
                     'revenue', 'growth', 'strategy', 'customer', 'product']
    shared_words = ['the', 'is', 'a', 'and', 'of', 'to', 'in', 'that', 'it']

    def generate_document(class_label: str, length: int) -> str:
        """Generate a document of specified length from the given class."""
        words = class_a_words if class_label == 'A' else class_b_words
        # Mix: 50% class words, 30% shared, 20% noise
        doc_words = []
        for _ in range(length):
            r = random.random()
            if r < 0.5:
                doc_words.append(random.choice(words))
            elif r < 0.8:
                doc_words.append(random.choice(shared_words))
            else:
                doc_words.append(random.choice(class_a_words + class_b_words))
        return ' '.join(doc_words)

    # Generate training data (fixed moderate length)
    train_docs, train_labels = [], []
    for _ in range(100):
        train_docs.append(generate_document('A', 50))
        train_labels.append('A')
        train_docs.append(generate_document('B', 50))
        train_labels.append('B')

    # Train both models
    mnb = MultinomialNB(alpha=1.0).fit(train_docs, train_labels)
    bnb = BernoulliNB(alpha=1.0).fit(train_docs, train_labels)

    # Test at different document lengths
    lengths = [5, 10, 20, 50, 100, 200, 500]
    print("DOCUMENT LENGTH COMPARISON")
    print("=" * 60)
    print(f"{'Length':>8s} {'MNB Acc':>10s} {'BNB Acc':>10s} {'Winner':>12s}")
    print("-" * 45)

    for length in lengths:
        n_test = 100
        # Generate test documents
        test_docs, test_labels = [], []
        for _ in range(n_test // 2):
            test_docs.append(generate_document('A', length))
            test_labels.append('A')
            test_docs.append(generate_document('B', length))
            test_labels.append('B')

        # Evaluate
        mnb_acc = sum(mnb.predict(d) == l for d, l in zip(test_docs, test_labels)) / n_test
        bnb_acc = sum(bnb.predict(d) == l for d, l in zip(test_docs, test_labels)) / n_test

        if mnb_acc > bnb_acc + 0.02:
            winner = "Multinomial"
        elif bnb_acc > mnb_acc + 0.02:
            winner = "Bernoulli"
        else:
            winner = "Tie"
        print(f"{length:>8d} {mnb_acc:>10.1%} {bnb_acc:>10.1%} {winner:>12s}")

    print()
    print("Observations:")
    print("- Short docs (5-20 words): Bernoulli often competitive or better")
    print("- Long docs (100+ words): Multinomial increasingly dominant")
    print("- Medium docs: Often similar performance")


# Run the comparison
random.seed(42)
compare_on_document_length()
```

Based on theoretical analysis and empirical evidence, here's a practical framework for choosing between the models.
A simple rule of thumb: if average document length exceeds roughly 100 words, start with Multinomial; if documents are very short (under about 30 words), start with Bernoulli; in between, try both and compare on held-out data.
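Encoded as a helper function (a sketch; the thresholds are assumptions that mirror the rule of thumb above, not tuned values), the rule might look like this before consulting the task-by-task table below:

```python
def suggest_nb_variant(docs: list) -> str:
    """Heuristic starting point only; always validate both models on held-out data."""
    avg_len = sum(len(d.split()) for d in docs) / max(len(docs), 1)
    if avg_len >= 100:
        return "multinomial"   # long documents: frequency signal dominates
    if avg_len <= 30:
        return "bernoulli"     # short documents: presence patterns dominate
    return "try both"          # in between: compare with cross-validation
```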
| Task / Domain | Recommended Model | Rationale |
|---|---|---|
| Email spam detection | Multinomial | Emails are typically 50-200 words; frequency of spam keywords matters |
| SMS spam detection | Bernoulli | SMS are very short (~15 words); presence patterns dominate |
| Tweet sentiment | Bernoulli | Tweets are short; word presence is primary signal |
| Movie review sentiment | Multinomial | Reviews are longer; sentiment intensity via frequency |
| News topic classification | Multinomial | Articles are long; topic-specific term frequencies matter |
| Product title categorization | Bernoulli | Titles are very short; keyword presence is key |
| Academic paper classification | Multinomial | Papers are very long; terminology density indicates field |
| Intent detection (chatbot) | Bernoulli | Queries are short; specific word presence indicates intent |
In practice, the performance difference between Multinomial and Bernoulli NB is often 2-5% accuracy. Unless you're optimizing for every percentage point, starting with Multinomial NB (the more common default in libraries like scikit-learn) is reasonable. Switch to Bernoulli only if: (1) documents are very short, (2) you have theoretical reason to prefer binary features, or (3) empirical validation shows it performs better on your data.
When neither pure Multinomial nor Bernoulli NB is clearly superior, hybrid approaches can capture benefits of both.
Approach 1: Feature-Level Fusion
Use both frequency and binary features: for each word, include its raw count and a binary presence indicator. This doubles the feature space but lets the model learn which representation is more useful for each word.
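A minimal scikit-learn sketch of this idea (one possible encoding, not the only way to combine the representations): stack a raw-count vectorizer and a binary vectorizer with FeatureUnion, then feed the concatenated non-negative matrix to a single classifier such as MultinomialNB.

```python
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Count features and binary presence features, side by side in one matrix
fused_features = FeatureUnion([
    ("counts", CountVectorizer()),
    ("presence", CountVectorizer(binary=True)),
])

pipeline = Pipeline([
    ("features", fused_features),
    ("classifier", MultinomialNB(alpha=1.0)),
])

# Toy usage with placeholder documents and labels
docs = ["great product great value", "terrible awful waste of money"]
labels = ["positive", "negative"]
pipeline.fit(docs, labels)
print(pipeline.predict(["great value"]))
```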
Approach 2: Model Ensemble
Train both Multinomial and Bernoulli NB separately, then combine their predictions, for example by voting or by averaging their class probabilities (both strategies appear in the implementation below).
Approach 3: Confidence-Weighted Selection
Use whichever model is more confident for each document, for example the one whose predicted class probabilities are more widely separated. This is the 'confidence' strategy in the implementation below.
```python
import math
from collections import Counter
from typing import List, Dict, Tuple


class NaiveBayesEnsemble:
    """
    Ensemble of Multinomial and Bernoulli Naive Bayes.
    Combines predictions from both models using various strategies.
    """

    def __init__(self, alpha: float = 1.0, strategy: str = 'average'):
        """
        Args:
            alpha: Smoothing parameter for both models
            strategy: 'vote', 'average', or 'confidence'
        """
        self.alpha = alpha
        self.strategy = strategy
        self.classes = []

    def fit(self, docs: List[str], labels: List[str]):
        """Fit both underlying models (simplified inline implementations)."""
        self.classes = sorted(set(labels))
        tokenized_mnb = [doc.lower().split() for doc in docs]
        tokenized_bnb = [set(doc.lower().split()) for doc in docs]
        vocab = set()
        for tokens in tokenized_mnb:
            vocab.update(tokens)
        class_counts = Counter(labels)
        n_docs = len(docs)

        # Multinomial parameters
        self.mnb_log_priors = {c: math.log(class_counts[c] / n_docs) for c in self.classes}
        mnb_word_counts = {c: Counter() for c in self.classes}
        for tokens, label in zip(tokenized_mnb, labels):
            mnb_word_counts[label].update(tokens)
        self.mnb_log_probs = {}
        for c in self.classes:
            total = sum(mnb_word_counts[c].values())
            self.mnb_log_probs[c] = {
                w: math.log((mnb_word_counts[c][w] + self.alpha) / (total + self.alpha * len(vocab)))
                for w in vocab
            }
            self.mnb_log_probs[c]['__unseen__'] = math.log(
                self.alpha / (total + self.alpha * len(vocab))
            )

        # Bernoulli parameters
        self.bnb_log_priors = self.mnb_log_priors.copy()
        bnb_doc_counts = {c: Counter() for c in self.classes}
        for tokens, label in zip(tokenized_bnb, labels):
            for token in tokens:
                bnb_doc_counts[label][token] += 1
        self.bnb_log_probs = {}
        self.bnb_baselines = {}
        for c in self.classes:
            N_c = class_counts[c]
            probs = {}
            baseline = 0.0
            for word in vocab:
                p = (bnb_doc_counts[c][word] + self.alpha) / (N_c + 2 * self.alpha)
                probs[word] = (math.log(p), math.log(1 - p))
                baseline += math.log(1 - p)
            self.bnb_log_probs[c] = probs
            self.bnb_baselines[c] = baseline

        self.vocab = vocab
        return self

    def _mnb_predict_proba(self, doc: str) -> Dict[str, float]:
        """Get Multinomial NB probabilities."""
        token_counts = Counter(doc.lower().split())
        log_probs = {}
        for c in self.classes:
            score = self.mnb_log_priors[c]
            for token, count in token_counts.items():
                if token in self.mnb_log_probs[c]:
                    score += count * self.mnb_log_probs[c][token]
                else:
                    score += count * self.mnb_log_probs[c]['__unseen__']
            log_probs[c] = score
        # Normalize in log space for numerical stability
        max_log = max(log_probs.values())
        probs = {c: math.exp(lp - max_log) for c, lp in log_probs.items()}
        total = sum(probs.values())
        return {c: p / total for c, p in probs.items()}

    def _bnb_predict_proba(self, doc: str) -> Dict[str, float]:
        """Get Bernoulli NB probabilities."""
        present = set(doc.lower().split()) & self.vocab
        log_probs = {}
        for c in self.classes:
            score = self.bnb_log_priors[c] + self.bnb_baselines[c]
            for word in present:
                log_p, log_q = self.bnb_log_probs[c][word]
                score += log_p - log_q
            log_probs[c] = score
        max_log = max(log_probs.values())
        probs = {c: math.exp(lp - max_log) for c, lp in log_probs.items()}
        total = sum(probs.values())
        return {c: p / total for c, p in probs.items()}

    def predict(self, doc: str) -> str:
        """Predict using the configured ensemble strategy."""
        mnb_probs = self._mnb_predict_proba(doc)
        bnb_probs = self._bnb_predict_proba(doc)

        if self.strategy == 'vote':
            # Majority voting with average-probability tie-breaker
            mnb_pred = max(mnb_probs, key=mnb_probs.get)
            bnb_pred = max(bnb_probs, key=bnb_probs.get)
            if mnb_pred == bnb_pred:
                return mnb_pred
            avg_probs = {c: (mnb_probs[c] + bnb_probs[c]) / 2 for c in self.classes}
            return max(avg_probs, key=avg_probs.get)
        elif self.strategy == 'average':
            # Average probabilities
            avg_probs = {c: (mnb_probs[c] + bnb_probs[c]) / 2 for c in self.classes}
            return max(avg_probs, key=avg_probs.get)
        elif self.strategy == 'confidence':
            # Use the model with higher confidence (wider probability spread)
            mnb_conf = max(mnb_probs.values()) - min(mnb_probs.values())
            bnb_conf = max(bnb_probs.values()) - min(bnb_probs.values())
            if mnb_conf > bnb_conf:
                return max(mnb_probs, key=mnb_probs.get)
            return max(bnb_probs, key=bnb_probs.get)
        else:
            raise ValueError(f"Unknown strategy: {self.strategy}")

    def predict_with_explanation(self, doc: str) -> Tuple[str, Dict]:
        """Predict with a detailed explanation of the ensemble decision."""
        mnb_probs = self._mnb_predict_proba(doc)
        bnb_probs = self._bnb_predict_proba(doc)
        mnb_pred = max(mnb_probs, key=mnb_probs.get)
        bnb_pred = max(bnb_probs, key=bnb_probs.get)
        final_pred = self.predict(doc)
        return final_pred, {
            'multinomial_prediction': mnb_pred,
            'multinomial_probs': mnb_probs,
            'bernoulli_prediction': bnb_pred,
            'bernoulli_probs': bnb_probs,
            'models_agree': mnb_pred == bnb_pred,
            'ensemble_strategy': self.strategy,
        }


def demo_ensemble():
    """Demonstrate the ensemble on a toy sentiment task."""
    training_data = [
        ("great product excellent quality", "positive"),
        ("amazing value highly recommend", "positive"),
        ("wonderful experience fantastic", "positive"),
        ("terrible waste awful quality", "negative"),
        ("horrible service disappointed", "negative"),
        ("poor product never again", "negative"),
    ] * 3
    docs = [d for d, _ in training_data]
    labels = [l for _, l in training_data]

    print("ENSEMBLE NAIVE BAYES DEMONSTRATION")
    print("=" * 70)

    # Test different strategies
    strategies = ['vote', 'average', 'confidence']
    test_docs = [
        "great product recommend",
        "terrible service avoid",
        "okay product decent value",  # Ambiguous
    ]

    for strategy in strategies:
        ensemble = NaiveBayesEnsemble(alpha=1.0, strategy=strategy)
        ensemble.fit(docs, labels)
        print(f"\nStrategy: {strategy.upper()}")
        print("-" * 50)
        for doc in test_docs:
            pred, details = ensemble.predict_with_explanation(doc)
            mnb_p = details['multinomial_probs']['positive']
            bnb_p = details['bernoulli_probs']['positive']
            print(f"'{doc[:35]:35s}'")
            print(f"  MNB: {details['multinomial_prediction']} ({mnb_p:.3f})")
            print(f"  BNB: {details['bernoulli_prediction']} ({bnb_p:.3f})")
            print(f"  Ensemble: {pred} (agree: {details['models_agree']})")


demo_ensemble()
```

Ensembling Multinomial and Bernoulli NB is most valuable when: (1) your documents have variable lengths (some short, some long), (2) you need high reliability and have computational budget, (3) the two models disagree frequently on validation data (indicating they capture different signals). For most applications, a single well-chosen model with proper tuning is sufficient.
Deploying Naive Bayes classifiers in production requires attention to several practical considerations.
```python
# Using scikit-learn for production Naive Bayes

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
import numpy as np


def create_production_pipeline(model_type: str = 'multinomial',
                               max_features: int = 10000,
                               ngram_range: tuple = (1, 2),
                               alpha: float = 1.0):
    """
    Create a production-ready text classification pipeline.

    Args:
        model_type: 'multinomial' or 'bernoulli'
        max_features: Maximum vocabulary size
        ngram_range: N-gram range (e.g., (1, 2) for unigrams and bigrams)
        alpha: Smoothing parameter

    Returns:
        sklearn Pipeline ready for fit() and predict()
    """
    # Choose vectorizer based on model
    if model_type == 'multinomial':
        vectorizer = CountVectorizer(
            max_features=max_features,
            ngram_range=ngram_range,
            stop_words='english',
            min_df=2,
            max_df=0.95,
        )
        classifier = MultinomialNB(alpha=alpha)
    else:  # bernoulli
        vectorizer = CountVectorizer(
            max_features=max_features,
            ngram_range=ngram_range,
            stop_words='english',
            binary=True,  # Key difference: convert counts to binary indicators
            min_df=2,
            max_df=0.95,
        )
        classifier = BernoulliNB(alpha=alpha)

    return Pipeline([
        ('vectorizer', vectorizer),
        ('classifier', classifier),
    ])


def compare_models_sklearn(docs: list, labels: list, cv: int = 5):
    """Compare Multinomial vs Bernoulli NB using cross-validation."""
    mnb_pipeline = create_production_pipeline('multinomial')
    bnb_pipeline = create_production_pipeline('bernoulli')

    mnb_scores = cross_val_score(mnb_pipeline, docs, labels, cv=cv, scoring='accuracy')
    bnb_scores = cross_val_score(bnb_pipeline, docs, labels, cv=cv, scoring='accuracy')

    print("Cross-Validation Results:")
    print(f"  Multinomial NB: {mnb_scores.mean():.3f} (+/- {mnb_scores.std() * 2:.3f})")
    print(f"  Bernoulli NB:   {bnb_scores.mean():.3f} (+/- {bnb_scores.std() * 2:.3f})")

    # Rough effect-size comparison (not a formal significance test)
    mean_diff = abs(mnb_scores.mean() - bnb_scores.mean())
    pooled_std = np.sqrt((mnb_scores.std()**2 + bnb_scores.std()**2) / 2)
    effect_size = mean_diff / pooled_std if pooled_std > 0 else 0
    print(f"\n  Difference: {mean_diff:.3f}, Effect size: {effect_size:.2f}")

    if effect_size < 0.2:
        print("  Conclusion: Models perform similarly")
    elif mnb_scores.mean() > bnb_scores.mean():
        print("  Conclusion: Multinomial NB preferred")
    else:
        print("  Conclusion: Bernoulli NB preferred")

    return mnb_pipeline, bnb_pipeline


def production_workflow_example():
    """Complete production workflow example."""
    import pickle

    # 1. Prepare training data
    train_docs = [
        "excellent product highly recommend",
        "great quality amazing value",
        "terrible waste of money",
        "horrible experience never again",
    ] * 25
    train_labels = ["positive", "positive", "negative", "negative"] * 25

    # 2. Compare models and select the best
    mnb_pipe, bnb_pipe = compare_models_sklearn(train_docs, train_labels)

    # 3. Train final model on full data
    best_pipeline = mnb_pipe  # Assume MNB won
    best_pipeline.fit(train_docs, train_labels)

    # 4. Save model for production
    with open('naive_bayes_classifier.pkl', 'wb') as f:
        pickle.dump(best_pipeline, f)

    # 5. Load and use in production
    with open('naive_bayes_classifier.pkl', 'rb') as f:
        loaded_pipeline = pickle.load(f)

    # 6. Make predictions
    new_docs = [
        "great product love it",
        "terrible service avoid",
    ]
    predictions = loaded_pipeline.predict(new_docs)
    probabilities = loaded_pipeline.predict_proba(new_docs)

    print("\nProduction Predictions:")
    for doc, pred, prob in zip(new_docs, predictions, probabilities):
        print(f"  '{doc}' -> {pred} (confidence: {max(prob):.3f})")


# Run example
production_workflow_example()
```

We've completed a comprehensive exploration of Naive Bayes for text classification. Let's consolidate the key insights from this entire module.
| If Your Documents Are... | Choose... | Because... |
|---|---|---|
| Long (100+ words) | Multinomial NB | Word frequencies provide rich signal |
| Short (< 30 words) | Bernoulli NB | Presence/absence patterns dominate |
| Variable length | Try both + validate | Neither assumption perfectly fits |
| Repetition matters | Multinomial NB | Captures frequency information |
| Absence is evidence | Bernoulli NB | Explicitly models word absence |
Beyond This Module:
Naive Bayes classifiers, despite their simplicity and the strong independence assumption, remain remarkably effective for text classification. They provide fast training and prediction, solid performance with limited training data, interpretable per-word parameters, and a strong baseline against which to judge more complex models.
As you advance in machine learning, you'll encounter more sophisticated NLP methods (TF-IDF + logistic regression, word embeddings, transformers), but Naive Bayes remains a valuable tool in your arsenal—especially when simplicity, speed, interpretability, or limited data are priorities.
Congratulations! You have achieved mastery of Multinomial and Bernoulli Naive Bayes for text classification. You understand their probabilistic foundations, can implement them from scratch, know when to choose each model, and can deploy them effectively in production. This knowledge forms a solid foundation for advanced NLP and machine learning applications.