Imagine training a spam classifier on 10,000 emails. You deploy it, and the very first email contains a word your training data never saw: "cryptocurrency." What happens?
Catastrophic failure.
Without smoothing, the probability $P(\text{cryptocurrency} | \text{spam}) = 0$ because the word never appeared in spam training documents. When we multiply this zero probability with all other word probabilities, the entire likelihood becomes zero. The classifier cannot function.
This is the zero-frequency problem—one of the most fundamental challenges in probabilistic text classification. Laplace smoothing is the elegant solution that transforms zero probabilities into small but nonzero values, allowing the classifier to gracefully handle previously unseen words.
By the end of this page, you will understand why the zero-frequency problem is catastrophic, how Laplace smoothing solves it through pseudo-counts, the Bayesian interpretation as a prior distribution, variants like Lidstone and Jeffreys smoothing, practical guidance for choosing the smoothing parameter α, and how smoothing affects model behavior and decision boundaries.
The zero-frequency problem is not just a theoretical concern—it's a practical reality that affects every text classifier.
Why Zeros Occur:
Vocabulary growth: Natural language has a long-tail distribution (Zipf's law). The vocabulary of any finite training set is a subset of all possible words. New words constantly appear in test data.
Class-specific gaps: Even words seen in training may not appear in all classes. "Algorithm" might appear in tech news but never in sports news.
Rare combinations: Legitimate words may simply not have appeared in your training sample by chance.
Mathematical Disaster:
Recall the Naive Bayes classification formula:
$$\log P(c | d) \propto \log P(c) + \sum_{w \in d} \log P(w | c)$$
If $P(w | c) = 0$ for any word $w$ in document $d$:
$$\log P(w | c) = \log(0) = -\infty$$
The entire sum becomes $-\infty$, making comparison between classes meaningless. Even if a document has 99 words strongly indicating spam, a single unseen word makes $P(\text{spam} | d) = 0$.
The following demonstration makes this failure concrete:

```python
import math
from collections import Counter


def demonstrate_zero_frequency_problem():
    """
    Show how a single unseen word breaks classification.
    """
    # Training data
    training_spam = ["free money", "free prize", "win cash", "claim prize"]
    training_ham = ["meeting today", "report ready", "project update"]

    # Count words per class (Maximum Likelihood Estimation)
    spam_word_counts = Counter()
    for doc in training_spam:
        spam_word_counts.update(doc.lower().split())

    ham_word_counts = Counter()
    for doc in training_ham:
        ham_word_counts.update(doc.lower().split())

    total_spam_words = sum(spam_word_counts.values())
    total_ham_words = sum(ham_word_counts.values())

    # MLE probability function (no smoothing)
    def mle_prob(word, class_counts, total):
        return class_counts[word] / total if class_counts[word] > 0 else 0.0

    print("VOCABULARY COVERAGE ANALYSIS")
    print("=" * 60)
    spam_vocab = set(spam_word_counts.keys())
    ham_vocab = set(ham_word_counts.keys())
    print(f"Spam vocabulary: {spam_vocab}")
    print(f"Ham vocabulary: {ham_vocab}")
    print(f"Overlap: {spam_vocab & ham_vocab}")
    print()

    # Test document with unseen word
    test_doc = "free cryptocurrency offer"
    test_words = test_doc.lower().split()

    print(f"Test document: '{test_doc}'")
    print("-" * 60)

    # Calculate log probabilities (without smoothing)
    log_prob_spam = 0
    log_prob_ham = 0

    print(f"{'Word':20s} {'P(w|spam)':>12s} {'P(w|ham)':>12s} {'log(spam)':>12s} {'log(ham)':>12s}")
    print("-" * 70)

    for word in test_words:
        p_spam = mle_prob(word, spam_word_counts, total_spam_words)
        p_ham = mle_prob(word, ham_word_counts, total_ham_words)

        # Handle log(0) case for display
        log_spam_str = f"{math.log(p_spam):.4f}" if p_spam > 0 else "-∞"
        log_ham_str = f"{math.log(p_ham):.4f}" if p_ham > 0 else "-∞"

        print(f"{word:20s} {p_spam:>12.4f} {p_ham:>12.4f} {log_spam_str:>12s} {log_ham_str:>12s}")

        if p_spam > 0:
            log_prob_spam += math.log(p_spam)
        else:
            log_prob_spam = float('-inf')

        if p_ham > 0:
            log_prob_ham += math.log(p_ham)
        else:
            log_prob_ham = float('-inf')

    print("-" * 70)
    # Convert the totals to strings so that -inf formats cleanly with the string spec
    print(f"{'TOTAL':20s} {'':>12s} {'':>12s} {str(log_prob_spam):>12s} {str(log_prob_ham):>12s}")
    print()

    if log_prob_spam == float('-inf') and log_prob_ham == float('-inf'):
        print("RESULT: Both classes have -∞ probability!")
        print("        Classification is UNDEFINED.")
    elif log_prob_spam == float('-inf'):
        print("RESULT: P(spam|doc) = 0, even though 'free' strongly suggests spam!")
    elif log_prob_ham == float('-inf'):
        print("RESULT: P(ham|doc) = 0")

    print("\n⚠️ A single unseen word ('cryptocurrency') broke the classifier!")


demonstrate_zero_frequency_problem()
```

In production systems, the zero-frequency problem causes: (1) Complete classifier failure on documents with new words, (2) Undefined behavior when log(0) is computed, (3) Division by zero errors in some implementations, (4) Silent errors where predictions appear confident but are meaningless. This is NOT a corner case—it happens on virtually every real document if you don't use smoothing.
Laplace smoothing (also called add-one smoothing) is a simple, elegant technique that prevents zero probabilities by pretending we've seen every word at least once.
The Core Idea:
Instead of using raw counts, we add a small constant $\alpha$ to every word count. For $\alpha = 1$ (classic Laplace smoothing):
$$P(w | c) = \frac{\text{count}(w, c) + 1}{\text{total_count}(c) + |V|}$$
where $|V|$ is the vocabulary size.
Why the Denominator Also Changes:
If we only added 1 to the numerator, probabilities wouldn't sum to 1. Since we add 1 to each of the $|V|$ vocabulary words, we must add $|V|$ to the denominator:
$$\sum_{w \in V} P(w | c) = \sum_{w \in V} \frac{\text{count}(w, c) + 1}{\text{total_count}(c) + |V|} = \frac{\text{total_count}(c) + |V|}{\text{total_count}(c) + |V|} = 1 \checkmark$$
General Formulation:
With smoothing parameter $\alpha$:
$$P(w | c) = \frac{\text{count}(w, c) + \alpha}{\text{total_count}(c) + \alpha \cdot |V|}$$
| Word | Count | MLE P(w|c) | Laplace P(w|c) (α=1) | Lidstone P(w|c) (α=0.1) |
|---|---|---|---|---|
| free | 50 | 50/500 = 0.100 | 51/600 = 0.085 | 50.1/510 = 0.098 |
| meeting | 10 | 10/500 = 0.020 | 11/600 = 0.018 | 10.1/510 = 0.020 |
| cryptocurrency | 0 | 0/500 = 0.000 | 1/600 = 0.0017 | 0.1/510 = 0.0002 |
| Total counts | 500 | — | 500+100=600 | 500+10=510 |
Assuming vocabulary size |V| = 100
Tradeoffs of Smoothing:
Shrinks high-frequency probabilities: Words with high counts get slightly reduced probabilities because we're redistributing probability mass to unseen words.
Creates non-zero floor: All words get at least $\alpha / (\text{total} + \alpha|V|)$ probability.
Affects discrimination: Heavy smoothing ($\alpha \gg 1$) pushes all word probabilities toward uniform, reducing discriminative power.
Scale-dependent: The effect of a fixed $\alpha$ depends on corpus size. With 100 words, $\alpha = 1$ is significant; with 1,000,000 words, it's negligible.
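To make the last point concrete, here is a minimal sketch (assuming a hypothetical vocabulary of 10,000 words) showing how the same α = 1 overwhelms a well-estimated probability in a tiny corpus but barely moves it in a large one.

```python
# How much does alpha = 1 distort a word whose MLE probability is 0.10?
def laplace(count: int, total: int, alpha: float = 1.0, vocab: int = 10_000) -> float:
    return (count + alpha) / (total + alpha * vocab)

for total in (100, 1_000_000):
    count = total // 10  # a word carrying 10% of the class's tokens
    print(f"total={total:>9,}: MLE={count / total:.3f}, Laplace={laplace(count, total):.3f}")

# total=      100: MLE=0.100, Laplace=0.001   <- alpha = 1 overwhelms a tiny corpus
# total=1,000,000: MLE=0.100, Laplace=0.099   <- alpha = 1 is negligible at scale
```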
The more complete demonstration below compares MLE, Laplace, and Lidstone estimates side by side and verifies that the smoothed distributions still normalize:

```python
import math
from collections import Counter
from typing import Dict, List


class SmoothingDemo:
    """
    Demonstrates different smoothing strategies and their effects.
    """

    def __init__(self, vocabulary_size: int):
        self.vocab_size = vocabulary_size

    def mle_probability(self, count: int, total: int) -> float:
        """Maximum Likelihood Estimate (no smoothing)."""
        return count / total if total > 0 else 0.0

    def laplace_probability(self, count: int, total: int, alpha: float = 1.0) -> float:
        """Laplace/Lidstone smoothed probability."""
        return (count + alpha) / (total + alpha * self.vocab_size)

    def jeffreys_probability(self, count: int, total: int) -> float:
        """Jeffreys smoothing (alpha = 0.5)."""
        return self.laplace_probability(count, total, alpha=0.5)

    def compare_smoothing(self, word_counts: Dict[str, int], total: int):
        """Compare different smoothing methods."""
        print("SMOOTHING METHOD COMPARISON")
        print("=" * 80)
        print(f"Total word count: {total}")
        print(f"Vocabulary size: {self.vocab_size}")
        print()

        headers = ["Word", "Count", "MLE", "Laplace(α=1)", "Lidstone(α=0.5)", "Lidstone(α=0.1)"]
        print(f"{headers[0]:15s} {headers[1]:>8s} {headers[2]:>10s} "
              f"{headers[3]:>14s} {headers[4]:>16s} {headers[5]:>16s}")
        print("-" * 85)

        for word, count in sorted(word_counts.items(), key=lambda x: -x[1]):
            mle = self.mle_probability(count, total)
            lap1 = self.laplace_probability(count, total, alpha=1.0)
            lid05 = self.laplace_probability(count, total, alpha=0.5)
            lid01 = self.laplace_probability(count, total, alpha=0.1)
            print(f"{word:15s} {count:>8d} {mle:>10.6f} {lap1:>14.6f} {lid05:>16.6f} {lid01:>16.6f}")

        # Add unseen word
        word = "(unseen)"
        count = 0
        mle = self.mle_probability(count, total)
        lap1 = self.laplace_probability(count, total, alpha=1.0)
        lid05 = self.laplace_probability(count, total, alpha=0.5)
        lid01 = self.laplace_probability(count, total, alpha=0.1)
        print("-" * 85)
        print(f"{word:15s} {count:>8d} {mle:>10.6f} {lap1:>14.6f} {lid05:>16.6f} {lid01:>16.6f}")

        # Verify normalization
        print()
        print("Normalization check (should be 1.0):")
        for alpha, name in [(1.0, "Laplace"), (0.5, "Lidstone(0.5)"), (0.1, "Lidstone(0.1)")]:
            total_prob = sum(
                self.laplace_probability(c, total, alpha)
                for c in word_counts.values()
            )
            # Add probability mass for unseen words
            n_unseen = self.vocab_size - len(word_counts)
            total_prob += n_unseen * self.laplace_probability(0, total, alpha)
            print(f"  {name}: {total_prob:.6f}")


def demonstrate_smoothing_effect():
    """Show how smoothing fixes the zero-frequency problem."""
    # Simulated word counts for spam class
    spam_counts = {
        'free': 50, 'money': 30, 'prize': 25,
        'winner': 20, 'click': 15, 'urgent': 10,
    }
    total_spam = sum(spam_counts.values())

    demo = SmoothingDemo(vocabulary_size=1000)  # Assume 1000-word vocabulary
    demo.compare_smoothing(spam_counts, total_spam)

    # Show classification with and without smoothing
    print()
    print("=" * 80)
    print("CLASSIFICATION WITH NEW WORD")
    print("=" * 80)

    test_words = ['free', 'cryptocurrency', 'money']
    print(f"\nTest document: {test_words}")
    print()

    # Without smoothing
    log_prob_mle = 0
    print("Without smoothing (MLE):")
    for word in test_words:
        count = spam_counts.get(word, 0)
        prob = demo.mle_probability(count, total_spam)
        if prob > 0:
            log_prob_mle += math.log(prob)
            print(f"  P({word}|spam) = {prob:.6f} -> log = {math.log(prob):.4f}")
        else:
            log_prob_mle = float('-inf')
            print(f"  P({word}|spam) = 0.000000 -> log = -∞ ❌")
            break
    print(f"  Total: {log_prob_mle}")

    # With Laplace smoothing
    print("\nWith Laplace smoothing (α=1):")
    log_prob_laplace = 0
    for word in test_words:
        count = spam_counts.get(word, 0)
        prob = demo.laplace_probability(count, total_spam, alpha=1.0)
        log_prob_laplace += math.log(prob)
        print(f"  P({word}|spam) = {prob:.6f} -> log = {math.log(prob):.4f}")
    print(f"  Total: {log_prob_laplace:.4f} ✓")


demonstrate_smoothing_effect()
```

Laplace smoothing is not just a heuristic fix—it has a rigorous Bayesian interpretation. Understanding this connection provides deeper insight into when and why smoothing works.
The Bayesian Setup for Multinomial NB:
For each class $c$, we model the word distribution as a multinomial with parameters $\boldsymbol{\theta}^c = (\theta_1^c, \theta_2^c, \ldots, \theta_{|V|}^c)$, where $\theta_w^c = P(w | c)$ and $\sum_w \theta_w^c = 1$.
Prior Distribution:
We place a Dirichlet prior on the multinomial parameters:
$$\boldsymbol{\theta}^c \sim \text{Dirichlet}(\alpha_1, \alpha_2, \ldots, \alpha_{|V|})$$
With symmetric hyperparameter $\alpha$ (same for all words):
$$\boldsymbol{\theta}^c \sim \text{Dirichlet}(\alpha, \alpha, \ldots, \alpha)$$
Posterior Distribution:
After observing word counts $(n_1, n_2, \ldots, n_{|V|})$ in class $c$ documents, the posterior is also Dirichlet:
$$\boldsymbol{\theta}^c | \text{data} \sim \text{Dirichlet}(n_1 + \alpha, n_2 + \alpha, \ldots, n_{|V|} + \alpha)$$
Posterior Mean Estimate:
The posterior mean (expected value) of each $\theta_w^c$ is:
$$E[\theta_w^c | \text{data}] = \frac{n_w + \alpha}{\sum_{w'}(n_{w'} + \alpha)} = \frac{n_w + \alpha}{N + \alpha |V|}$$
This is exactly the Laplace-smoothed estimate! The smoothing parameter $\alpha$ is the prior pseudo-count.
The Dirichlet(α, α, ..., α) prior can be interpreted as having seen α 'virtual observations' of each word before seeing any real data. With α = 1, we act as if we've seen each vocabulary word exactly once. With α = 0.5 (Jeffreys prior), we've seen half a virtual observation of each word. This prior belief gets combined with actual observations to produce the posterior estimate.
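As a quick numerical check of this equivalence, the sketch below (with made-up word counts and a 10-word vocabulary, and assuming NumPy is available) samples from the Dirichlet posterior and confirms that its mean matches the Laplace-smoothed estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

alpha = 1.0
counts = np.array([50, 30, 25, 20, 15, 10, 0, 0, 0, 0])  # hypothetical counts, |V| = 10

# Laplace-smoothed estimate: (n_w + alpha) / (N + alpha * |V|)
laplace = (counts + alpha) / (counts.sum() + alpha * len(counts))

# Posterior mean of Dirichlet(n_1 + alpha, ..., n_|V| + alpha), estimated by sampling
samples = rng.dirichlet(counts + alpha, size=200_000)
posterior_mean = samples.mean(axis=0)

print(np.round(laplace, 4))
print(np.round(posterior_mean, 4))  # agrees with the Laplace estimate up to sampling noise
```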
| Smoothing Name | α Value | Prior Belief | Characteristics |
|---|---|---|---|
| No Smoothing (MLE) | α = 0 | Improper prior (no prior belief) | Zero probabilities for unseen words; unstable with small samples |
| Lidstone | 0 < α < 1 | Weak prior (less than one pseudo-count) | Less shrinkage than Laplace; data-dominated |
| Jeffreys | α = 0.5 | Jeffreys prior (non-informative) | Invariant under reparameterization; a principled middle ground between MLE and Laplace |
| Laplace | α = 1 | Uniform prior (one pseudo-count) | Simple, robust; standard choice |
| Heavy Smoothing | α > 1 | Strong prior toward uniformity | Reduces discrimination; regularizes heavily |
For Bernoulli NB:
The Bayesian interpretation differs slightly. Each word's presence indicator follows a Bernoulli distribution with parameter $p_w^c = P(w = 1 | c)$.
We place a Beta prior:
$$p_w^c \sim \text{Beta}(\alpha, \alpha)$$
After observing $n_w$ documents containing word $w$ out of $N$ documents in class $c$:
$$p_w^c | \text{data} \sim \text{Beta}(n_w + \alpha, N - n_w + \alpha)$$
The posterior mean is:
$$E[p_w^c | \text{data}] = \frac{n_w + \alpha}{N + 2\alpha}$$
This explains why Bernoulli NB has $2\alpha$ in the denominator (not $\alpha |V|$)—we're adding $\alpha$ pseudo-counts for each of the two outcomes (present/absent).
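The same check works for the Bernoulli case. The sketch below uses illustrative numbers (an assumed class of 40 documents, 12 containing the word) and compares the closed-form smoothed estimate against the mean of the corresponding Beta posterior, assuming SciPy is available.

```python
from scipy.stats import beta

alpha = 1.0
N = 40     # documents of class c (illustrative)
n_w = 12   # documents of class c containing word w (illustrative)

# Closed-form smoothed estimate: (n_w + alpha) / (N + 2 * alpha)
smoothed = (n_w + alpha) / (N + 2 * alpha)

# Posterior Beta(n_w + alpha, N - n_w + alpha)
posterior = beta(n_w + alpha, N - n_w + alpha)

print(smoothed)          # 13/42 ≈ 0.3095
print(posterior.mean())  # identical: the posterior mean is the smoothed estimate
```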
While $\alpha = 1$ (Laplace smoothing) is the most common choice, the optimal value depends on your data and task. Here's how to think about choosing $\alpha$.
Factors Affecting Optimal α: training-corpus size (larger corpora tolerate smaller α), vocabulary size (larger vocabularies spread the pseudo-counts thinner), and how often rare or unseen words appear at prediction time.
Practical Guidelines:
Start with α = 1: The default Laplace smoothing works well in most cases.
Try α < 1 for large corpora: With substantial training data (>100,000 documents), try α = 0.5 or α = 0.1 to reduce shrinkage of well-estimated probabilities.
Use cross-validation: Treat α as a hyperparameter and select via held-out validation. Grid search over [0.01, 0.1, 0.5, 1.0, 2.0] (see the scikit-learn sketch after this list).
Consider relative α: Some implementations use α that scales with vocabulary size, e.g., α = 1/|V|, giving total pseudo-count of 1 across vocabulary.
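If you use scikit-learn, the smoothing parameter is the `alpha` argument of `MultinomialNB`, and the grid search described above can be expressed directly. A minimal sketch, with placeholder `docs`/`labels` lists standing in for your corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder corpus; replace with your own documents and labels
docs = ["free money now", "meeting today", "win a prize", "project update"] * 10
labels = (["spam", "ham"] * 2) * 10

pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
param_grid = {"multinomialnb__alpha": [0.01, 0.1, 0.5, 1.0, 2.0]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(docs, labels)
print(search.best_params_, search.best_score_)
```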
The from-scratch implementation below performs the same search manually, which makes it explicit where α enters the estimates:

```python
import math
from collections import Counter
from typing import List, Tuple

import numpy as np


def cross_validate_alpha(documents: List[str], labels: List[str],
                         alpha_values: List[float],
                         n_folds: int = 5) -> List[Tuple[float, float, float]]:
    """
    Cross-validation to select the optimal smoothing parameter.
    Returns a list of (alpha, mean_accuracy, std_accuracy) tuples.
    """
    from random import shuffle

    # Create folds
    indices = list(range(len(documents)))
    shuffle(indices)
    fold_size = len(indices) // n_folds
    folds = [indices[i * fold_size:(i + 1) * fold_size] for i in range(n_folds)]

    results = []
    for alpha in alpha_values:
        fold_accuracies = []

        for fold_idx in range(n_folds):
            # Split into train/test
            test_indices = set(folds[fold_idx])
            train_docs = [documents[i] for i in range(len(documents)) if i not in test_indices]
            train_labels = [labels[i] for i in range(len(labels)) if i not in test_indices]
            test_docs = [documents[i] for i in test_indices]
            test_labels = [labels[i] for i in test_indices]

            # Train with this alpha
            model = SimpleNBWithAlpha(alpha=alpha)
            model.fit(train_docs, train_labels)

            # Evaluate
            correct = sum(
                model.predict(doc) == label
                for doc, label in zip(test_docs, test_labels)
            )
            accuracy = correct / len(test_docs)
            fold_accuracies.append(accuracy)

        mean_accuracy = np.mean(fold_accuracies)
        std_accuracy = np.std(fold_accuracies)
        results.append((alpha, mean_accuracy, std_accuracy))

    return results


class SimpleNBWithAlpha:
    """Minimal NB implementation for alpha testing."""

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha
        self.log_priors = {}
        self.log_probs = {}
        self.vocab = set()

    def fit(self, docs: List[str], labels: List[str]):
        tokenized = [doc.lower().split() for doc in docs]
        for tokens in tokenized:
            self.vocab.update(tokens)

        classes = set(labels)
        class_counts = Counter(labels)
        n_docs = len(docs)
        self.log_priors = {c: math.log(class_counts[c] / n_docs) for c in classes}

        word_counts = {c: Counter() for c in classes}
        for tokens, label in zip(tokenized, labels):
            word_counts[label].update(tokens)

        self.log_probs = {}
        for c in classes:
            total = sum(word_counts[c].values())
            self.log_probs[c] = {
                w: math.log((word_counts[c][w] + self.alpha) /
                            (total + self.alpha * len(self.vocab)))
                for w in self.vocab
            }
            # Default for unseen
            self.log_probs[c]['__unseen__'] = math.log(
                self.alpha / (total + self.alpha * len(self.vocab))
            )

    def predict(self, doc: str) -> str:
        tokens = doc.lower().split()
        scores = {}
        for c in self.log_priors:
            score = self.log_priors[c]
            for token in tokens:
                if token in self.log_probs[c]:
                    score += self.log_probs[c][token]
                else:
                    score += self.log_probs[c]['__unseen__']
            scores[c] = score
        return max(scores, key=scores.get)


def demonstrate_alpha_effect():
    """Show how different alpha values affect classification."""
    # Synthetic data
    training_data = [
        ("great product love it amazing quality", "positive"),
        ("excellent service wonderful experience", "positive"),
        ("fantastic value highly recommend", "positive"),
        ("terrible product waste of money", "negative"),
        ("horrible service very disappointed", "negative"),
        ("awful quality never buying again", "negative"),
    ] * 5  # Repeat for more data

    # Add some noise
    training_data.extend([
        ("decent product okay value", "positive"),
        ("not great not terrible", "negative"),
    ])

    docs = [d for d, _ in training_data]
    labels = [l for _, l in training_data]

    # Test documents (including novel words)
    test_docs = [
        "great product recommend",             # Clear positive
        "terrible waste disappointed",         # Clear negative
        "good product blockchain technology",  # Positive with unseen word
        "bad service cryptocurrency scam",     # Negative with unseen words
    ]
    test_labels = ["positive", "negative", "positive", "negative"]

    print("ALPHA SELECTION EXPERIMENT")
    print("=" * 70)

    alpha_values = [0.01, 0.1, 0.5, 1.0, 2.0, 5.0]

    print(f"\n{'α':>8s}", end="")
    for doc in test_docs:
        print(f" | {doc[:20] + '..':22s}", end="")
    print(" | Accuracy")
    print("-" * 120)

    for alpha in alpha_values:
        model = SimpleNBWithAlpha(alpha=alpha)
        model.fit(docs, labels)

        predictions = [model.predict(doc) for doc in test_docs]
        accuracy = sum(p == t for p, t in zip(predictions, test_labels)) / len(test_labels)

        print(f"{alpha:>8.2f}", end="")
        for pred, true in zip(predictions, test_labels):
            mark = "✓" if pred == true else "✗"
            print(f" | {pred[:6]:6s} {mark:15s}", end="")
        print(f" | {accuracy:.2%}")

    print()
    print("Observations:")
    print("- Very small α (0.01) may undersmooth: unseen words get too little probability")
    print("- Very large α (5.0) may oversmooth: reduces discrimination between classes")
    print("- Optimal α often in range [0.1, 1.0] for typical datasets")


demonstrate_alpha_effect()
```

For most text classification tasks, α = 1 (Laplace smoothing) is a safe default. If you have time to tune, try values in [0.1, 2.0] via cross-validation. Rarely does optimal α fall outside this range. If you notice poor handling of new words, increase α; if you notice reduced discrimination on well-represented classes, decrease α.
Smoothing doesn't just fix technical issues—it fundamentally changes the model's behavior and decision boundaries.
Log-Odds Shrinkage:
Recall that Naive Bayes classification is based on log-odds. For a word $w$ distinguishing class 1 from class 2:
$$\text{log-odds} = \log P(w | c_1) - \log P(w | c_2)$$
Without smoothing:
$$\text{log-odds}_{\text{MLE}} = \log \frac{n_{w,1}/N_1}{n_{w,2}/N_2} = \log \frac{n_{w,1}}{n_{w,2}} + \log \frac{N_2}{N_1}$$
With smoothing:
$$\text{log-odds}_{\alpha} = \log \frac{(n_{w,1} + \alpha) / (N_1 + \alpha|V|)}{(n_{w,2} + \alpha) / (N_2 + \alpha|V|)}$$
The smoothed log-odds are shrunk toward zero compared to MLE. This is a form of regularization—it reduces the influence of any single word on the classification decision.
| Scenario | MLE Log-Odds | Smoothed Log-Odds (α=1) | Shrinkage |
|---|---|---|---|
| Word in c₁ only: n₁=10, n₂=0 | +∞ | log(11/1) = 2.40 | From ∞ to 2.40 |
| Strong indicator: n₁=100, n₂=5 | log(20) = 3.00 | log(101/6) = 2.82 | 6% reduction |
| Weak indicator: n₁=15, n₂=10 | log(1.5) = 0.41 | log(16/11) = 0.37 | 8% reduction |
| Rare word: n₁=2, n₂=1 | log(2) = 0.69 | log(3/2) = 0.41 | 41% reduction |
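The table's numbers can be reproduced in a few lines. This minimal check assumes equal total word counts in the two classes, so the denominators cancel out of the log-odds:

```python
import math


def log_odds(n1: int, n2: int, alpha: float = 0.0) -> float:
    """Log-odds of a word between two classes, assuming equal class totals."""
    if alpha == 0 and (n1 == 0 or n2 == 0):
        return math.inf if n2 == 0 else -math.inf
    return math.log((n1 + alpha) / (n2 + alpha))


for n1, n2 in [(10, 0), (100, 5), (15, 10), (2, 1)]:
    mle = log_odds(n1, n2)
    smoothed = log_odds(n1, n2, alpha=1.0)
    print(f"n1={n1:>3d}, n2={n2:>2d}:  MLE={mle:>6.2f}  smoothed(α=1)={smoothed:.2f}")
```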
Key Observations:
Infinite log-odds become finite: Words appearing in only one class no longer have infinite influence.
Rare words are shrunk more: A word seen twice vs. once has its log-odds reduced by 41%, while a word seen 100 vs. 5 times is reduced by only 6%.
Common words are stable: Well-estimated probabilities change little.
Connection to Regularization:
This shrinkage effect is analogous to L2 regularization in logistic regression. Just as L2 regularization shrinks weights toward zero, smoothing shrinks log-odds toward zero. Both prevent the model from being overly confident based on limited evidence.
In fact, for large datasets, Naive Bayes with Laplace smoothing and logistic regression with L2 regularization often produce remarkably similar decision boundaries!
The connection between smoothing and regularization runs deep. In the limit of infinite data, Laplace smoothing approaches MLE. With finite data, the smoothing parameter α controls the bias-variance tradeoff: higher α → more bias (toward uniform), less variance (more stable estimates). This is exactly the tradeoff controlled by the regularization strength λ in penalized regression.
While Laplace smoothing is the standard, several advanced techniques exist for specialized applications.
Good-Turing Smoothing:
Good-Turing smoothing adjusts probabilities based on the frequency of frequencies. It estimates the probability of unseen events using the count of events seen exactly once.
For words with count $c$, the adjusted count is:
$$c^* = (c + 1) \frac{N_{c+1}}{N_c}$$
where $N_c$ is the number of distinct words appearing exactly $c$ times.
Intuition: If many words appear exactly once, there's likely substantial probability mass for words we haven't seen at all.
Kneser-Ney Smoothing:
Popular in language modeling, Kneser-Ney uses a different approach: absolute discounting combined with distribution over contexts.
$$P_{KN}(w | c) = \frac{\max(n_{w,c} - d, 0)}{N_c} + \lambda_c P_{\text{continuation}}(w)$$
where $d$ is a fixed discount (typically 0.75) and $P_{\text{continuation}}$ uses how many different contexts a word appears in rather than raw count.
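Kneser-Ney is defined for n-gram language models rather than for class-conditional bag-of-words estimates, so the sketch below only illustrates the idea on a toy bigram model; the corpus, the 0.75 discount, and the helper names are all illustrative assumptions, not a reference implementation.

```python
from collections import Counter, defaultdict


def interpolated_kneser_ney(tokens, discount=0.75):
    """Return a function computing P_KN(w2 | w1) for a toy bigram model."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter(tokens[:-1])

    # Continuation statistics: distinct left-contexts per word, distinct followers per context
    left_contexts = defaultdict(set)
    followers = defaultdict(set)
    for w1, w2 in bigram_counts:
        left_contexts[w2].add(w1)
        followers[w1].add(w2)
    total_bigram_types = len(bigram_counts)

    def prob(w2, w1):
        p_continuation = len(left_contexts[w2]) / total_bigram_types
        if context_counts[w1] == 0:
            return p_continuation  # unseen context: back off entirely
        discounted = max(bigram_counts[(w1, w2)] - discount, 0) / context_counts[w1]
        lam = discount * len(followers[w1]) / context_counts[w1]
        return discounted + lam * p_continuation

    return prob


p = interpolated_kneser_ney("the cat sat on the mat the cat ran".split())
print(p("cat", "the"), p("mat", "the"))
# Note: a word never seen anywhere still gets probability 0 here; practical
# implementations typically interpolate the lowest order with a uniform floor.
```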
Feature-Specific Smoothing:
For text classification, different word types may benefit from different smoothing strengths, so α can in principle be tuned per feature group rather than set globally.
The sketch below implements Good-Turing and absolute discounting and compares them against MLE and Laplace on the same counts:

```python
import math
from collections import Counter
from typing import Dict, List


class GoodTuringSmoothing:
    """
    Good-Turing smoothing for probability estimation.
    Adjusts counts based on the frequency of frequencies.
    """

    def __init__(self, word_counts: Dict[str, int], vocab_size: int):
        """
        Args:
            word_counts: Dictionary mapping words to their counts
            vocab_size: Total vocabulary size (including unseen)
        """
        self.word_counts = word_counts
        self.vocab_size = vocab_size
        self.total_count = sum(word_counts.values())

        # Compute frequency of frequencies (N_r = count of words with count r)
        count_of_counts = Counter(word_counts.values())
        self.N_r = dict(count_of_counts)

        # Number of unseen words (words with count 0)
        n_seen = len(word_counts)
        self.N_r[0] = vocab_size - n_seen

    def adjusted_count(self, c: int) -> float:
        """
        Compute Good-Turing adjusted count for a word with count c.
        c* = (c + 1) * N_{c+1} / N_c
        """
        if c == 0:
            # Probability mass for unseen words
            if 1 in self.N_r and self.N_r[0] > 0:
                return self.N_r[1] / self.N_r[0]
            return 1 / (self.vocab_size + 1)  # Fallback

        if c + 1 not in self.N_r:
            # No words with count c+1; fall back to the raw count
            return c  # Or use linear interpolation

        return (c + 1) * self.N_r[c + 1] / self.N_r.get(c, 1)

    def probability(self, word: str) -> float:
        """Compute Good-Turing smoothed probability."""
        c = self.word_counts.get(word, 0)
        c_star = self.adjusted_count(c)
        # Simple renormalization by the original total (adjusted counts only
        # approximately sum to it, which is fine for this demonstration)
        return c_star / self.total_count


class AbsoluteDiscounting:
    """
    Absolute discounting smoothing.
    Subtracts a fixed discount from all positive counts
    and redistributes the mass to unseen words.
    """

    def __init__(self, word_counts: Dict[str, int], vocab_size: int,
                 discount: float = 0.75):
        self.word_counts = word_counts
        self.vocab_size = vocab_size
        self.discount = discount
        self.total_count = sum(word_counts.values())

        # Compute backoff probability mass
        n_positive = len([c for c in word_counts.values() if c > 0])
        self.backoff_mass = self.discount * n_positive / self.total_count

        # Number of unseen words
        self.n_unseen = vocab_size - len(word_counts)

    def probability(self, word: str) -> float:
        """Compute absolute discounting probability."""
        c = self.word_counts.get(word, 0)

        if c > 0:
            # Discount positive counts
            return max(c - self.discount, 0) / self.total_count
        else:
            # Distribute backoff mass uniformly to unseen words
            if self.n_unseen > 0:
                return self.backoff_mass / self.n_unseen
            return 0


def compare_smoothing_methods():
    """Compare different smoothing approaches."""
    # Sample word counts (simulated from a class)
    word_counts = {
        'the': 100, 'is': 50, 'great': 30, 'product': 25,
        'excellent': 10, 'fantastic': 5, 'wonderful': 3,
        'superb': 2, 'magnificent': 1, 'outstanding': 1,
    }
    vocab_size = 1000  # Assume 1000-word vocabulary
    total = sum(word_counts.values())

    # Initialize smoothers
    gt = GoodTuringSmoothing(word_counts, vocab_size)
    ad = AbsoluteDiscounting(word_counts, vocab_size, discount=0.75)

    print("SMOOTHING METHOD COMPARISON")
    print("=" * 80)
    print(f"Total observed words: {total}")
    print(f"Vocabulary size: {vocab_size}")
    print(f"Seen words: {len(word_counts)}, Unseen: {vocab_size - len(word_counts)}")
    print()

    # Compare for specific words
    test_words = ['the', 'product', 'superb', 'magnificent', '(unseen)']

    print(f"{'Word':15s} {'Count':>8s} {'MLE':>10s} {'Laplace':>10s} {'G-T':>10s} {'Abs Disc':>10s}")
    print("-" * 70)

    for word in test_words:
        c = word_counts.get(word, 0)
        mle = c / total if c > 0 else 0
        laplace = (c + 1) / (total + vocab_size)
        gt_prob = gt.probability(word if word != '(unseen)' else 'xyz_unseen')
        ad_prob = ad.probability(word if word != '(unseen)' else 'xyz_unseen')
        print(f"{word:15s} {c:>8d} {mle:>10.6f} {laplace:>10.6f} {gt_prob:>10.6f} {ad_prob:>10.6f}")

    print()
    print("Key observations:")
    print("- MLE gives 0 to unseen words (problematic)")
    print("- Laplace gives uniform mass to all unseen words")
    print("- Good-Turing bases unseen estimate on words seen once")
    print("- Absolute discounting is less aggressive than Laplace")


compare_smoothing_methods()
```

For standard text classification, Laplace smoothing is almost always sufficient. Advanced methods like Good-Turing or Kneser-Ney are primarily used in language modeling (predicting next words) where the full probability distribution matters, not just classification decisions. If you're building a production classifier, stick with Laplace or Lidstone unless you have specific evidence that advanced smoothing helps your task.
Smoothing is not just a technical fix—it's a fundamental component of robust probabilistic text classification.
What's Next:
In the final page of this module, we bring everything together with a comprehensive comparison of Multinomial and Bernoulli Naive Bayes. We'll provide practical decision criteria, empirical comparisons across domains, and guidance for choosing the right model for your application.
You now understand Laplace smoothing at a deep level—its necessity, mechanics, Bayesian foundations, and practical application. You can implement multiple smoothing strategies, choose appropriate smoothing parameters, and understand how smoothing affects model behavior. This knowledge is essential for building robust probabilistic classifiers.