Every day, billions of text documents flow through computational systems: emails sorted into spam and inbox, customer reviews categorized by sentiment, news articles routed to appropriate topics, and support tickets triaged by urgency. Behind these seemingly magical systems lies a surprisingly elegant mathematical framework that treats documents as probability distributions and classifies them using centuries-old principles discovered by Reverend Thomas Bayes.
Text classification is the task of automatically assigning predefined categories to text documents based on their content. Despite the advent of deep learning and transformer architectures, Naive Bayes classifiers remain remarkably effective for text classification—often matching or exceeding the performance of far more complex models while requiring orders of magnitude less computation and training data.
By the end of this page, you will understand the fundamental principles of text classification, how documents are represented mathematically for machine learning, the bag-of-words model and its variants, why Naive Bayes excels at text classification, and the practical considerations that make probabilistic text classifiers indispensable in production systems.
Text classification is a supervised learning problem where the goal is to learn a function that maps documents to categories. Given a training set of labeled documents, we seek to build a classifier that can accurately predict the category of previously unseen documents.
Formal Definition:
Let $\mathcal{D} = \{d_1, d_2, \ldots, d_n\}$ be a collection of documents and $\mathcal{C} = \{c_1, c_2, \ldots, c_k\}$ be a fixed set of categories. The text classification task is to learn a classifier $\gamma: \mathcal{D} \rightarrow \mathcal{C}$ that assigns the correct category to each document.
More formally, given a document $d$ represented as a feature vector $\mathbf{x}$, we want to find:
$$\hat{c} = \arg\max_{c \in \mathcal{C}} P(c | \mathbf{x})$$
This is where Bayes' theorem becomes essential. Rather than directly modeling $P(c|\mathbf{x})$—which would require understanding the complex conditional distributions of text given categories—we can apply Bayes' theorem to invert the problem:
$$P(c | \mathbf{x}) = \frac{P(\mathbf{x} | c) P(c)}{P(\mathbf{x})}$$
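To make the inversion concrete, here is a minimal numeric sketch (the priors and likelihoods are invented purely for illustration): the evidence $P(\mathbf{x})$ rescales the posteriors, but it never changes which class wins the argmax.

```python
# Hypothetical numbers for a single document's features x
priors = {"spam": 0.3, "ham": 0.7}           # P(c), assumed class priors
likelihoods = {"spam": 0.02, "ham": 0.001}   # P(x | c), assumed likelihoods

# Unnormalized posteriors: P(c | x) is proportional to P(x | c) P(c)
scores = {c: likelihoods[c] * priors[c] for c in priors}

# Dividing by the evidence P(x) normalizes the scores
# but leaves the winning class unchanged
evidence = sum(scores.values())
posteriors = {c: s / evidence for c, s in scores.items()}

print(posteriors)                   # roughly {'spam': 0.90, 'ham': 0.10}
print(max(scores, key=scores.get))  # 'spam'
```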
| Task | Categories | Example Applications | Key Challenges |
|---|---|---|---|
| Spam Detection | Binary (spam/ham) | Email filtering, SMS spam, comment moderation | Adversarial adaptation, evolving language |
| Sentiment Analysis | Binary/Multi-class | Product reviews, social media monitoring | Sarcasm, negation, domain specificity |
| Topic Classification | Multi-class | News categorization, document routing | Hierarchical topics, multi-label scenarios |
| Intent Detection | Multi-class | Chatbots, virtual assistants, IVR systems | Short texts, similar intents, context |
| Language Identification | Multi-class | Multilingual systems, content routing | Similar languages, code-switching |
| Author Attribution | Multi-class | Forensic linguistics, plagiarism detection | Style variation, limited training data |
While deep learning models like BERT and GPT achieve state-of-the-art results on many NLP benchmarks, Naive Bayes remains valuable for several reasons: (1) it requires far less training data—often performing well with just hundreds of examples, (2) training is extremely fast—seconds rather than hours, (3) it's highly interpretable—you can examine which words drive predictions, (4) it scales linearly with vocabulary size, and (5) it provides calibrated probability estimates out of the box. For many practical applications, especially those with limited data or computational resources, Naive Bayes is the optimal choice.
Before we can apply any machine learning algorithm to text, we must transform documents from variable-length sequences of characters into fixed-dimension numerical representations. The most fundamental approach is the bag-of-words (BoW) model, which represents each document as a vector of word frequencies.
The Bag-of-Words Assumption:
The bag-of-words model makes a critical simplifying assumption: the order of words in a document does not matter. A document is treated as an unordered collection (or "bag") of words, where only the frequency of each word is retained.
Consider the sentence: "The cat sat on the mat"
In the bag-of-words representation, this becomes the multiset of word counts {the: 2, cat: 1, sat: 1, on: 1, mat: 1}.
The same representation would apply to "The mat sat on the cat"—clearly a different meaning, but identical BoW vectors. Despite this loss of information, the bag-of-words model works remarkably well for text classification because category-relevant information is often conveyed by the presence and frequency of specific words rather than their precise arrangement.
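A quick check with Python's `Counter` confirms that the two sentences collapse to exactly the same bag of words:

```python
from collections import Counter

# Word order is discarded; only per-word counts survive
bow_1 = Counter("the cat sat on the mat".lower().split())
bow_2 = Counter("the mat sat on the cat".lower().split())

print(bow_1)           # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
print(bow_1 == bow_2)  # True: identical representations despite different meanings
```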
```python
import numpy as np
from collections import Counter
from typing import List, Dict, Tuple


class BagOfWordsVectorizer:
    """
    A complete implementation of Bag-of-Words vectorization.

    This class transforms text documents into numerical feature
    vectors using word frequency counts.
    """

    def __init__(self, min_df: int = 1, max_df: float = 1.0):
        """
        Initialize the vectorizer.

        Args:
            min_df: Minimum document frequency for a word to be included
            max_df: Maximum document frequency (as fraction) for inclusion
        """
        self.vocabulary_: Dict[str, int] = {}
        self.idf_: np.ndarray = None
        self.min_df = min_df
        self.max_df = max_df

    def _tokenize(self, document: str) -> List[str]:
        """
        Simple whitespace tokenization with lowercasing.
        In production, you'd use more sophisticated tokenization.
        """
        # Convert to lowercase and split on whitespace
        tokens = document.lower().split()
        # Remove punctuation (simplified)
        tokens = [token.strip('.,!?;:"\'()[]{}') for token in tokens]
        # Remove empty tokens
        return [token for token in tokens if token]

    def fit(self, documents: List[str]) -> 'BagOfWordsVectorizer':
        """
        Learn vocabulary from training documents.

        Args:
            documents: List of text documents

        Returns:
            self for method chaining
        """
        n_docs = len(documents)

        # Count document frequency for each word
        doc_freq = Counter()
        for doc in documents:
            tokens = set(self._tokenize(doc))  # Unique tokens per doc
            doc_freq.update(tokens)

        # Filter by document frequency thresholds, then assign
        # contiguous indices to the words that survive the filter
        max_count = int(self.max_df * n_docs)
        kept_words = [
            word for word, count in doc_freq.items()
            if self.min_df <= count <= max_count
        ]
        self.vocabulary_ = {word: idx for idx, word in enumerate(kept_words)}

        # Compute IDF weights for TF-IDF (optional enhancement)
        self.idf_ = np.array([
            np.log((n_docs + 1) / (doc_freq[word] + 1)) + 1
            for word in self.vocabulary_
        ])

        return self

    def transform(self, documents: List[str]) -> np.ndarray:
        """
        Transform documents to BoW feature vectors.

        Args:
            documents: List of text documents

        Returns:
            Document-term matrix of shape (n_docs, n_features)
        """
        n_docs = len(documents)
        n_features = len(self.vocabulary_)

        # Initialize sparse-like dense matrix (for simplicity)
        X = np.zeros((n_docs, n_features), dtype=np.float64)

        for i, doc in enumerate(documents):
            tokens = self._tokenize(doc)
            token_counts = Counter(tokens)
            for token, count in token_counts.items():
                if token in self.vocabulary_:
                    j = self.vocabulary_[token]
                    X[i, j] = count

        return X

    def fit_transform(self, documents: List[str]) -> np.ndarray:
        """Fit and transform in one step."""
        return self.fit(documents).transform(documents)


# Example usage demonstrating the complete pipeline
documents = [
    "The quick brown fox jumps over the lazy dog",
    "The lazy dog sleeps all day",
    "Quick brown foxes are faster than lazy dogs",
    "The dog chased the cat across the yard"
]

vectorizer = BagOfWordsVectorizer(min_df=1)
X = vectorizer.fit_transform(documents)

print("Vocabulary:")
for word, idx in sorted(vectorizer.vocabulary_.items(), key=lambda x: x[1]):
    print(f"  {word}: index {idx}")

print(f"\nDocument-term matrix shape: {X.shape}")
print(f"\nFirst document vector (non-zero entries):")
for word, idx in vectorizer.vocabulary_.items():
    if X[0, idx] > 0:
        print(f"  {word}: {X[0, idx]}")
```

Vocabulary Construction:
The vocabulary $V$ is the set of all unique words (or tokens) that appear in the training corpus. Each document is then represented as a vector $\mathbf{x} \in \mathbb{R}^{|V|}$, where $x_i$ represents the count (or presence) of word $i$ in the document.
For a vocabulary of size $|V| = 10,000$ words, each document becomes a 10,000-dimensional vector. Most entries are zero—document vectors are sparse—because any given document uses only a small fraction of the total vocabulary.
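A toy illustration of this sparsity (the vocabulary size and distinct-word count are assumptions chosen for the example):

```python
import numpy as np

# A document using 120 distinct words fills at most 120 of the
# 10,000 entries in its bag-of-words vector
vocab_size = 10_000
x = np.zeros(vocab_size)
rng = np.random.default_rng(0)
x[rng.choice(vocab_size, size=120, replace=False)] = 1  # stand-in counts

sparsity = 1 - np.count_nonzero(x) / vocab_size
print(f"Non-zero entries: {np.count_nonzero(x)}, sparsity: {sparsity:.1%}")  # 98.8%
```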
Key Design Decisions in BoW:
Building a BoW representation involves several choices visible in the code above: how to tokenize and normalize text (lowercasing, punctuation stripping, stopword removal), which words to keep in the vocabulary (minimum and maximum document-frequency thresholds such as min_df and max_df), and how to weight each word (raw counts, binary presence, or TF-IDF), which is the subject of the next section.
The raw bag-of-words representation uses word counts as features, but several variants exist that can improve classification performance. Understanding these representations is crucial because they connect directly to different Naive Bayes model variants.
Three Fundamental Representations:
| Representation | Feature Value | Mathematical Form | Best Used For |
|---|---|---|---|
| Term Frequency (TF) | Raw word count | $tf(w, d) = count(w, d)$ | Multinomial Naive Bayes, when word frequency matters |
| Binary | Word presence (0/1) | $x_w = \mathbb{1}[w \in d]$ | Bernoulli Naive Bayes, short documents |
| TF-IDF | Frequency × rarity | $tfidf(w, d) = tf(w, d) \cdot \log\frac{N}{df(w)}$ | When rare words are more informative |
Term Frequency (TF) — The Multinomial Approach:
The simplest approach counts how many times each word appears in each document. If document $d$ contains the word "machine" 5 times, then $tf(\text{machine}, d) = 5$.
This representation assumes that word frequency conveys information—a document mentioning "profit" ten times is more likely to be about business than one mentioning it once. This is the foundation of the Multinomial Naive Bayes model.
Binary (Presence/Absence) — The Bernoulli Approach:
An alternative representation simply records whether each word is present or absent:
$$x_w = \begin{cases} 1 & \text{if word } w \text{ appears in document } d \\ 0 & \text{otherwise} \end{cases}$$
This representation ignores word frequency, treating a word that appears once identically to one appearing a hundred times. Surprisingly, this often works well for short documents where word repetition is rare. This is the foundation of the Bernoulli Naive Bayes model.
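A small sketch contrasting the two feature types on a made-up document:

```python
from collections import Counter

doc = "free free free prize prize winner"
counts = Counter(doc.split())

# Multinomial-style features keep the counts;
# Bernoulli-style features record only presence or absence
tf_features = dict(counts)                # {'free': 3, 'prize': 2, 'winner': 1}
binary_features = {w: 1 for w in counts}  # {'free': 1, 'prize': 1, 'winner': 1}

print(tf_features)
print(binary_features)
```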
TF-IDF — Inverse Document Frequency Weighting:
TF-IDF combines term frequency with a measure of how rare a word is across the corpus:
$$\text{tfidf}(w, d) = tf(w, d) \times \log\frac{N}{df(w)}$$
Where $df(w)$ is the document frequency (number of documents containing word $w$) and $N$ is the total number of documents.
Intuition: Words that appear frequently in one document but rarely across the corpus are likely to be highly discriminative. The word "the" might appear 50 times, but it appears in every document, so $\log(N/df) \approx 0$ and its TF-IDF weight is low. Conversely, a technical term like "mitochondria" might appear only in biology documents, giving it a high IDF weight.
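The arithmetic behind this intuition, using assumed document frequencies over a hypothetical corpus of $N = 1000$ documents:

```python
import math

N = 1000                                # assumed corpus size
df = {"the": 1000, "mitochondria": 5}   # assumed document frequencies

for word, d in df.items():
    idf = math.log(N / d)
    print(f"{word:15s} df={d:4d}  idf={idf:.2f}")

# 'the': idf = log(1000/1000) = 0.00, so even 50 occurrences contribute nothing
# 'mitochondria': idf = log(1000/5) = 5.30, so a single occurrence carries real weight
```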
```python
import numpy as np
from collections import Counter
from typing import List, Dict


class TFIDFVectorizer:
    """
    TF-IDF Vectorizer with multiple normalization options.

    Demonstrates the mathematical transformation from raw counts
    to weighted representations that emphasize discriminative words.
    """

    def __init__(self, norm: str = 'l2', use_idf: bool = True,
                 smooth_idf: bool = True, sublinear_tf: bool = False):
        """
        Args:
            norm: Normalization method ('l1', 'l2', or None)
            use_idf: Whether to apply IDF weighting
            smooth_idf: Add 1 to document frequencies (prevents division by zero)
            sublinear_tf: Apply sublinear TF scaling (1 + log(tf))
        """
        self.norm = norm
        self.use_idf = use_idf
        self.smooth_idf = smooth_idf
        self.sublinear_tf = sublinear_tf
        self.vocabulary_: Dict[str, int] = {}
        self.idf_: np.ndarray = None

    def _tokenize(self, text: str) -> List[str]:
        """Simple tokenization."""
        return text.lower().split()

    def fit(self, documents: List[str]) -> 'TFIDFVectorizer':
        """Learn vocabulary and IDF weights from documents."""
        n_docs = len(documents)

        # Build vocabulary and count document frequencies
        doc_freq = Counter()
        word_to_idx = {}
        idx = 0
        for doc in documents:
            unique_tokens = set(self._tokenize(doc))
            for token in unique_tokens:
                if token not in word_to_idx:
                    word_to_idx[token] = idx
                    idx += 1
            doc_freq.update(unique_tokens)

        self.vocabulary_ = word_to_idx
        n_features = len(word_to_idx)

        # Compute IDF weights
        if self.use_idf:
            # idf(t) = log[(N + 1) / (df(t) + 1)] + 1   (smoothed)
            # or idf(t) = log[N / df(t)] + 1            (unsmoothed)
            df = np.ones(n_features)  # Default 1 for smoothing
            for word, count in doc_freq.items():
                df[word_to_idx[word]] = count

            if self.smooth_idf:
                self.idf_ = np.log((n_docs + 1) / (df + 1)) + 1
            else:
                self.idf_ = np.log(n_docs / df) + 1
        else:
            self.idf_ = np.ones(n_features)

        return self

    def transform(self, documents: List[str]) -> np.ndarray:
        """Transform documents to TF-IDF representation."""
        n_docs = len(documents)
        n_features = len(self.vocabulary_)

        # Compute term frequencies
        X = np.zeros((n_docs, n_features))
        for i, doc in enumerate(documents):
            tokens = self._tokenize(doc)
            counts = Counter(tokens)
            for token, count in counts.items():
                if token in self.vocabulary_:
                    j = self.vocabulary_[token]
                    if self.sublinear_tf:
                        # Sublinear scaling: 1 + log(tf) if tf > 0
                        X[i, j] = 1 + np.log(count) if count > 0 else 0
                    else:
                        X[i, j] = count

        # Apply IDF weighting
        X = X * self.idf_

        # Apply normalization
        if self.norm == 'l2':
            # L2 normalization: divide each row by its L2 norm
            norms = np.sqrt(np.sum(X ** 2, axis=1, keepdims=True))
            norms[norms == 0] = 1  # Prevent division by zero
            X = X / norms
        elif self.norm == 'l1':
            # L1 normalization: divide each row by its L1 norm
            norms = np.sum(np.abs(X), axis=1, keepdims=True)
            norms[norms == 0] = 1
            X = X / norms

        return X

    def fit_transform(self, documents: List[str]) -> np.ndarray:
        """Fit and transform in one step."""
        return self.fit(documents).transform(documents)


# Demonstrate how TF-IDF distinguishes discriminative words
documents = [
    "the cat sat on the mat",   # Common words dominate
    "the dog sat on the rug",   # Similar structure
    "machine learning is transforming artificial intelligence",
    "deep learning neural networks artificial intelligence"
]

vectorizer = TFIDFVectorizer(norm='l2', smooth_idf=True)
X = vectorizer.fit_transform(documents)

print("TF-IDF Analysis:")
print("-" * 50)
for word, idx in sorted(vectorizer.vocabulary_.items(),
                        key=lambda x: vectorizer.idf_[x[1]],
                        reverse=True)[:10]:
    print(f"  {word:20s} IDF: {vectorizer.idf_[idx]:.4f}")

print(f"\nDocument similarity matrix (cosine similarity):")
similarity = X @ X.T  # Cosine similarity (vectors are L2-normalized)
print(similarity.round(3))
```

Given the simplicity of the Naive Bayes algorithm and its strong independence assumption—which is clearly violated in natural language—it's remarkable that it works so well for text classification. Understanding why requires examining both the nature of text data and the properties of the classification task.
The Independence Assumption Revisited:
Naive Bayes assumes that features (words) are conditionally independent given the class:
$$P(\mathbf{x} | c) = P(x_1, x_2, \ldots, x_n | c) = \prod_{i=1}^{n} P(x_i | c)$$
This assumption is clearly false for natural language. The word "learning" is far more likely to appear if "machine" has already appeared. The word "intelligence" correlates strongly with "artificial." Word dependencies are the very fabric of language.
So why does Naive Bayes still work?
Pedro Domingos and Michael Pazzani's seminal 1997 paper 'On the Optimality of the Simple Bayesian Classifier under Zero-One Loss' showed that the independence assumption does not need to hold for Naive Bayes to make the optimal decision. The classifier only needs the correct class to receive the highest posterior score; the probability estimates themselves can be poorly calibrated. This theoretical insight explains why Naive Bayes often outperforms sophisticated methods that explicitly model feature dependencies.
Practical Advantages for Text:
Beyond the theoretical justification, Naive Bayes offers compelling practical advantages for text classification:
| Advantage | Why It Matters | Comparison to Alternatives |
|---|---|---|
| Linear Training Time | Training requires a single pass through the data, counting word occurrences per class | SVMs: O(n²) to O(n³); Neural networks: multiple epochs over data |
| Works with Small Data | Good performance with hundreds of labeled examples due to strong generative assumptions | Deep learning often requires 10,000+ examples for NLP tasks |
| No Hyperparameter Tuning | Only smoothing parameter to set (Laplace correction α, typically 1) | Neural networks have learning rate, architecture, regularization, etc. |
| Incremental Learning | Can update model with new documents without full retraining | Most discriminative models require complete retraining |
| Interpretable | P(word|class) directly shows which words drive predictions | Neural network decisions are opaque without specialized techniques |
| Handles High Dimensions | Works naturally with vocabularies of 100,000+ words | Many algorithms struggle with such high dimensionality |
Naive Bayes for text classification takes a generative approach: rather than directly modeling which category a document belongs to, it models how documents are generated from each category. This probabilistic story provides both intuition and a framework for deriving concrete algorithms.
The Generative Story:
Imagine you want to write a document about machine learning. According to the Naive Bayes generative model, you would first select a category $c$ according to the class prior $P(c)$, and then generate the document's words from the class-conditional distribution $P(w \mid c)$.
For the Multinomial model, you generate each word position by sampling from the multinomial distribution over the vocabulary. For the Bernoulli model, you flip a coin for each vocabulary word to decide whether to include it.
The Classification Decision:
Given an observed document $d$, we apply Bayes' theorem to compute the posterior probability of each class:
$$P(c | d) = \frac{P(d | c) P(c)}{P(d)} \propto P(d | c) P(c)$$
Since $P(d)$ is constant across all classes (it's the same document!), we can ignore it for classification purposes. We classify to the class that maximizes:
$$\hat{c} = \arg\max_{c} P(c) P(d | c)$$
```python
import numpy as np
from collections import defaultdict
from typing import List, Dict, Tuple


class GenerativeTextModel:
    """
    Illustrates the generative model perspective of Naive Bayes.

    This class shows how we can:
    1. Estimate P(c) and P(w|c) from training data
    2. Generate synthetic documents from learned distributions
    3. Classify new documents using Bayes' rule

    This demonstrates that classification and generation are
    two sides of the same probabilistic model.
    """

    def __init__(self, smoothing: float = 1.0):
        self.smoothing = smoothing
        self.class_priors: Dict[str, float] = {}
        self.word_probs: Dict[str, Dict[str, float]] = {}
        self.vocabulary: List[str] = []
        self.classes: List[str] = []

    def fit(self, documents: List[str], labels: List[str]):
        """Estimate model parameters from labeled documents."""
        # Count class frequencies
        class_counts = defaultdict(int)
        for label in labels:
            class_counts[label] += 1

        n_docs = len(labels)
        self.classes = list(set(labels))

        # Estimate class priors: P(c) = count(c) / N
        self.class_priors = {
            c: count / n_docs for c, count in class_counts.items()
        }

        # Count word occurrences per class
        word_counts_per_class = defaultdict(lambda: defaultdict(int))
        all_words = set()

        for doc, label in zip(documents, labels):
            words = doc.lower().split()
            all_words.update(words)
            for word in words:
                word_counts_per_class[label][word] += 1

        self.vocabulary = sorted(all_words)
        vocab_size = len(self.vocabulary)

        # Estimate word probabilities: P(w|c)
        # Using Laplace smoothing: (count(w,c) + α) / (Σ count(w',c) + α|V|)
        self.word_probs = {}
        for c in self.classes:
            total_words = sum(word_counts_per_class[c].values())
            self.word_probs[c] = {
                word: (word_counts_per_class[c][word] + self.smoothing) /
                      (total_words + self.smoothing * vocab_size)
                for word in self.vocabulary
            }

        return self

    def generate_document(self, class_label: str, length: int = 20) -> str:
        """
        Generate a document from the learned distribution.
        This demonstrates the 'generative' aspect of the model.
        We sample words according to P(w|c).
        """
        if class_label not in self.classes:
            raise ValueError(f"Unknown class: {class_label}")

        # Sample words according to P(w|class_label)
        words = list(self.word_probs[class_label].keys())
        probs = list(self.word_probs[class_label].values())
        probs = np.array(probs) / np.sum(probs)  # Ensure normalized

        sampled_words = np.random.choice(words, size=length, p=probs)
        return ' '.join(sampled_words)

    def classify(self, document: str) -> Tuple[str, Dict[str, float]]:
        """
        Classify a document using Bayes' rule.
        Returns the predicted class and log-probabilities
        for all classes.
        """
        words = document.lower().split()
        log_posteriors = {}

        for c in self.classes:
            # Start with log prior
            log_prob = np.log(self.class_priors[c])

            # Add log-likelihood of each word
            for word in words:
                if word in self.word_probs[c]:
                    log_prob += np.log(self.word_probs[c][word])
                else:
                    # Handle unknown words with minimal probability
                    log_prob += np.log(
                        self.smoothing /
                        (sum(self.word_probs[c].values()) * len(self.vocabulary))
                    )

            log_posteriors[c] = log_prob

        # Convert to normalized probabilities for interpretation
        max_log = max(log_posteriors.values())
        posteriors = {
            c: np.exp(lp - max_log) for c, lp in log_posteriors.items()
        }
        total = sum(posteriors.values())
        posteriors = {c: p / total for c, p in posteriors.items()}

        predicted_class = max(log_posteriors, key=log_posteriors.get)
        return predicted_class, posteriors


# Demonstration: Training, Generation, and Classification
training_docs = [
    ("great product love it highly recommend", "positive"),
    ("excellent quality best purchase ever", "positive"),
    ("amazing value wonderful experience fantastic", "positive"),
    ("terrible waste of money avoid", "negative"),
    ("horrible quality worst experience ever", "negative"),
    ("disappointing product poor service bad", "negative"),
]

documents = [doc for doc, _ in training_docs]
labels = [label for _, label in training_docs]

model = GenerativeTextModel(smoothing=1.0)
model.fit(documents, labels)

print("Learned Class Priors:")
for c, prior in model.class_priors.items():
    print(f"  P({c}) = {prior:.4f}")

print("\nTop Words per Class:")
for c in model.classes:
    top_words = sorted(model.word_probs[c].items(),
                       key=lambda x: x[1], reverse=True)[:5]
    print(f"  {c}: {', '.join(w for w, _ in top_words)}")

print("\nGenerating Sample Documents:")
for c in model.classes:
    synthetic = model.generate_document(c, length=8)
    print(f"  {c}: '{synthetic}'")

print("\nClassifying New Documents:")
test_docs = [
    "this is an amazing product",
    "terrible quality very disappointed",
    "good value but could be better"
]
for doc in test_docs:
    pred, probs = model.classify(doc)
    print(f"  '{doc[:40]}...'")
    print(f"    -> {pred} (positive: {probs.get('positive', 0):.3f}, "
          f"negative: {probs.get('negative', 0):.3f})")
```

When computing P(d|c), we multiply probabilities for each word: P(w₁|c) × P(w₂|c) × ... × P(wₙ|c). For documents with hundreds of words, this product of small probabilities underflows to zero. The solution is to work in log-space: log P(d|c) = Σᵢ log P(wᵢ|c). Addition in log-space corresponds to multiplication in probability space, and the logarithm of a small probability remains numerically stable.
Building an effective text classifier requires more than just implementing the Naive Bayes algorithm. The preprocessing and feature engineering decisions often have as much impact as the choice of model. Here we present a complete pipeline from raw text to predictions.
The Pipeline Stages:
As the implementation below illustrates, the pipeline moves from cleaning raw text (stripping HTML and URLs, normalizing unicode), through tokenization and normalization (lowercasing, punctuation and stopword removal, length filtering), to vocabulary construction and per-class counting, and finally to prediction via sums of log-probabilities.
```python
import re
import string
import numpy as np
from collections import Counter, defaultdict
from typing import List, Dict, Tuple, Optional


class TextPreprocessor:
    """
    Complete text preprocessing pipeline.

    Handles the messy reality of text data: HTML, unicode,
    punctuation, case normalization, and more.
    """

    # Common English stopwords
    STOPWORDS = {
        'the', 'a', 'an', 'is', 'are', 'was', 'were', 'be', 'been', 'being',
        'have', 'has', 'had', 'do', 'does', 'did', 'will', 'would', 'could',
        'should', 'may', 'might', 'must', 'shall', 'can', 'need', 'dare',
        'ought', 'used', 'to', 'of', 'in', 'for', 'on', 'with', 'at', 'by',
        'from', 'as', 'into', 'through', 'during', 'before', 'after', 'above',
        'below', 'between', 'under', 'again', 'further', 'then', 'once',
        'here', 'there', 'when', 'where', 'why', 'how', 'all', 'each', 'few',
        'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only',
        'own', 'same', 'so', 'than', 'too', 'very', 'just', 'and', 'but',
        'if', 'or', 'because', 'until', 'while', 'this', 'that', 'these',
        'those', 'am', 'it', 'its', 'itself', 'i', 'me', 'my', 'myself',
        'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself',
        'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself',
        'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who'
    }

    def __init__(self, lowercase: bool = True,
                 remove_stopwords: bool = True,
                 remove_punctuation: bool = True,
                 remove_numbers: bool = False,
                 min_word_length: int = 2):
        self.lowercase = lowercase
        self.remove_stopwords = remove_stopwords
        self.remove_punctuation = remove_punctuation
        self.remove_numbers = remove_numbers
        self.min_word_length = min_word_length

    def clean_html(self, text: str) -> str:
        """Remove HTML tags."""
        return re.sub(r'<[^>]+>', ' ', text)

    def clean_urls(self, text: str) -> str:
        """Remove URLs."""
        return re.sub(r'http\S+|www\S+', ' ', text)

    def normalize_unicode(self, text: str) -> str:
        """Normalize unicode characters."""
        # Replace common unicode characters with ASCII equivalents
        replacements = {
            '“': '"', '”': '"', '‘': "'", '’': "'",
            '—': '-', '–': '-', '…': '...'
        }
        for unicode_char, ascii_char in replacements.items():
            text = text.replace(unicode_char, ascii_char)
        return text

    def process(self, text: str) -> List[str]:
        """Full preprocessing pipeline."""
        # Clean
        text = self.clean_html(text)
        text = self.clean_urls(text)
        text = self.normalize_unicode(text)

        # Lowercase
        if self.lowercase:
            text = text.lower()

        # Tokenize (simple whitespace + punctuation splitting)
        if self.remove_punctuation:
            text = text.translate(str.maketrans('', '', string.punctuation))

        tokens = text.split()

        # Filter
        if self.remove_numbers:
            tokens = [t for t in tokens if not t.isdigit()]
        if self.remove_stopwords:
            tokens = [t for t in tokens if t not in self.STOPWORDS]
        tokens = [t for t in tokens if len(t) >= self.min_word_length]

        return tokens


class NaiveBayesTextClassifier:
    """
    Complete Naive Bayes text classifier with preprocessing.

    Supports both Multinomial and Bernoulli variants.
    """

    def __init__(self, model_type: str = 'multinomial',
                 smoothing: float = 1.0,
                 preprocessor: Optional[TextPreprocessor] = None):
        """
        Args:
            model_type: 'multinomial' or 'bernoulli'
            smoothing: Laplace smoothing parameter
            preprocessor: TextPreprocessor instance (creates default if None)
        """
        self.model_type = model_type
        self.smoothing = smoothing
        self.preprocessor = preprocessor or TextPreprocessor()
        self.classes_: List[str] = []
        self.class_log_prior_: Dict[str, float] = {}
        self.feature_log_prob_: Dict[str, Dict[str, float]] = {}
        self.vocabulary_: Dict[str, int] = {}

    def fit(self, documents: List[str], labels: List[str]) -> 'NaiveBayesTextClassifier':
        """Fit the model on labeled documents."""
        # Preprocess all documents
        tokenized_docs = [self.preprocessor.process(doc) for doc in documents]

        # Build vocabulary
        all_words = set()
        for tokens in tokenized_docs:
            all_words.update(tokens)
        self.vocabulary_ = {word: idx for idx, word in enumerate(sorted(all_words))}
        vocab_size = len(self.vocabulary_)

        # Count classes
        self.classes_ = list(set(labels))
        class_counts = Counter(labels)
        n_docs = len(labels)

        # Compute log priors
        self.class_log_prior_ = {
            c: np.log(count / n_docs) for c, count in class_counts.items()
        }

        # Compute feature probabilities per class
        self.feature_log_prob_ = {}

        for c in self.classes_:
            # Get documents in this class
            class_docs = [
                tokens for tokens, label in zip(tokenized_docs, labels)
                if label == c
            ]

            if self.model_type == 'multinomial':
                # Count total word occurrences
                word_counts = Counter()
                for tokens in class_docs:
                    word_counts.update(tokens)
                total_words = sum(word_counts.values())

                # P(w|c) = (count(w,c) + α) / (Σcount + α|V|)
                self.feature_log_prob_[c] = {
                    word: np.log(
                        (word_counts.get(word, 0) + self.smoothing) /
                        (total_words + self.smoothing * vocab_size)
                    )
                    for word in self.vocabulary_
                }
            else:  # bernoulli
                # Count documents containing each word
                doc_word_counts = Counter()
                for tokens in class_docs:
                    doc_word_counts.update(set(tokens))  # Unique per doc
                n_class_docs = len(class_docs)

                # P(w|c) = (count_docs_with_w + α) / (n_docs_in_class + 2α)
                self.feature_log_prob_[c] = {}
                for word in self.vocabulary_:
                    p_word = ((doc_word_counts.get(word, 0) + self.smoothing) /
                              (n_class_docs + 2 * self.smoothing))
                    self.feature_log_prob_[c][word] = (
                        np.log(p_word),      # log P(w=1|c)
                        np.log(1 - p_word)   # log P(w=0|c)
                    )

        return self

    def predict_log_proba(self, document: str) -> Dict[str, float]:
        """Compute log probabilities for each class."""
        tokens = self.preprocessor.process(document)
        log_probs = {}

        for c in self.classes_:
            log_prob = self.class_log_prior_[c]

            if self.model_type == 'multinomial':
                # Sum log probabilities of observed words
                for token in tokens:
                    if token in self.feature_log_prob_[c]:
                        log_prob += self.feature_log_prob_[c][token]
            else:  # bernoulli
                # Consider presence AND absence of vocabulary words
                observed_words = set(tokens)
                for word in self.vocabulary_:
                    log_p_present, log_p_absent = self.feature_log_prob_[c][word]
                    if word in observed_words:
                        log_prob += log_p_present
                    else:
                        log_prob += log_p_absent

            log_probs[c] = log_prob

        return log_probs

    def predict(self, document: str) -> str:
        """Predict the class of a document."""
        log_probs = self.predict_log_proba(document)
        return max(log_probs, key=log_probs.get)

    def predict_batch(self, documents: List[str]) -> List[str]:
        """Predict classes for multiple documents."""
        return [self.predict(doc) for doc in documents]


# Example: Spam Classification
print("=" * 60)
print("SPAM CLASSIFICATION EXAMPLE")
print("=" * 60)

training_data = [
    ("Get rich quick! Make money fast with this amazing offer", "spam"),
    ("Congratulations! You've won a free lottery prize click now", "spam"),
    ("LIMITED TIME: Buy now and get 90% discount on watches", "spam"),
    ("Your order has been shipped and will arrive Tuesday", "ham"),
    ("Meeting scheduled for 3pm tomorrow in conference room B", "ham"),
    ("Please review the attached quarterly report before Friday", "ham"),
    ("Free trial! No credit card required claim your prize", "spam"),
    ("Your password has been reset successfully", "ham"),
]

train_docs = [doc for doc, _ in training_data]
train_labels = [label for _, label in training_data]

# Train classifier
clf = NaiveBayesTextClassifier(model_type='multinomial', smoothing=1.0)
clf.fit(train_docs, train_labels)

# Test on new documents
test_docs = [
    "Claim your free prize now! Limited time offer",
    "Please find the meeting notes attached",
    "You've been selected as a lucky winner click here",
    "The project deadline has been extended to next week"
]

print("\nPredictions:")
for doc in test_docs:
    pred = clf.predict(doc)
    log_probs = clf.predict_log_proba(doc)
    print(f"  '{doc[:50]}...'")
    print(f"    -> {pred.upper()} (log-odds: {log_probs['spam'] - log_probs['ham']:.2f})")
```

We have established the foundational concepts that underpin probabilistic text classification. These concepts are essential for understanding the specific Naive Bayes variants we'll explore in the following pages.
What's Next:
With the foundations in place, we're ready to dive deep into the two dominant Naive Bayes variants for text: Multinomial Naive Bayes, which models word counts drawn from a per-class distribution over the vocabulary, and Bernoulli Naive Bayes, which models the presence or absence of each vocabulary word.
You now understand the fundamental principles of text classification and why Naive Bayes is particularly well-suited for this domain. In the next page, we'll derive the Multinomial Naive Bayes model from first principles and see how it models documents as samples from a categorical distribution over the vocabulary.