Imagine you're a wildlife biologist studying two species of birds: robins and blue jays. To classify a new bird, you could take two fundamentally different approaches:
Approach 1: Study everything about each species—their size distributions, color patterns, beak shapes, migration habits, preferred habitats. You become an expert on what makes a robin a robin and what makes a blue jay a blue jay. When a new bird appears, you ask: "Which species is more likely to have produced a bird like this?"
Approach 2: Simply collect examples of correctly labeled birds and find the boundary that best separates them. You never deeply understand either species—you just learn where to draw the line.
This fundamental choice defines one of the deepest divides in machine learning: generative versus discriminative modeling. In this page, we dive deep into the first approach—generative models—and develop the mathematical machinery that makes them work.
By the end of this page, you will understand: (1) The mathematical formulation of generative classifiers, (2) How Bayes' theorem transforms learned distributions into classification decisions, (3) The modeling assumptions required and their implications, (4) The key insight that generative models capture the data generation process, and (5) Why this approach provides unique capabilities like handling missing data and generating synthetic samples.
At its core, a generative model attempts to understand and model the process by which data is generated. Rather than asking "What label should this data point have?", a generative model asks "How did this data point come to exist in the first place?"
This philosophical difference has profound practical implications. Let's formalize this intuition mathematically.
Think of generative models as telling a story about data creation. For classification: first, nature "picks" a class (e.g., spam or not spam), then it generates features according to that class's characteristic patterns. Generative models attempt to reverse-engineer this story.
A generative classifier models the joint probability distribution $P(X, Y)$ over features $X$ and labels $Y$. This joint distribution captures everything about the relationship between inputs and outputs—how features and labels co-occur in the data.
The joint distribution can be factored in two equivalent ways:
$$P(X, Y) = P(X|Y) \cdot P(Y) = P(Y|X) \cdot P(X)$$
Generative models leverage the first factorization: they model the class-conditional distribution $P(X|Y)$ (how features are distributed within each class) and the prior $P(Y)$ (how common each class is).
Once we have these two components, we can compute anything we want—including the posterior $P(Y|X)$ needed for classification—via Bayes' theorem.
| Component | Notation | Meaning | How It's Obtained |
|---|---|---|---|
| Class Prior | $P(Y = k)$ | How common is class $k$ in the population? | Estimated from training data frequencies |
| Class-Conditional | $P(X \mid Y = k)$ | What do examples from class $k$ look like? | Modeled explicitly (e.g., Gaussian, multinomial) |
| Evidence | $P(X)$ | How likely is this feature vector overall? | Computed as $\sum_k P(X \mid Y=k) P(Y=k)$ |
| Posterior | $P(Y = k \mid X)$ | Given features $X$, what's the probability of class $k$? | Derived via Bayes' theorem |
The mathematical heart of generative classification is Bayes' theorem, which allows us to invert the conditional probability:
$$P(Y = k | X) = \frac{P(X | Y = k) \cdot P(Y = k)}{P(X)}$$
Let's carefully dissect each component and understand its role in classification.
Since $P(X)$ is constant across all classes, for classification we only need to compute the unnormalized posterior: $P(Y = k | X) \propto P(X | Y = k) \cdot P(Y = k)$. We choose the class that maximizes this product. This avoids computing the potentially expensive normalization constant.
For a $K$-class classification problem, the generative classifier assigns a new input $X$ to the class that maximizes the posterior probability:
$$\hat{y} = \arg\max_{k \in \{1, \ldots, K\}} P(Y = k | X) = \arg\max_{k} P(X | Y = k) \cdot P(Y = k)$$
This is known as the Maximum A Posteriori (MAP) decision rule. It naturally incorporates prior knowledge—if one class is much more common, we need stronger evidence to classify into the rarer class.
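To make the decision rule concrete, here is a tiny numeric sketch; the likelihood and prior values are made up purely for illustration:

```python
import numpy as np

# Hypothetical values for a 3-class problem (illustrative only):
# class-conditional likelihoods P(x | Y=k) evaluated at one input x
likelihoods = np.array([0.02, 0.10, 0.05])
# class priors P(Y=k)
priors = np.array([0.70, 0.20, 0.10])

# Unnormalized posteriors: P(x | Y=k) * P(Y=k)
unnormalized = likelihoods * priors          # [0.014, 0.020, 0.005]

# MAP decision: argmax over unnormalized posteriors (P(x) cancels out)
y_hat = np.argmax(unnormalized)              # class index 1

# Dividing by P(x) = sum_k P(x|Y=k) P(Y=k) recovers proper probabilities
posteriors = unnormalized / unnormalized.sum()
print(y_hat, posteriors)
```

The full classifier below packages this same computation, likelihood times prior followed by normalization in log space, behind a reusable interface.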
```python
import numpy as np
from typing import Dict, List


class GenerativeClassifier:
    """
    Abstract base class for generative classifiers.

    Generative classifiers model P(X|Y) and P(Y), then use Bayes'
    theorem to compute P(Y|X) for classification.
    """

    def __init__(self, classes: List[int]):
        self.classes = classes
        self.priors: Dict[int, float] = {}

    def fit(self, X: np.ndarray, y: np.ndarray) -> 'GenerativeClassifier':
        """
        Learn the model parameters from training data.

        1. Estimate class priors P(Y=k) from class frequencies
        2. Estimate class-conditional distributions P(X|Y=k)
        """
        n_samples = len(y)

        # Step 1: Estimate priors from class frequencies
        for k in self.classes:
            n_k = np.sum(y == k)
            self.priors[k] = n_k / n_samples

        # Step 2: Estimate class-conditional distributions
        self._fit_class_conditionals(X, y)

        return self

    def _fit_class_conditionals(self, X: np.ndarray, y: np.ndarray) -> None:
        """Subclasses implement this to model P(X|Y=k)."""
        raise NotImplementedError

    def _log_likelihood(self, X: np.ndarray, k: int) -> np.ndarray:
        """Compute log P(X|Y=k) for all samples. Subclasses implement this."""
        raise NotImplementedError

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """
        Compute posterior probabilities P(Y=k|X) for all classes.

        Uses Bayes' theorem with log-space computation for numerical stability.
        """
        n_samples = X.shape[0]
        n_classes = len(self.classes)

        # Compute log unnormalized posteriors: log P(X|Y=k) + log P(Y=k)
        log_posteriors = np.zeros((n_samples, n_classes))
        for idx, k in enumerate(self.classes):
            log_likelihood = self._log_likelihood(X, k)  # log P(X|Y=k)
            log_prior = np.log(self.priors[k])           # log P(Y=k)
            log_posteriors[:, idx] = log_likelihood + log_prior

        # Normalize to get proper probabilities using log-sum-exp trick
        log_normalizer = self._logsumexp(log_posteriors, axis=1, keepdims=True)
        log_probs = log_posteriors - log_normalizer
        return np.exp(log_probs)

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict class labels using MAP decision rule."""
        posteriors = self.predict_proba(X)
        class_indices = np.argmax(posteriors, axis=1)
        return np.array([self.classes[idx] for idx in class_indices])

    @staticmethod
    def _logsumexp(a: np.ndarray, axis: int = None, keepdims: bool = False) -> np.ndarray:
        """Numerically stable log-sum-exp for probability normalization."""
        a_max = np.max(a, axis=axis, keepdims=True)
        result = a_max + np.log(np.sum(np.exp(a - a_max), axis=axis, keepdims=True))
        if not keepdims:
            result = np.squeeze(result, axis=axis)
        return result
```

The power and challenge of generative models lie in modeling $P(X | Y = k)$—the distribution of features within each class. This is where modeling assumptions become critical.
In general, if $X$ is a $d$-dimensional feature vector, $P(X | Y = k)$ is a $d$-dimensional probability distribution. Without restrictions, this distribution could be arbitrarily complex, requiring exponentially many parameters to specify.
Consider a simple case: each feature is binary, and we have $d = 20$ features. The full joint distribution $P(X | Y = k)$ over all possible feature combinations would have $2^{20} - 1 \approx 1$ million free parameters per class. This is practically impossible to estimate reliably from typical training set sizes.
This fundamental problem leads to the core insight of tractable generative modeling:
There's no free lunch in generative modeling. We must impose structural assumptions on P(X|Y) to make estimation tractable. These assumptions reduce model complexity but may not match reality. The art lies in choosing assumptions that balance expressiveness with learnability.
Different generative classifiers make different assumptions about $P(X|Y=k)$:
| Model | Assumption about $P(X \mid Y=k)$ | Parameters per Class | Best For |
|---|---|---|---|
| Naive Bayes (Gaussian) | Features independent, each Gaussian: $\prod_j N(x_j; \mu_{kj}, \sigma_{kj}^2)$ | $2d$ (means + variances) | Continuous data, quick baseline |
| Naive Bayes (Multinomial) | Multinomial over word counts | $V$ (vocabulary size) | Text classification |
| LDA (Linear Discriminant) | Multivariate Gaussian with shared covariance | $d + d(d+1)/2$ shared | Linear boundaries, limited data |
| QDA (Quadratic Discriminant) | Multivariate Gaussian with class-specific covariance | $d + d(d+1)/2$ per class | Quadratic boundaries, more data |
| Gaussian Mixture Model | Mixture of Gaussians within each class | $(d + d(d+1)/2 + 1) \times M$ where $M$ = components | Multi-modal class distributions |
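To make the tractability argument in the table concrete, here is a small back-of-the-envelope comparison, a sketch using the 20-binary-feature example from above:

```python
d = 20  # number of binary features

# Unrestricted joint distribution over all feature combinations:
# one probability per combination, minus one for the sum-to-one constraint
full_joint_params = 2 ** d - 1      # 1,048,575 per class

# Naive Bayes: one Bernoulli parameter P(x_j = 1 | Y = k) per feature
naive_bayes_params = d              # 20 per class

print(full_joint_params, naive_bayes_params)
```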
The most fundamental continuous generative model assumes that features within each class follow a multivariate Gaussian distribution:
$$P(X | Y = k) = \mathcal{N}(X; \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}(X - \mu_k)^T \Sigma_k^{-1} (X - \mu_k)\right)$$
Where:
- $\mu_k \in \mathbb{R}^d$ is the mean vector of class $k$,
- $\Sigma_k$ is the $d \times d$ covariance matrix of class $k$ (with determinant $|\Sigma_k|$), and
- $d$ is the number of features.
The exponent $(X - \mu_k)^T \Sigma_k^{-1} (X - \mu_k)$ is the Mahalanobis distance squared—a measure of how far $X$ is from the class center, accounting for the shape of the distribution.
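As a small sketch of how this quadratic form is typically evaluated in NumPy for many points at once (the parameter values below are assumed for illustration):

```python
import numpy as np

# Illustrative class parameters (assumed values)
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

X = np.array([[1.0, 1.0],
              [3.0, -1.0]])           # two query points

precision = np.linalg.inv(Sigma)      # Σ⁻¹
diff = X - mu
# Row-wise quadratic form (x - μ)ᵀ Σ⁻¹ (x - μ) for every sample at once
mahalanobis_sq = np.sum(diff @ precision * diff, axis=1)
print(mahalanobis_sq)
```

The same vectorized trick appears in the `_log_likelihood` method of the full implementation below.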
```python
import numpy as np
from typing import Dict, List


class GaussianGenerativeClassifier:
    """
    Generative classifier with multivariate Gaussian class-conditionals.

    This implements the full QDA model where each class has its own
    mean vector and covariance matrix.
    """

    def __init__(self, classes: List[int], reg_param: float = 1e-6):
        """
        Args:
            classes: List of class labels
            reg_param: Regularization added to covariance diagonal for stability
        """
        self.classes = classes
        self.reg_param = reg_param
        self.priors: Dict[int, float] = {}
        self.means: Dict[int, np.ndarray] = {}
        self.covariances: Dict[int, np.ndarray] = {}

    def fit(self, X: np.ndarray, y: np.ndarray) -> 'GaussianGenerativeClassifier':
        """
        Estimate model parameters via Maximum Likelihood.

        For Gaussian models:
        - Mean MLE: sample mean of class examples
        - Covariance MLE: sample covariance of class examples
        """
        n_samples, n_features = X.shape

        for k in self.classes:
            # Extract samples belonging to class k
            X_k = X[y == k]
            n_k = len(X_k)

            # Prior: relative frequency
            self.priors[k] = n_k / n_samples

            # Mean: MLE is sample mean
            self.means[k] = np.mean(X_k, axis=0)

            # Covariance: MLE is sample covariance (with regularization)
            X_centered = X_k - self.means[k]
            self.covariances[k] = (X_centered.T @ X_centered) / n_k
            # Add regularization to diagonal for numerical stability
            self.covariances[k] += self.reg_param * np.eye(n_features)

        return self

    def _log_likelihood(self, X: np.ndarray, k: int) -> np.ndarray:
        """
        Compute log P(X|Y=k) for multivariate Gaussian.

        log N(x; μ, Σ) = -d/2 log(2π) - 1/2 log|Σ| - 1/2 (x-μ)ᵀΣ⁻¹(x-μ)
        """
        n_samples, n_features = X.shape
        mean = self.means[k]
        cov = self.covariances[k]

        # Compute log determinant
        sign, logdet = np.linalg.slogdet(cov)

        # Compute precision matrix (inverse of covariance)
        precision = np.linalg.inv(cov)

        # Compute Mahalanobis distances
        X_centered = X - mean
        mahalanobis_sq = np.sum(X_centered @ precision * X_centered, axis=1)

        # Log-likelihood
        log_likelihood = (
            -0.5 * n_features * np.log(2 * np.pi)
            - 0.5 * logdet
            - 0.5 * mahalanobis_sq
        )
        return log_likelihood

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """Compute posterior probabilities via Bayes' theorem."""
        n_samples = X.shape[0]
        n_classes = len(self.classes)

        log_posteriors = np.zeros((n_samples, n_classes))
        for idx, k in enumerate(self.classes):
            log_likelihood = self._log_likelihood(X, k)
            log_prior = np.log(self.priors[k])
            log_posteriors[:, idx] = log_likelihood + log_prior

        # Normalize using log-sum-exp
        log_normalizer = np.max(log_posteriors, axis=1, keepdims=True)
        log_posteriors_shifted = log_posteriors - log_normalizer
        posteriors = np.exp(log_posteriors_shifted)
        posteriors /= posteriors.sum(axis=1, keepdims=True)
        return posteriors

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict class labels using MAP decision rule."""
        posteriors = self.predict_proba(X)
        class_indices = np.argmax(posteriors, axis=1)
        return np.array([self.classes[idx] for idx in class_indices])

    def sample(self, n_samples: int, class_label: int = None) -> np.ndarray:
        """
        Generate synthetic samples from the learned model.

        This is a unique capability of generative models!
        """
        if class_label is not None:
            # Sample from specific class
            mean = self.means[class_label]
            cov = self.covariances[class_label]
            return np.random.multivariate_normal(mean, cov, n_samples)
        else:
            # Sample according to prior class distribution
            samples = []
            for k in self.classes:
                n_k = int(n_samples * self.priors[k])
                if n_k > 0:
                    class_samples = np.random.multivariate_normal(
                        self.means[k], self.covariances[k], n_k
                    )
                    samples.append(class_samples)
            return np.vstack(samples) if samples else np.array([])
```

One of the most elegant aspects of generative models is that they tell a story about how data comes to exist. This story has a specific structure that mirrors our factorization $P(X,Y) = P(X|Y) \cdot P(Y)$:
First, nature "chooses" a class according to the prior $P(Y)$. In our spam filter example, each incoming email has some baseline probability of being spam.
Then, nature "generates" features according to $P(X|Y)$. Given that an email is spam, it will contain certain words with certain frequencies; given it's ham (not spam), different word patterns emerge.
We observe only the features $X$, and must infer the hidden class $Y$ that produced them.
This is a latent variable perspective: the class label is the hidden (latent) variable that explains the observed features. Generative modeling is essentially the art of inverting this generative process.
This generative story can be represented as a probabilistic graphical model (specifically, a Bayesian network). The graph structure encodes conditional independence assumptions:
```
Y
|
v
X = (X₁, X₂, ..., X_d)
```
The class $Y$ is the parent of all features. This means that knowing $Y$ "explains" the features—their joint distribution factors according to the graph structure.
For Naive Bayes, we make the additional assumption that features are conditionally independent given the class:
```
       Y
     / | \
    v  v  v
   X₁  X₂ ... X_d
```
This graphical structure encodes: $P(X_1, \ldots, X_d | Y) = \prod_{j=1}^{d} P(X_j | Y)$
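A minimal sketch of how this factorization translates into code, written as a Gaussian Naive Bayes subclass of the `GenerativeClassifier` base class defined earlier (the `var_smoothing` floor on per-feature variances is an added assumption for numerical stability, not part of the model):

```python
import numpy as np
from typing import List


class GaussianNaiveBayes(GenerativeClassifier):
    """Naive Bayes with per-feature Gaussian class-conditionals."""

    def __init__(self, classes: List[int], var_smoothing: float = 1e-9):
        super().__init__(classes)
        self.var_smoothing = var_smoothing
        self.feature_means = {}   # {class k: (d,) array of μ_kj}
        self.feature_vars = {}    # {class k: (d,) array of σ²_kj}

    def _fit_class_conditionals(self, X: np.ndarray, y: np.ndarray) -> None:
        # Under conditional independence, each feature only needs its own
        # per-class mean and variance (2d parameters per class).
        for k in self.classes:
            X_k = X[y == k]
            self.feature_means[k] = X_k.mean(axis=0)
            self.feature_vars[k] = X_k.var(axis=0) + self.var_smoothing

    def _log_likelihood(self, X: np.ndarray, k: int) -> np.ndarray:
        # Conditional independence: log P(x|Y=k) = Σ_j log N(x_j; μ_kj, σ²_kj)
        mean, var = self.feature_means[k], self.feature_vars[k]
        log_densities = (
            -0.5 * np.log(2 * np.pi * var)
            - 0.5 * (X - mean) ** 2 / var
        )
        return log_densities.sum(axis=1)
```

The product over features in the factorization becomes a sum over per-feature log densities, which is exactly what the final `sum(axis=1)` computes.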
Because generative models capture the full data generation process, they can do more than just classify: (1) Generate samples by running the generative story forward, (2) Handle missing data by marginalizing over unknown features, (3) Detect outliers by identifying points with low P(X), (4) Semi-supervised learning by leveraging unlabeled data to better estimate P(X).
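A brief usage sketch of two of these capabilities, assuming the `GaussianGenerativeClassifier` defined earlier and some synthetic two-class data generated here purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class training data (assumed, for illustration)
X0 = rng.multivariate_normal([0, 0], np.eye(2), size=300)
X1 = rng.multivariate_normal([3, 3], np.eye(2), size=100)
X = np.vstack([X0, X1])
y = np.array([0] * 300 + [1] * 100)

model = GaussianGenerativeClassifier(classes=[0, 1]).fit(X, y)

# (1) Generate samples by running the generative story forward
fake = model.sample(10, class_label=1)
print(fake.shape)

# (3) Detect outliers by scoring log P(x) = log Σ_k P(x|Y=k) P(Y=k)
#     (uses the model's internal log-likelihood method directly)
X_test = np.array([[0.5, 0.5], [10.0, -10.0]])
log_joint = np.column_stack([
    model._log_likelihood(X_test, k) + np.log(model.priors[k])
    for k in model.classes
])
log_px = np.logaddexp(log_joint[:, 0], log_joint[:, 1])
print(log_px)  # the second point receives a far lower log P(x)
```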
Once we've chosen a parametric form for $P(X|Y)$ and $P(Y)$, we need to estimate the parameters from data. The dominant approach is Maximum Likelihood Estimation (MLE).
Given a training set $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\}$ of i.i.d. samples, the likelihood is:
$$L(\theta) = \prod_{i=1}^{n} P(X = x^{(i)}, Y = y^{(i)} | \theta) = \prod_{i=1}^{n} P(X = x^{(i)} | Y = y^{(i)}, \theta) \cdot P(Y = y^{(i)} | \theta)$$
The MLE finds parameters $\hat{\theta}$ that maximize this likelihood (or equivalently, the log-likelihood).
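Taking logs makes the structure of this optimization clear: the class-conditional parameters and the prior parameters appear in separate sums, so each piece can be maximized independently.

$$\log L(\theta) = \sum_{i=1}^{n} \log P(x^{(i)} \mid y^{(i)}, \theta) + \sum_{i=1}^{n} \log P(y^{(i)} \mid \theta)$$

This decoupling is why the MLE recipes below can treat the priors and each class's conditional distribution as separate estimation problems.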
The MLE for class priors is simply the empirical frequency:
$$\hat{P}(Y = k) = \frac{n_k}{n}$$
where $n_k$ is the number of training examples with label $k$, and $n$ is the total number of examples.
This is intuitive: if 30% of training emails are spam, our best estimate for the prior probability of spam is 0.3.
For Gaussian class-conditionals, the MLEs are:
$$\hat{\mu}_k = \frac{1}{n_k} \sum_{i: y^{(i)} = k} x^{(i)}$$
$$\hat{\Sigma}_k = \frac{1}{n_k} \sum_{i: y^{(i)} = k} (x^{(i)} - \hat{\mu}_k)(x^{(i)} - \hat{\mu}_k)^T$$
These are just the sample mean and sample covariance of the examples belonging to each class.
When classes have few training examples, MLEs can be unreliable. Covariance matrices may be singular (non-invertible), and probability estimates may be extreme (0 or 1). Common remedies include: (1) regularization (shrinkage), (2) parameter sharing across classes (as in LDA), (3) Bayesian estimation with priors on parameters, and (4) Laplace smoothing for discrete distributions.
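A minimal sketch of remedy (1), shrinking each class covariance toward a scaled identity target; the shrinkage intensity `alpha` is an assumed hyperparameter, not something prescribed above:

```python
import numpy as np

def shrink_covariance(S: np.ndarray, alpha: float = 0.1) -> np.ndarray:
    """Blend the sample covariance S with a scaled identity target.

    alpha = 0 returns the raw MLE; alpha = 1 returns a spherical covariance
    with the same average variance. Intermediate values trade a little bias
    for a better-conditioned, always-invertible estimate.
    """
    d = S.shape[0]
    target = (np.trace(S) / d) * np.eye(d)
    return (1 - alpha) * S + alpha * target

# Example: a nearly singular 2x2 sample covariance
S = np.array([[1.0, 0.999],
              [0.999, 1.0]])
print(np.linalg.cond(S), np.linalg.cond(shrink_covariance(S)))
```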
| Model | Parameters | MLE Formula |
|---|---|---|
| Prior | $\pi_k = P(Y=k)$ | $\hat{\pi}_k = n_k / n$ |
| Gaussian mean | $\mu_k$ | $\hat{\mu}_k = (1/n_k) \sum_{i:y^{(i)}=k} x^{(i)}$ |
| Gaussian covariance | $\Sigma_k$ | $\hat{\Sigma}_k = (1/n_k) \sum_{i:y^{(i)}=k} (x^{(i)}-\hat{\mu}_k)(x^{(i)}-\hat{\mu}_k)^T$ |
| Multinomial (word) | $\theta_{kw}$ | $(\text{count of word } w \text{ in class } k) / (\text{total words in class } k)$ |
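For the multinomial row, a brief sketch of the smoothed estimate, using Laplace smoothing with pseudo-count 1 as mentioned in the remedies above (the toy counts are illustrative only):

```python
import numpy as np

def multinomial_mle(word_counts: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Estimate θ_kw = P(word w | class k) from a (K, V) count matrix.

    alpha = 0 gives the raw MLE (counts / total words in class);
    alpha = 1 is Laplace smoothing, which prevents unseen words
    from receiving probability exactly zero.
    """
    smoothed = word_counts + alpha
    return smoothed / smoothed.sum(axis=1, keepdims=True)

# Toy counts: 2 classes, vocabulary of 4 words (illustrative numbers)
counts = np.array([[10, 0, 3, 7],
                   [ 1, 8, 8, 3]])
print(multinomial_mle(counts))
```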
Let's consolidate everything into a complete picture of generative classification, from raw data to predictions.
Let's trace through a concrete example. Suppose we have two classes with the following learned parameters:
Class 0 (circles):
Class 1 (triangles):
For a new point $x = [3.5, 3.5]$:
Note how the prior for class 0 partially offsets its lower likelihood, but class 1 still wins due to its much better fit to the data.
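The exact parameter values for this example are not reproduced here, so the sketch below uses assumed placeholder parameters chosen to be consistent with the outcome described above (class 0 more common a priori, class 1 a much better fit at $x = [3.5, 3.5]$):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Assumed illustrative parameters (placeholders, not the original values)
priors = {0: 0.6, 1: 0.4}
means = {0: np.array([1.0, 1.0]), 1: np.array([3.0, 3.0])}
covs = {0: np.eye(2), 1: np.eye(2)}

x = np.array([3.5, 3.5])

# Unnormalized posteriors P(x | Y=k) * P(Y=k)
scores = {
    k: multivariate_normal.pdf(x, mean=means[k], cov=covs[k]) * priors[k]
    for k in (0, 1)
}
total = sum(scores.values())
posteriors = {k: s / total for k, s in scores.items()}
print(posteriors)   # class 1 dominates despite its smaller prior
```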
Because generative models explicitly model $P(X|Y)$ and $P(Y)$, they unlock capabilities that discriminative models cannot easily provide.
These unique capabilities come with a cost: generative models require stronger assumptions about the data distribution. If P(X|Y) is badly misspecified, generative models can perform worse than discriminative alternatives that make minimal assumptions. This tradeoff is central to the generative vs discriminative debate.
We've now established the complete theoretical foundation of generative classification. Let's consolidate the key insights:
What's next:
Now that we understand the generative paradigm, we'll contrast it with the discriminative approach in the next page. Discriminative models take a fundamentally different philosophy: they model $P(Y|X)$ directly without ever modeling how data is generated. This leads to different tradeoffs in accuracy, sample efficiency, and flexibility.
You now understand the fundamental theory of generative classifiers: how they model the joint distribution, use Bayes' theorem for inference, and what unique capabilities this approach provides. Next, we'll explore the contrasting discriminative paradigm.