The Bayes decision rule is elegantly simple: classify to the class with maximum posterior probability. But how do we actually compute these posteriors? The answer involves one of the most important formulas in all of statistics—Bayes' theorem—and reveals two fundamentally different approaches to classification.
This page bridges theory and practice. We'll see exactly how to compute posteriors, confront the numerical challenges that arise, and understand why the choice between generative and discriminative modeling matters.
By the end of this page, you will master the application of Bayes' theorem for posterior computation, understand the numerical challenges and solutions for stable computation, appreciate the distinction between generative and discriminative approaches, and see how different modeling choices affect what we can compute and infer.
At the heart of posterior computation lies Bayes' theorem, one of the most important results in probability theory. For classification, it takes the form:
$$P(Y = k | X = x) = \frac{P(X = x | Y = k) \cdot P(Y = k)}{P(X = x)}$$
Or using our notation:
$$\eta_k(x) = \frac{p_k(x) \cdot \pi_k}{p(x)} = \frac{p_k(x) \cdot \pi_k}{\sum_{j=1}^K p_j(x) \cdot \pi_j}$$
Let's dissect each component:
| Component | Symbol | Name | How to Obtain |
|---|---|---|---|
| Posterior | $P(Y=k \mid X=x)$ | What we want | Compute via Bayes' theorem |
| Likelihood | $P(X=x \mid Y=k)$ | Class-conditional | Model and estimate from class $k$ samples |
| Prior | $P(Y=k)$ | Class prior | Estimate as class proportions |
| Evidence | $P(X=x)$ | Marginal | Sum over all class contributions |
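As a quick sanity check of the formula, here is a minimal numeric sketch with made-up priors and density values (the numbers are purely illustrative):

```python
import numpy as np

# Hypothetical two-class example at a single point x
priors = np.array([0.7, 0.3])         # pi_0, pi_1 (assumed)
likelihoods = np.array([0.05, 0.20])  # p_0(x), p_1(x) (assumed density values)

joint = priors * likelihoods          # pi_k * p_k(x)
evidence = joint.sum()                # p(x) = sum_j pi_j * p_j(x)
posterior = joint / evidence          # eta_k(x)

print(posterior)        # [0.368 0.632]
print(posterior.sum())  # 1.0
```

Even though class 1 has the smaller prior, its larger likelihood at $x$ gives it the larger posterior.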
The Key Insight: Inversion of Conditioning
Bayes' theorem "inverts" the direction of conditioning. We want $P(Y|X)$—the probability of the class given features. But it's often easier to model $P(X|Y)$—the probability of features given the class.
Why is $P(X|Y)$ easier to model? Because each class-conditional can be estimated separately, using only the samples from that class, and we can pick a familiar parametric family (such as a Gaussian) to describe how features arise within the class.
When we model $P(X|Y)$ and $P(Y)$, we're building a generative model—one that describes how data is generated. Given a class, we model how features arise. This is in contrast to discriminative models that directly model $P(Y|X)$ without specifying the feature distribution.
The generative approach computes posteriors by modeling the full joint distribution $P(X, Y) = P(X|Y) \cdot P(Y)$. The steps are:
Step 1: Estimate Priors $$\hat{\pi}_k = \frac{n_k}{n} = \frac{\text{number of class } k \text{ samples}}{\text{total samples}}$$
This is the maximum likelihood estimate of the class prior.
Step 2: Model Class-Conditionals
Choose a parametric family for $p_k(x; \theta_k)$ and estimate $\theta_k$ from the class $k$ samples (for example, a Gaussian with mean $\mu_k$ and covariance $\Sigma_k$).
Step 3: Compute Posteriors $$\hat{\eta}_k(x) = \frac{\hat{\pi}_k \cdot \hat{p}_k(x)}{\sum_j \hat{\pi}_j \cdot \hat{p}_j(x)}$$
```python
import numpy as np
from scipy.stats import multivariate_normal


class GenerativeClassifier:
    """
    A generative classifier that computes posteriors via Bayes' theorem.
    Models class-conditionals as multivariate Gaussians.
    """

    def __init__(self, shared_covariance: bool = True):
        """
        Args:
            shared_covariance: If True, use a shared covariance (LDA).
                               If False, use class-specific covariances (QDA).
        """
        self.shared_covariance = shared_covariance
        self.classes_ = None
        self.priors_ = None
        self.means_ = None
        self.covariances_ = None
        self.shared_cov_ = None

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Estimate class priors and class-conditional distributions."""
        self.classes_ = np.unique(y)
        n_classes = len(self.classes_)
        n_samples, n_features = X.shape

        # Step 1: Estimate priors
        self.priors_ = np.array([np.mean(y == c) for c in self.classes_])

        # Step 2: Estimate class-conditional Gaussians
        self.means_ = np.zeros((n_classes, n_features))

        if self.shared_covariance:
            # Pooled covariance estimate (LDA)
            self.shared_cov_ = np.zeros((n_features, n_features))
            for idx, c in enumerate(self.classes_):
                X_c = X[y == c]
                self.means_[idx] = X_c.mean(axis=0)
                self.shared_cov_ += (X_c - self.means_[idx]).T @ (X_c - self.means_[idx])
            self.shared_cov_ /= (n_samples - n_classes)
        else:
            # Class-specific covariances (QDA)
            self.covariances_ = []
            for idx, c in enumerate(self.classes_):
                X_c = X[y == c]
                self.means_[idx] = X_c.mean(axis=0)
                cov_c = np.cov(X_c.T) + 1e-6 * np.eye(n_features)  # Regularize
                self.covariances_.append(cov_c)

        return self

    def _log_likelihood(self, X: np.ndarray) -> np.ndarray:
        """Compute log p(X | Y=k) for each class."""
        n_samples = X.shape[0]
        n_classes = len(self.classes_)
        log_likelihoods = np.zeros((n_samples, n_classes))

        for idx in range(n_classes):
            if self.shared_covariance:
                cov = self.shared_cov_
            else:
                cov = self.covariances_[idx]
            mvn = multivariate_normal(mean=self.means_[idx], cov=cov)
            log_likelihoods[:, idx] = mvn.logpdf(X)

        return log_likelihoods

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """
        Compute posterior probabilities P(Y=k | X=x) for each class.

        Uses Bayes' theorem: P(Y|X) = P(X|Y) * P(Y) / P(X)
        """
        log_likelihoods = self._log_likelihood(X)  # log P(X|Y)
        log_priors = np.log(self.priors_)          # log P(Y)

        # log P(X,Y) = log P(X|Y) + log P(Y)
        log_joint = log_likelihoods + log_priors

        # Normalize via log-sum-exp for numerical stability
        # P(Y|X) = P(X,Y) / P(X) where P(X) = sum_k P(X,Y=k)
        log_evidence = self._logsumexp(log_joint, axis=1, keepdims=True)
        log_posterior = log_joint - log_evidence

        return np.exp(log_posterior)

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict class labels using the MAP decision rule."""
        posteriors = self.predict_proba(X)
        return self.classes_[np.argmax(posteriors, axis=1)]

    @staticmethod
    def _logsumexp(a: np.ndarray, axis: int = None, keepdims: bool = False):
        """Numerically stable log-sum-exp computation."""
        a_max = np.max(a, axis=axis, keepdims=True)
        result = a_max + np.log(np.sum(np.exp(a - a_max), axis=axis, keepdims=True))
        if not keepdims:
            result = np.squeeze(result, axis=axis)
        return result


# Example usage
if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    # Generate synthetic data
    X, y = make_classification(n_samples=300, n_features=2, n_redundant=0,
                               n_informative=2, n_clusters_per_class=1,
                               random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

    # Fit and predict
    clf = GenerativeClassifier(shared_covariance=True)
    clf.fit(X_train, y_train)

    # Get posteriors for a few test points
    posteriors = clf.predict_proba(X_test[:5])
    predictions = clf.predict(X_test[:5])

    print("Posterior probabilities (first 5 test points):")
    for i, (post, pred, true) in enumerate(zip(posteriors, predictions, y_test[:5])):
        print(f"  Point {i}: P(Y=0)={post[0]:.3f}, P(Y=1)={post[1]:.3f} -> Pred={pred}, True={true}")
```

The generative classifier implements the complete Bayes' theorem pipeline: estimate priors, model class-conditionals, then compute posteriors by normalization. This approach gives us full probabilistic predictions, not just class labels.
Computing posteriors via Bayes' theorem involves products and ratios of probabilities, which can lead to serious numerical issues.
Problem 1: Underflow
Probability densities for high-dimensional data can be astronomically small. With $d = 1000$ features each contributing a factor of roughly $0.1$: $$p(x) = \prod_{j=1}^d p(x_j) \approx (0.1)^{1000} = 10^{-1000}$$
This is far below the smallest representable float (~$10^{-308}$). The product underflows to zero.
Solution: Work in Log Space $$\log p(x) = \sum_{j=1}^d \log p(x_j)$$
Sums of logs don't underflow. We do all arithmetic in log space and exponentiate only at the end.
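A tiny sketch of the difference (the density values are simulated and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# 1000 per-feature density values scattered around 0.1 (illustrative only)
densities = 0.1 * np.exp(0.1 * rng.standard_normal(1000))

naive_product = np.prod(densities)       # underflows to exactly 0.0
log_density = np.sum(np.log(densities))  # finite, roughly -2300

print(naive_product)  # 0.0
print(log_density)    # e.g. -2300.1
```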
Problem 2: Normalization in Log Space
To compute posteriors, we need to divide by the evidence: $$\eta_k(x) = \frac{\pi_k \cdot p_k(x)}{\sum_j \pi_j \cdot p_j(x)}$$
In log space, this requires computing: $$\log \eta_k(x) = \log(\pi_k \cdot p_k(x)) - \log\left(\sum_j \pi_j \cdot p_j(x)\right)$$
The difficulty is computing $\log(\sum_j e^{a_j})$ where $a_j = \log(\pi_j \cdot p_j(x))$.
Solution: The Log-Sum-Exp Trick $$\log\left(\sum_j e^{a_j}\right) = a_{\max} + \log\left(\sum_j e^{a_j - a_{\max}}\right)$$
By subtracting the maximum, the largest exponent becomes $0$, so no term overflows; and since that term equals $e^0 = 1$, the sum is at least 1 and cannot underflow to zero before we take the log.
```python
import numpy as np


def log_sum_exp(log_values: np.ndarray, axis: int = None) -> np.ndarray:
    """
    Compute log(sum(exp(log_values))) in a numerically stable way.

    Key insight: log(sum(exp(a))) = max(a) + log(sum(exp(a - max(a))))

    This prevents:
    - Overflow: exp(large) -> inf
    - Underflow: exp(very negative) -> 0, losing precision

    Args:
        log_values: Array of log values
        axis: Axis along which to sum

    Returns:
        log(sum(exp(log_values)))
    """
    max_val = np.max(log_values, axis=axis, keepdims=True)
    # Subtract max, exponentiate, sum, log, add max back
    stable_sum = max_val + np.log(
        np.sum(np.exp(log_values - max_val), axis=axis, keepdims=True)
    )
    return np.squeeze(stable_sum, axis=axis)


def compute_log_posterior(
    log_likelihoods: np.ndarray,  # Shape: (n_samples, n_classes)
    log_priors: np.ndarray        # Shape: (n_classes,)
) -> np.ndarray:
    """
    Compute log posterior probabilities stably.

    log P(Y=k|X) = log P(X|Y=k) + log P(Y=k) - log P(X)
    where log P(X) = log_sum_exp over classes

    Args:
        log_likelihoods: log P(X|Y=k) for each sample and class
        log_priors: log P(Y=k) for each class

    Returns:
        log P(Y=k|X) for each sample and class
    """
    # log P(X, Y=k) = log P(X|Y=k) + log P(Y=k)
    log_joint = log_likelihoods + log_priors  # Broadcasting over samples

    # log P(X) = log sum_k P(X, Y=k)
    log_evidence = log_sum_exp(log_joint, axis=1)

    # log P(Y=k|X) = log P(X, Y=k) - log P(X)
    log_posterior = log_joint - log_evidence[:, np.newaxis]

    return log_posterior


# Demonstration: Why stability matters
print("=== Numerical Stability Demonstration ===\n")

# High-dimensional scenario where naive computation fails:
# log-likelihoods around -1000 (e.g. ~1000 features, each contributing ~log(0.1))
n_classes = 3
np.random.seed(42)
log_likelihoods = -1000 + np.random.randn(1, n_classes) * 10
log_priors = np.log([0.3, 0.3, 0.4])

print(f"Log-likelihoods: {log_likelihoods[0]}")
print(f"Log-priors: {log_priors}")

# Naive computation fails silently: exp(-1000) underflows to 0.0,
# so normalization divides 0 by 0 and produces NaNs
with np.errstate(invalid="ignore", divide="ignore"):
    naive_unnormalized = np.exp(log_likelihoods) * np.exp(log_priors)
    naive_posterior = naive_unnormalized / naive_unnormalized.sum()
print(f"\nNaive unnormalized joint: {naive_unnormalized[0]}")  # all zeros (underflow)
print(f"Naive posterior: {naive_posterior[0]}")                # all nan

# Stable computation
log_posterior = compute_log_posterior(log_likelihoods, log_priors)
posterior = np.exp(log_posterior)
print(f"\nStable posterior: {posterior[0]}")
print(f"Sum (should be 1.0): {posterior.sum():.6f}")
```

For any serious implementation of probabilistic classifiers, work in log space throughout. Convert to probabilities only at the final output if needed. This single practice prevents an entire class of numerical bugs.
The denominator $P(X = x)$ in Bayes' theorem is called the evidence (or marginal likelihood). It plays several important roles.
Role 1: Normalization Constant
For classification with a fixed $x$, the evidence is the same for all classes. It ensures posteriors sum to 1: $$\sum_k \eta_k(x) = \sum_k \frac{\pi_k \cdot p_k(x)}{p(x)} = \frac{p(x)}{p(x)} = 1$$
For MAP classification, we can ignore $p(x)$ because: $$\arg\max_k \eta_k(x) = \arg\max_k \frac{\pi_k \cdot p_k(x)}{p(x)} = \arg\max_k \pi_k \cdot p_k(x)$$
The argmax is unchanged by the constant divisor.
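A minimal sketch of this invariance, using made-up log-likelihoods:

```python
import numpy as np

log_priors = np.log([0.5, 0.3, 0.2])
log_likelihoods = np.array([-420.0, -418.5, -425.0])  # illustrative log p_k(x)

log_joint = log_likelihoods + log_priors  # unnormalized log posterior
print(np.argmax(log_joint))               # 1 -- MAP decision needs no evidence term

# Normalizing changes the values but never the argmax
log_posterior = log_joint - np.logaddexp.reduce(log_joint)
print(np.argmax(log_posterior))           # still 1
```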
Role 2: Model Selection
When comparing different models $M_1, M_2, \ldots$ for the class-conditionals, the evidence becomes crucial: $$P(M_i | X) \propto P(X | M_i) \cdot P(M_i)$$
Here $P(X | M_i) = \int P(X | \theta, M_i) P(\theta | M_i) d\theta$ is the model evidence—how well the model explains the data, averaged over parameter uncertainty.
Higher evidence → Model fits data well without overfitting.
Role 3: Outlier Detection
The evidence $p(x) = \sum_k \pi_k \cdot p_k(x)$ measures how likely $x$ is under all classes combined. Very low $p(x)$ suggests $x$ is an outlier—unlike any training data.
$$\text{Outlier score}(x) = -\log p(x)$$
High score → $x$ is unusual; predictions may be unreliable.
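A sketch of how this could look in code, assuming a `GenerativeClassifier` fitted as in the earlier example (the `outlier_score` helper and the 99th-percentile threshold are illustrative choices):

```python
import numpy as np

def outlier_score(clf, X):
    """-log p(x) computed from a fitted GenerativeClassifier (earlier sketch)."""
    log_joint = clf._log_likelihood(X) + np.log(clf.priors_)  # log pi_k + log p_k(x)
    log_evidence = clf._logsumexp(log_joint, axis=1)          # log p(x)
    return -log_evidence                                      # high => unusual input

# Usage (assuming clf and X_test from the earlier example):
# scores = outlier_score(clf, X_test)
# flagged = X_test[scores > np.quantile(scores, 0.99)]
```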
When do you actually need the evidence?

| Task | Do you need $p(x)$? |
|---|---|
| Classification (MAP) | No; work with unnormalized posteriors |
| Probability calibration | Yes; compute it for proper posteriors |
| Model comparison | Essential; it's the key quantity |
| Outlier detection | Very useful; low evidence signals unusual inputs |
There are two fundamental approaches to computing posteriors for classification:
Generative Approach: model the prior $P(Y)$ and the class-conditionals $P(X|Y)$, then obtain the posterior through Bayes' theorem (e.g., LDA/QDA, Naive Bayes).
Discriminative Approach: model the posterior $P(Y|X)$ directly, without specifying how the features are distributed (e.g., logistic regression, neural network classifiers).
Both approaches ultimately target the same quantity—the posterior—but via different routes.
The Ng-Jordan Analysis (2001)
A famous paper by Andrew Ng and Michael Jordan compared logistic regression with Naive Bayes and found a trade-off: the generative model approaches its asymptotic error with far fewer training samples, while the discriminative model attains a lower asymptotic error when data is plentiful.
The intuition: Generative models make stronger assumptions. If correct, these assumptions extract more information from limited data. If wrong, they introduce bias that discriminative models avoid.
Practical Guidance: with limited training data, or when you also need to handle missing features, detect outliers, or generate samples, the generative route is attractive; with ample data and pure predictive accuracy as the goal, discriminative models usually win.
For generative classifiers, we need to estimate priors and class-conditional distributions from training data.
Estimating Priors:
The maximum likelihood estimate: $$\hat{\pi}_k = \frac{n_k}{n}$$
With a Dirichlet prior (Bayesian estimate with pseudo-counts $\alpha_k$): $$\hat{\pi}_k = \frac{n_k + \alpha_k}{n + \sum_j \alpha_j}$$
Uniform prior ($\alpha_k = 1$) is Laplace smoothing: $$\hat{\pi}_k = \frac{n_k + 1}{n + K}$$
This prevents zero probabilities when a class has few or no training samples.
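A small sketch of both estimators (the `estimate_priors` helper is illustrative):

```python
import numpy as np

def estimate_priors(y, classes, alpha=1.0):
    """Smoothed class priors: (n_k + alpha) / (n + K * alpha); alpha=0 gives the MLE."""
    counts = np.array([np.sum(y == c) for c in classes], dtype=float)
    return (counts + alpha) / (counts.sum() + alpha * len(classes))

y = np.array([0, 0, 0, 1, 1, 2])                 # class 2 is rare
print(estimate_priors(y, [0, 1, 2], alpha=0.0))  # MLE:      [0.5   0.333 0.167]
print(estimate_priors(y, [0, 1, 2], alpha=1.0))  # smoothed: [0.444 0.333 0.222]
```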
Estimating Class-Conditionals:
The choice of model for $p_k(x)$ is critical:
1. Gaussian (LDA/QDA) $$\hat{\mu}_k = \frac{1}{n_k} \sum_{i: y_i = k} x_i, \qquad \hat{\Sigma}_k = \frac{1}{n_k - 1} \sum_{i: y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T$$
For LDA, pool covariances: $\hat{\Sigma} = \frac{1}{n-K} \sum_k \sum_{i: y_i=k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T$
2. Multinomial (for counts) $$\hat{\theta}_{kj} = \frac{\sum_{i: y_i = k} x_{ij} + \alpha}{\sum_{l=1}^{d} \sum_{i: y_i = k} x_{il} + d\alpha}$$
where $\alpha$ is the smoothing parameter.
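A minimal sketch of this estimator on a toy word-count matrix (the helper name and data are illustrative):

```python
import numpy as np

def estimate_multinomial_params(X, y, classes, alpha=1.0):
    """Smoothed per-class feature probabilities theta_{kj} for count data X."""
    d = X.shape[1]
    theta = np.zeros((len(classes), d))
    for k, c in enumerate(classes):
        counts = X[y == c].sum(axis=0)                # total count of feature j in class c
        theta[k] = (counts + alpha) / (counts.sum() + d * alpha)
    return theta

# Toy word-count matrix: 4 documents, 3 vocabulary terms
X = np.array([[2, 0, 1],
              [3, 1, 0],
              [0, 2, 2],
              [0, 1, 3]])
y = np.array([0, 0, 1, 1])
print(estimate_multinomial_params(X, y, classes=[0, 1]))  # each row sums to 1
```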
3. Kernel Density Estimation (non-parametric) $$\hat{p}_k(x) = \frac{1}{n_k h^d} \sum_{i: y_i = k} K\left(\frac{x - x_i}{h}\right)$$
Flexible but computationally expensive and suffers in high dimensions.
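A sketch using SciPy's `gaussian_kde` as the density estimator (the helper functions are illustrative; `gaussian_kde` picks its bandwidth automatically and needs each class to have more samples than features):

```python
import numpy as np
from scipy.stats import gaussian_kde

def fit_kde_class_conditionals(X, y, classes):
    """One KDE per class; gaussian_kde expects data with shape (d, n_k)."""
    return {c: gaussian_kde(X[y == c].T) for c in classes}

def kde_log_posterior(kdes, priors, X, classes):
    """log P(Y=k | x) from KDE class-conditionals and class priors."""
    log_joint = np.column_stack(
        [kdes[c].logpdf(X.T) + np.log(priors[i]) for i, c in enumerate(classes)]
    )
    log_evidence = np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
    return log_joint - log_evidence
```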
| Model | Assumption | Parameters | Best For |
|---|---|---|---|
| Gaussian (shared cov) | Features normal, same spread | $O(d^2)$ | Continuous, few features |
| Gaussian (per-class) | Features normal, different spread | $O(Kd^2)$ | When class shapes differ |
| Multinomial | Counts from a multinomial distribution | $O(Kd)$ | Text, discrete data |
| Bernoulli | Binary features | $O(Kd)$ | Binary attributes |
| Kernel Density | Smooth but otherwise unknown density | Stores all training data | Low-dimensional, non-Gaussian data |
Building a robust posterior computation pipeline requires attention to several practical details.
1. Handling Zero Probabilities
If $p_k(x) = 0$ for some class $k$, the posterior is zero regardless of the prior. This can happen with discrete models when a feature value never appears in class $k$ during training, or with overly rigid distributional assumptions that assign no mass to parts of the feature space.
Solution: Smoothing (add pseudo-counts) or use more flexible distributions.
2. Prior Adjustment for Class Imbalance
Training class proportions may not match deployment. If training is balanced but deployment is 99/1, replace the estimated priors with the deployment proportions: because the posterior factors into likelihood times prior, only the prior term needs to change, followed by re-normalization.
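A sketch of this adjustment (the `adjust_priors` helper and the numbers are illustrative):

```python
import numpy as np

def adjust_priors(log_posterior_train, train_priors, deploy_priors):
    """Re-weight posteriors from training priors to deployment priors, then re-normalize."""
    log_adj = log_posterior_train + np.log(deploy_priors) - np.log(train_priors)
    log_adj -= np.logaddexp.reduce(log_adj, axis=1, keepdims=True)
    return np.exp(log_adj)

# Posterior [0.4, 0.6] under balanced training priors, deployed where class 1 is rare
log_post = np.log(np.array([[0.4, 0.6]]))
print(adjust_priors(log_post, np.array([0.5, 0.5]), np.array([0.99, 0.01])))
# -> [[0.985 0.015]]: the rare class now needs much stronger evidence
```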
3. Singular Covariance Matrices
With high-dimensional Gaussian models, $\hat{\Sigma}$ may be singular (non-invertible) when $n_k < d$. Remedies include adding a ridge term $\hat{\Sigma} + \lambda I$, using a diagonal or shrinkage estimator, or reducing dimensionality before fitting.
4. Probability Calibration
Posteriors from generative models may be poorly calibrated (predicted probabilities don't match observed frequencies). Apply calibration techniques such as Platt scaling or isotonic regression, fit on a held-out validation set.
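A minimal Platt-scaling sketch for the binary case, assuming held-out scores from the model (e.g., the log-odds $\log P(Y=1 \mid x) - \log P(Y=0 \mid x)$); the helper and variable names are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def platt_calibrate(val_scores, y_val):
    """Fit a 1-D logistic regression mapping raw scores to calibrated probabilities."""
    calibrator = LogisticRegression()
    calibrator.fit(val_scores.reshape(-1, 1), y_val)
    return calibrator

# Usage (hypothetical arrays):
# calibrator = platt_calibrate(val_scores, y_val)
# calibrated_p1 = calibrator.predict_proba(test_scores.reshape(-1, 1))[:, 1]
```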
```python
import numpy as np
from typing import Optional


class RobustGaussianClassifier:
    """
    Gaussian classifier with practical robustness features:
    - Covariance regularization to prevent singularity
    - Prior adjustment for class imbalance
    - Smoothing to handle rare classes
    """

    def __init__(
        self,
        reg_param: float = 1e-6,
        prior_smoothing: float = 1.0,
        user_priors: Optional[np.ndarray] = None
    ):
        self.reg_param = reg_param
        self.prior_smoothing = prior_smoothing
        self.user_priors = user_priors

    def fit(self, X: np.ndarray, y: np.ndarray):
        self.classes_ = np.unique(y)
        n_classes = len(self.classes_)
        n_samples, n_features = X.shape

        # Smoothed prior estimation
        if self.user_priors is not None:
            self.log_priors_ = np.log(self.user_priors)
        else:
            # Laplace-smoothed priors
            class_counts = np.array([np.sum(y == c) for c in self.classes_])
            smoothed_counts = class_counts + self.prior_smoothing
            self.log_priors_ = np.log(smoothed_counts / smoothed_counts.sum())

        # Fit Gaussians with regularization
        self.means_ = []
        self.precisions_ = []      # Store inverse covariance for efficiency
        self.log_det_covs_ = []

        for c in self.classes_:
            X_c = X[y == c]
            mean = X_c.mean(axis=0)

            # Sample covariance with regularization
            cov = np.cov(X_c.T) if X_c.shape[0] > 1 else np.eye(n_features)
            cov += self.reg_param * np.eye(n_features)  # Regularize

            # Compute precision and log determinant
            try:
                precision = np.linalg.inv(cov)
                sign, log_det = np.linalg.slogdet(cov)
                log_det_cov = log_det if sign > 0 else np.inf
            except np.linalg.LinAlgError:
                # Fallback: use a diagonal approximation
                cov_diag = np.var(X_c, axis=0) + self.reg_param
                precision = np.diag(1.0 / cov_diag)
                log_det_cov = np.sum(np.log(cov_diag))

            self.means_.append(mean)
            self.precisions_.append(precision)
            self.log_det_covs_.append(log_det_cov)

        return self

    def _log_likelihood(self, X: np.ndarray) -> np.ndarray:
        """Compute log p(x | y=k) for each class."""
        n_samples = X.shape[0]
        n_classes = len(self.classes_)
        d = X.shape[1]
        log_likelihoods = np.zeros((n_samples, n_classes))

        for k in range(n_classes):
            diff = X - self.means_[k]
            # Mahalanobis distance: (x - μ)ᵀΣ⁻¹(x - μ)
            mahal = np.sum((diff @ self.precisions_[k]) * diff, axis=1)
            # Log Gaussian: -0.5 * (d*log(2π) + log|Σ| + mahal)
            log_likelihoods[:, k] = -0.5 * (
                d * np.log(2 * np.pi) + self.log_det_covs_[k] + mahal
            )

        return log_likelihoods

    def predict_log_proba(self, X: np.ndarray) -> np.ndarray:
        """Return log posteriors (more numerically stable)."""
        log_likelihood = self._log_likelihood(X)
        log_joint = log_likelihood + self.log_priors_
        log_evidence = self._logsumexp(log_joint, axis=1, keepdims=True)
        return log_joint - log_evidence

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """Return posterior probabilities."""
        return np.exp(self.predict_log_proba(X))

    @staticmethod
    def _logsumexp(a, axis=None, keepdims=False):
        a_max = np.max(a, axis=axis, keepdims=True)
        result = a_max + np.log(np.sum(np.exp(a - a_max), axis=axis, keepdims=True))
        return result if keepdims else np.squeeze(result, axis=axis)
```

We've now covered the essential techniques for computing posterior probabilities: apply Bayes' theorem, work in log space for numerical stability, and estimate priors and class-conditionals with enough regularization and smoothing to stay robust.
What's Next:
We've seen how to compute posteriors when we can model class-conditional distributions. But this is challenging in high-dimensional spaces—the curse of dimensionality makes density estimation hard. In the next page, we'll explore the practical challenges that make the Bayes classifier difficult to implement directly, setting the stage for Naive Bayes as a practical approximation.
You now understand how to compute posterior probabilities via Bayes' theorem, the numerical techniques for stable computation, and the generative approach to classification. These foundations prepare you for understanding both the promise and the practical challenges of Bayesian classification.