The Bayes decision rule is elegantly simple: classify to the class with maximum posterior probability. But how do we actually compute these posteriors? The answer involves one of the most important formulas in all of statistics—Bayes' theorem—and reveals two fundamentally different approaches to classification.
This page bridges theory and practice. We'll see exactly how to compute posteriors, confront the numerical challenges that arise, and understand why the choice between generative and discriminative modeling matters.
By the end of this page, you will master the application of Bayes' theorem for posterior computation, understand the numerical challenges and solutions for stable computation, appreciate the distinction between generative and discriminative approaches, and see how different modeling choices affect what we can compute and infer.
At the heart of posterior computation lies Bayes' theorem, one of the most important results in probability theory. For classification, it takes the form:
$$P(Y = k | X = x) = \frac{P(X = x | Y = k) \cdot P(Y = k)}{P(X = x)}$$
Or using our notation:
$$\eta_k(x) = \frac{p_k(x) \cdot \pi_k}{p(x)} = \frac{p_k(x) \cdot \pi_k}{\sum_{j=1}^K p_j(x) \cdot \pi_j}$$
Let's dissect each component:
| Component | Symbol | Name | How to Obtain |
|---|---|---|---|
| Posterior | $P(Y=k \mid X=x)$ | What we want | Compute via Bayes' theorem |
| Likelihood | $P(X=x \mid Y=k)$ | Class-conditional | Model and estimate from class $k$ samples |
| Prior | $P(Y=k)$ | Class prior | Estimate as class proportions |
| Evidence | $P(X=x)$ | Marginal | Sum over all class contributions |
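As a quick sanity check of the formula, here is a minimal numeric sketch with made-up priors and density values (the numbers are purely illustrative):

```python
import numpy as np

# Hypothetical two-class example at a single point x
priors = np.array([0.7, 0.3])         # pi_0, pi_1 (assumed)
likelihoods = np.array([0.05, 0.20])  # p_0(x), p_1(x) (assumed density values)

joint = priors * likelihoods          # pi_k * p_k(x)
evidence = joint.sum()                # p(x) = sum_j pi_j * p_j(x)
posterior = joint / evidence          # eta_k(x)

print(posterior)        # [0.368 0.632]
print(posterior.sum())  # 1.0
```

Even though class 1 has the smaller prior, its larger likelihood at $x$ gives it the larger posterior.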
The Key Insight: Inversion of Conditioning
Bayes' theorem "inverts" the direction of conditioning. We want $P(Y|X)$—the probability of the class given features. But it's often easier to model $P(X|Y)$—the probability of features given the class.
Why is $P(X|Y)$ easier to model? Because each class-conditional can be estimated separately, using only the samples from that class, and we can pick a familiar parametric family (such as a Gaussian) to describe how features arise within the class.
When we model $P(X|Y)$ and $P(Y)$, we're building a generative model—one that describes how data is generated. Given a class, we model how features arise. This is in contrast to discriminative models that directly model $P(Y|X)$ without specifying the feature distribution.
The generative approach computes posteriors by modeling the full joint distribution $P(X, Y) = P(X|Y) \cdot P(Y)$. The steps are:
Step 1: Estimate Priors $$\hat{\pi}_k = \frac{n_k}{n} = \frac{\text{number of class } k \text{ samples}}{\text{total samples}}$$
This is the maximum likelihood estimate of the class prior.
Step 2: Model Class-Conditionals
Choose a parametric family for $p_k(x; \theta_k)$ and estimate $\theta_k$ from the class $k$ samples (for example, a Gaussian with mean $\mu_k$ and covariance $\Sigma_k$).
Step 3: Compute Posteriors $$\hat{\eta}_k(x) = \frac{\hat{\pi}_k \cdot \hat{p}_k(x)}{\sum_j \hat{\pi}_j \cdot \hat{p}_j(x)}$$
```python
import numpy as np
from scipy.stats import multivariate_normal


class GenerativeClassifier:
    """
    A generative classifier that computes posteriors via Bayes' theorem.
    Models class-conditionals as multivariate Gaussians.
    """

    def __init__(self, shared_covariance: bool = True):
        """
        Args:
            shared_covariance: If True, use a shared covariance (LDA).
                               If False, use class-specific covariances (QDA).
        """
        self.shared_covariance = shared_covariance
        self.classes_ = None
        self.priors_ = None
        self.means_ = None
        self.covariances_ = None
        self.shared_cov_ = None

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Estimate class priors and class-conditional distributions."""
        self.classes_ = np.unique(y)
        n_classes = len(self.classes_)
        n_samples, n_features = X.shape

        # Step 1: Estimate priors
        self.priors_ = np.array([np.mean(y == c) for c in self.classes_])

        # Step 2: Estimate class-conditional Gaussians
        self.means_ = np.zeros((n_classes, n_features))

        if self.shared_covariance:
            # Pooled covariance estimate (LDA)
            self.shared_cov_ = np.zeros((n_features, n_features))
            for idx, c in enumerate(self.classes_):
                X_c = X[y == c]
                self.means_[idx] = X_c.mean(axis=0)
                self.shared_cov_ += (X_c - self.means_[idx]).T @ (X_c - self.means_[idx])
            self.shared_cov_ /= (n_samples - n_classes)
        else:
            # Class-specific covariances (QDA)
            self.covariances_ = []
            for idx, c in enumerate(self.classes_):
                X_c = X[y == c]
                self.means_[idx] = X_c.mean(axis=0)
                cov_c = np.cov(X_c.T) + 1e-6 * np.eye(n_features)  # Regularize
                self.covariances_.append(cov_c)

        return self

    def _log_likelihood(self, X: np.ndarray) -> np.ndarray:
        """Compute log p(X | Y=k) for each class."""
        n_samples = X.shape[0]
        n_classes = len(self.classes_)
        log_likelihoods = np.zeros((n_samples, n_classes))

        for idx in range(n_classes):
            if self.shared_covariance:
                cov = self.shared_cov_
            else:
                cov = self.covariances_[idx]
            mvn = multivariate_normal(mean=self.means_[idx], cov=cov)
            log_likelihoods[:, idx] = mvn.logpdf(X)

        return log_likelihoods

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """
        Compute posterior probabilities P(Y=k | X=x) for each class.

        Uses Bayes' theorem: P(Y|X) = P(X|Y) * P(Y) / P(X)
        """
        log_likelihoods = self._log_likelihood(X)  # log P(X|Y)
        log_priors = np.log(self.priors_)          # log P(Y)

        # log P(X,Y) = log P(X|Y) + log P(Y)
        log_joint = log_likelihoods + log_priors

        # Normalize via log-sum-exp for numerical stability
        # P(Y|X) = P(X,Y) / P(X) where P(X) = sum_k P(X,Y=k)
        log_evidence = self._logsumexp(log_joint, axis=1, keepdims=True)
        log_posterior = log_joint - log_evidence

        return np.exp(log_posterior)

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict class labels using the MAP decision rule."""
        posteriors = self.predict_proba(X)
        return self.classes_[np.argmax(posteriors, axis=1)]

    @staticmethod
    def _logsumexp(a: np.ndarray, axis: int = None, keepdims: bool = False):
        """Numerically stable log-sum-exp computation."""
        a_max = np.max(a, axis=axis, keepdims=True)
        result = a_max + np.log(np.sum(np.exp(a - a_max), axis=axis, keepdims=True))
        if not keepdims:
            result = np.squeeze(result, axis=axis)
        return result


# Example usage
if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    # Generate synthetic data
    X, y = make_classification(n_samples=300, n_features=2, n_redundant=0,
                               n_informative=2, n_clusters_per_class=1,
                               random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

    # Fit and predict
    clf = GenerativeClassifier(shared_covariance=True)
    clf.fit(X_train, y_train)

    # Get posteriors for a few test points
    posteriors = clf.predict_proba(X_test[:5])
    predictions = clf.predict(X_test[:5])

    print("Posterior probabilities (first 5 test points):")
    for i, (post, pred, true) in enumerate(zip(posteriors, predictions, y_test[:5])):
        print(f"  Point {i}: P(Y=0)={post[0]:.3f}, P(Y=1)={post[1]:.3f} -> Pred={pred}, True={true}")
```

The generative classifier implements the complete Bayes' theorem pipeline: estimate priors, model class-conditionals, then compute posteriors by normalization. This approach gives us full probabilistic predictions, not just class labels.
Computing posteriors via Bayes' theorem involves products and ratios of probabilities, which can lead to serious numerical issues.
Problem 1: Underflow
Probability densities for high-dimensional data can be astronomically small. With $d = 1000$ features each contributing a factor of roughly $0.1$: $$p(x) = \prod_{j=1}^d p(x_j) \approx (0.1)^{1000} = 10^{-1000}$$
This is far below the smallest representable float (~$10^{-308}$). The product underflows to zero.
Solution: Work in Log Space $$\log p(x) = \sum_{j=1}^d \log p(x_j)$$
Sums of logs don't underflow. We do all arithmetic in log space and exponentiate only at the end.
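A tiny sketch of the difference (the density values are simulated and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# 1000 per-feature density values scattered around 0.1 (illustrative only)
densities = 0.1 * np.exp(0.1 * rng.standard_normal(1000))

naive_product = np.prod(densities)       # underflows to exactly 0.0
log_density = np.sum(np.log(densities))  # finite, roughly -2300

print(naive_product)  # 0.0
print(log_density)    # e.g. -2300.1
```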
Problem 2: Normalization in Log Space
To compute posteriors, we need to divide by the evidence: $$\eta_k(x) = \frac{\pi_k \cdot p_k(x)}{\sum_j \pi_j \cdot p_j(x)}$$
In log space, this requires computing: $$\log \eta_k(x) = \log(\pi_k \cdot p_k(x)) - \log\left(\sum_j \pi_j \cdot p_j(x)\right)$$
The difficulty is computing $\log(\sum_j e^{a_j})$ where $a_j = \log(\pi_j \cdot p_j(x))$.
Solution: The Log-Sum-Exp Trick $$\log\left(\sum_j e^{a_j}\right) = a_{\max} + \log\left(\sum_j e^{a_j - a_{\max}}\right)$$
By subtracting the maximum, the largest exponent becomes $0$, so no term overflows; and since that term equals $e^0 = 1$, the sum is at least 1 and cannot underflow to zero before we take the log.
```python
import numpy as np


def log_sum_exp(log_values: np.ndarray, axis: int = None) -> np.ndarray:
    """
    Compute log(sum(exp(log_values))) in a numerically stable way.

    Key insight: log(sum(exp(a))) = max(a) + log(sum(exp(a - max(a))))

    This prevents:
    - Overflow: exp(large) -> inf
    - Underflow: exp(very negative) -> 0, losing precision

    Args:
        log_values: Array of log values
        axis: Axis along which to sum

    Returns:
        log(sum(exp(log_values)))
    """
    max_val = np.max(log_values, axis=axis, keepdims=True)
    # Subtract max, exponentiate, sum, log, add max back
    stable_sum = max_val + np.log(
        np.sum(np.exp(log_values - max_val), axis=axis, keepdims=True)
    )
    return np.squeeze(stable_sum, axis=axis)


def compute_log_posterior(
    log_likelihoods: np.ndarray,  # Shape: (n_samples, n_classes)
    log_priors: np.ndarray        # Shape: (n_classes,)
) -> np.ndarray:
    """
    Compute log posterior probabilities stably.

    log P(Y=k|X) = log P(X|Y=k) + log P(Y=k) - log P(X)
    where log P(X) = log_sum_exp over classes

    Args:
        log_likelihoods: log P(X|Y=k) for each sample and class
        log_priors: log P(Y=k) for each class

    Returns:
        log P(Y=k|X) for each sample and class
    """
    # log P(X, Y=k) = log P(X|Y=k) + log P(Y=k)
    log_joint = log_likelihoods + log_priors  # Broadcasting over samples

    # log P(X) = log sum_k P(X, Y=k)
    log_evidence = log_sum_exp(log_joint, axis=1)

    # log P(Y=k|X) = log P(X, Y=k) - log P(X)
    log_posterior = log_joint - log_evidence[:, np.newaxis]

    return log_posterior


# Demonstration: Why stability matters
print("=== Numerical Stability Demonstration ===\n")

# High-dimensional scenario where naive computation fails:
# log-likelihoods around -1000 (e.g. ~1000 features, each contributing ~log(0.1))
n_classes = 3
np.random.seed(42)
log_likelihoods = -1000 + np.random.randn(1, n_classes) * 10
log_priors = np.log([0.3, 0.3, 0.4])

print(f"Log-likelihoods: {log_likelihoods[0]}")
print(f"Log-priors: {log_priors}")

# Naive computation fails silently: exp(-1000) underflows to 0.0,
# so normalization divides 0 by 0 and produces NaNs
with np.errstate(invalid="ignore", divide="ignore"):
    naive_unnormalized = np.exp(log_likelihoods) * np.exp(log_priors)
    naive_posterior = naive_unnormalized / naive_unnormalized.sum()
print(f"\nNaive unnormalized joint: {naive_unnormalized[0]}")  # all zeros (underflow)
print(f"Naive posterior: {naive_posterior[0]}")                # all nan

# Stable computation
log_posterior = compute_log_posterior(log_likelihoods, log_priors)
posterior = np.exp(log_posterior)
print(f"\nStable posterior: {posterior[0]}")
print(f"Sum (should be 1.0): {posterior.sum():.6f}")
```

For any serious implementation of probabilistic classifiers, work in log space throughout. Convert to probabilities only at the final output if needed. This single practice prevents an entire class of numerical bugs.
The denominator $P(X = x)$ in Bayes' theorem is called the evidence (or marginal likelihood). It plays several important roles.
Role 1: Normalization Constant
For classification with a fixed $x$, the evidence is the same for all classes. It ensures posteriors sum to 1: $$\sum_k \eta_k(x) = \sum_k \frac{\pi_k \cdot p_k(x)}{p(x)} = \frac{p(x)}{p(x)} = 1$$
For MAP classification, we can ignore $p(x)$ because: $$\arg\max_k \eta_k(x) = \arg\max_k \frac{\pi_k \cdot p_k(x)}{p(x)} = \arg\max_k \pi_k \cdot p_k(x)$$
The argmax is unchanged by the constant divisor.
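A minimal sketch of this invariance, using made-up log-likelihoods:

```python
import numpy as np

log_priors = np.log([0.5, 0.3, 0.2])
log_likelihoods = np.array([-420.0, -418.5, -425.0])  # illustrative log p_k(x)

log_joint = log_likelihoods + log_priors  # unnormalized log posterior
print(np.argmax(log_joint))               # 1 -- MAP decision needs no evidence term

# Normalizing changes the values but never the argmax
log_posterior = log_joint - np.logaddexp.reduce(log_joint)
print(np.argmax(log_posterior))           # still 1
```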
Role 2: Model Selection
When comparing different models $M_1, M_2, \ldots$ for the class-conditionals, the evidence becomes crucial: $$P(M_i | X) \propto P(X | M_i) \cdot P(M_i)$$
Here $P(X | M_i) = \int P(X | \theta, M_i) P(\theta | M_i) d\theta$ is the model evidence—how well the model explains the data, averaged over parameter uncertainty.
Higher evidence → Model fits data well without overfitting.
Role 3: Outlier Detection
The evidence $p(x) = \sum_k \pi_k \cdot p_k(x)$ measures how likely $x$ is under all classes combined. Very low $p(x)$ suggests $x$ is an outlier—unlike any training data.
$$\text{Outlier score}(x) = -\log p(x)$$
High score → $x$ is unusual; predictions may be unreliable.
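A sketch of how this could look in code, assuming a `GenerativeClassifier` fitted as in the earlier example (the `outlier_score` helper and the 99th-percentile threshold are illustrative choices):

```python
import numpy as np

def outlier_score(clf, X):
    """-log p(x) computed from a fitted GenerativeClassifier (earlier sketch)."""
    log_joint = clf._log_likelihood(X) + np.log(clf.priors_)  # log pi_k + log p_k(x)
    log_evidence = clf._logsumexp(log_joint, axis=1)          # log p(x)
    return -log_evidence                                      # high => unusual input

# Usage (assuming clf and X_test from the earlier example):
# scores = outlier_score(clf, X_test)
# flagged = X_test[scores > np.quantile(scores, 0.99)]
```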
When do you actually need the evidence?

| Task | Do you need $p(x)$? |
|---|---|
| Classification (MAP) | No; work with unnormalized posteriors |
| Probability calibration | Yes; compute it for proper posteriors |
| Model comparison | Essential; it's the key quantity |
| Outlier detection | Very useful; low evidence signals unusual inputs |
There are two fundamental approaches to computing posteriors for classification:
Generative Approach: model the prior $P(Y)$ and the class-conditionals $P(X|Y)$, then obtain the posterior through Bayes' theorem (e.g., LDA/QDA, Naive Bayes).
Discriminative Approach: model the posterior $P(Y|X)$ directly, without specifying how the features are distributed (e.g., logistic regression, neural network classifiers).
Both approaches ultimately target the same quantity—the posterior—but via different routes.
The Ng-Jordan Analysis (2001)
A famous paper by Andrew Ng and Michael Jordan compared logistic regression with Naive Bayes and found a trade-off: the generative model approaches its asymptotic error with far fewer training samples, while the discriminative model attains a lower asymptotic error when data is plentiful.
The intuition: Generative models make stronger assumptions. If correct, these assumptions extract more information from limited data. If wrong, they introduce bias that discriminative models avoid.
Practical Guidance: with limited training data, or when you also need to handle missing features, detect outliers, or generate samples, the generative route is attractive; with ample data and pure predictive accuracy as the goal, discriminative models usually win.
For generative classifiers, we need to estimate priors and class-conditional distributions from training data.
Estimating Priors:
The maximum likelihood estimate: $$\hat{\pi}_k = \frac{n_k}{n}$$
With a Dirichlet prior (Bayesian estimate with pseudo-counts $\alpha_k$): $$\hat{\pi}_k = \frac{n_k + \alpha_k}{n + \sum_j \alpha_j}$$
Uniform prior ($\alpha_k = 1$) is Laplace smoothing: $$\hat{\pi}_k = \frac{n_k + 1}{n + K}$$
This prevents zero probabilities when a class has few or no training samples.
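A small sketch of both estimators (the `estimate_priors` helper is illustrative):

```python
import numpy as np

def estimate_priors(y, classes, alpha=1.0):
    """Smoothed class priors: (n_k + alpha) / (n + K * alpha); alpha=0 gives the MLE."""
    counts = np.array([np.sum(y == c) for c in classes], dtype=float)
    return (counts + alpha) / (counts.sum() + alpha * len(classes))

y = np.array([0, 0, 0, 1, 1, 2])                 # class 2 is rare
print(estimate_priors(y, [0, 1, 2], alpha=0.0))  # MLE:      [0.5   0.333 0.167]
print(estimate_priors(y, [0, 1, 2], alpha=1.0))  # smoothed: [0.444 0.333 0.222]
```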
Estimating Class-Conditionals:
The choice of model for $p_k(x)$ is critical:
1. Gaussian (LDA/QDA) $$\hat{\mu}_k = \frac{1}{n_k} \sum_{i: y_i = k} x_i, \qquad \hat{\Sigma}_k = \frac{1}{n_k - 1} \sum_{i: y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T$$
For LDA, pool covariances: $\hat{\Sigma} = \frac{1}{n-K} \sum_k \sum_{i: y_i=k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T$
2. Multinomial (for counts) $$\hat{\theta}_{kj} = \frac{\sum_{i: y_i = k} x_{ij} + \alpha}{\sum_{l=1}^{d} \sum_{i: y_i = k} x_{il} + d\alpha}$$
where $\alpha$ is the smoothing parameter.
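A minimal sketch of this estimator on a toy word-count matrix (the helper name and data are illustrative):

```python
import numpy as np

def estimate_multinomial_params(X, y, classes, alpha=1.0):
    """Smoothed per-class feature probabilities theta_{kj} for count data X."""
    d = X.shape[1]
    theta = np.zeros((len(classes), d))
    for k, c in enumerate(classes):
        counts = X[y == c].sum(axis=0)                # total count of feature j in class c
        theta[k] = (counts + alpha) / (counts.sum() + d * alpha)
    return theta

# Toy word-count matrix: 4 documents, 3 vocabulary terms
X = np.array([[2, 0, 1],
              [3, 1, 0],
              [0, 2, 2],
              [0, 1, 3]])
y = np.array([0, 0, 1, 1])
print(estimate_multinomial_params(X, y, classes=[0, 1]))  # each row sums to 1
```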
3. Kernel Density Estimation (non-parametric) $$\hat{p}_k(x) = \frac{1}{n_k h^d} \sum_{i: y_i = k} K\left(\frac{x - x_i}{h}\right)$$
Flexible but computationally expensive and suffers in high dimensions.
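A sketch using SciPy's `gaussian_kde` as the density estimator (the helper functions are illustrative; `gaussian_kde` picks its bandwidth automatically and needs each class to have more samples than features):

```python
import numpy as np
from scipy.stats import gaussian_kde

def fit_kde_class_conditionals(X, y, classes):
    """One KDE per class; gaussian_kde expects data with shape (d, n_k)."""
    return {c: gaussian_kde(X[y == c].T) for c in classes}

def kde_log_posterior(kdes, priors, X, classes):
    """log P(Y=k | x) from KDE class-conditionals and class priors."""
    log_joint = np.column_stack(
        [kdes[c].logpdf(X.T) + np.log(priors[i]) for i, c in enumerate(classes)]
    )
    log_evidence = np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
    return log_joint - log_evidence
```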
| Model | Assumption | Parameters | Best For |
|---|---|---|---|
| Gaussian (shared cov) | Features normal, same spread | $O(d^2)$ | Continuous, few features |
| Gaussian (per-class) | Features normal, different spread | $O(Kd^2)$ | When class shapes differ |
| Multinomial | Counts from a multinomial distribution | $O(Kd)$ | Text, discrete data |
| Bernoulli | Binary features | $O(Kd)$ | Binary attributes |
| Kernel Density | Smooth but otherwise unknown density | Stores all training data | Low-dimensional, non-Gaussian data |
Building a robust posterior computation pipeline requires attention to several practical details.
1. Handling Zero Probabilities
If $p_k(x) = 0$ for some class $k$, the posterior is zero regardless of the prior. This can happen with discrete models when a feature value never appears in class $k$ during training, or with overly rigid distributional assumptions that assign no mass to parts of the feature space.
Solution: Smoothing (add pseudo-counts) or use more flexible distributions.
2. Prior Adjustment for Class Imbalance
Training class proportions may not match deployment. If training is balanced but deployment is 99/1, replace the estimated priors with the deployment proportions: because the posterior factors into likelihood times prior, only the prior term needs to change, followed by re-normalization.
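A sketch of this adjustment (the `adjust_priors` helper and the numbers are illustrative):

```python
import numpy as np

def adjust_priors(log_posterior_train, train_priors, deploy_priors):
    """Re-weight posteriors from training priors to deployment priors, then re-normalize."""
    log_adj = log_posterior_train + np.log(deploy_priors) - np.log(train_priors)
    log_adj -= np.logaddexp.reduce(log_adj, axis=1, keepdims=True)
    return np.exp(log_adj)

# Posterior [0.4, 0.6] under balanced training priors, deployed where class 1 is rare
log_post = np.log(np.array([[0.4, 0.6]]))
print(adjust_priors(log_post, np.array([0.5, 0.5]), np.array([0.99, 0.01])))
# -> [[0.985 0.015]]: the rare class now needs much stronger evidence
```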
3. Singular Covariance Matrices
With high-dimensional Gaussian models, $\hat{\Sigma}$ may be singular (non-invertible) when $n_k < d$. Remedies include adding a ridge term $\hat{\Sigma} + \lambda I$, using a diagonal or shrinkage estimator, or reducing dimensionality before fitting.
4. Probability Calibration
Posteriors from generative models may be poorly calibrated (predicted probabilities don't match observed frequencies). Apply calibration techniques such as Platt scaling or isotonic regression, fit on a held-out validation set.
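A minimal Platt-scaling sketch for the binary case, assuming held-out scores from the model (e.g., the log-odds $\log P(Y=1 \mid x) - \log P(Y=0 \mid x)$); the helper and variable names are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def platt_calibrate(val_scores, y_val):
    """Fit a 1-D logistic regression mapping raw scores to calibrated probabilities."""
    calibrator = LogisticRegression()
    calibrator.fit(val_scores.reshape(-1, 1), y_val)
    return calibrator

# Usage (hypothetical arrays):
# calibrator = platt_calibrate(val_scores, y_val)
# calibrated_p1 = calibrator.predict_proba(test_scores.reshape(-1, 1))[:, 1]
```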
```python
import numpy as np
from typing import Optional


class RobustGaussianClassifier:
    """
    Gaussian classifier with practical robustness features:
    - Covariance regularization to prevent singularity
    - Prior adjustment for class imbalance
    - Smoothing to handle rare classes
    """

    def __init__(
        self,
        reg_param: float = 1e-6,
        prior_smoothing: float = 1.0,
        user_priors: Optional[np.ndarray] = None
    ):
        self.reg_param = reg_param
        self.prior_smoothing = prior_smoothing
        self.user_priors = user_priors

    def fit(self, X: np.ndarray, y: np.ndarray):
        self.classes_ = np.unique(y)
        n_classes = len(self.classes_)
        n_samples, n_features = X.shape

        # Smoothed prior estimation
        if self.user_priors is not None:
            self.log_priors_ = np.log(self.user_priors)
        else:
            # Laplace-smoothed priors
            class_counts = np.array([np.sum(y == c) for c in self.classes_])
            smoothed_counts = class_counts + self.prior_smoothing
            self.log_priors_ = np.log(smoothed_counts / smoothed_counts.sum())

        # Fit Gaussians with regularization
        self.means_ = []
        self.precisions_ = []      # Store inverse covariance for efficiency
        self.log_det_covs_ = []

        for c in self.classes_:
            X_c = X[y == c]
            mean = X_c.mean(axis=0)

            # Sample covariance with regularization
            cov = np.cov(X_c.T) if X_c.shape[0] > 1 else np.eye(n_features)
            cov += self.reg_param * np.eye(n_features)  # Regularize

            # Compute precision and log determinant
            try:
                precision = np.linalg.inv(cov)
                sign, log_det = np.linalg.slogdet(cov)
                log_det_cov = log_det if sign > 0 else np.inf
            except np.linalg.LinAlgError:
                # Fallback: use a diagonal approximation
                cov_diag = np.var(X_c, axis=0) + self.reg_param
                precision = np.diag(1.0 / cov_diag)
                log_det_cov = np.sum(np.log(cov_diag))

            self.means_.append(mean)
            self.precisions_.append(precision)
            self.log_det_covs_.append(log_det_cov)

        return self

    def _log_likelihood(self, X: np.ndarray) -> np.ndarray:
        """Compute log p(x | y=k) for each class."""
        n_samples = X.shape[0]
        n_classes = len(self.classes_)
        d = X.shape[1]
        log_likelihoods = np.zeros((n_samples, n_classes))

        for k in range(n_classes):
            diff = X - self.means_[k]
            # Mahalanobis distance: (x - μ)ᵀΣ⁻¹(x - μ)
            mahal = np.sum((diff @ self.precisions_[k]) * diff, axis=1)
            # Log Gaussian: -0.5 * (d*log(2π) + log|Σ| + mahal)
            log_likelihoods[:, k] = -0.5 * (
                d * np.log(2 * np.pi) + self.log_det_covs_[k] + mahal
            )

        return log_likelihoods

    def predict_log_proba(self, X: np.ndarray) -> np.ndarray:
        """Return log posteriors (more numerically stable)."""
        log_likelihood = self._log_likelihood(X)
        log_joint = log_likelihood + self.log_priors_
        log_evidence = self._logsumexp(log_joint, axis=1, keepdims=True)
        return log_joint - log_evidence

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """Return posterior probabilities."""
        return np.exp(self.predict_log_proba(X))

    @staticmethod
    def _logsumexp(a, axis=None, keepdims=False):
        a_max = np.max(a, axis=axis, keepdims=True)
        result = a_max + np.log(np.sum(np.exp(a - a_max), axis=axis, keepdims=True))
        return result if keepdims else np.squeeze(result, axis=axis)
```

We've now covered the essential techniques for computing posterior probabilities: apply Bayes' theorem, work in log space for numerical stability, and estimate priors and class-conditionals with enough regularization and smoothing to stay robust.
What's Next:
We've seen how to compute posteriors when we can model class-conditional distributions. But this is challenging in high-dimensional spaces—the curse of dimensionality makes density estimation hard. In the next page, we'll explore the practical challenges that make the Bayes classifier difficult to implement directly, setting the stage for Naive Bayes as a practical approximation.
You now understand how to compute posterior probabilities via Bayes' theorem, the numerical techniques for stable computation, and the generative approach to classification. These foundations prepare you for understanding both the promise and the practical challenges of Bayesian classification.