Imagine you're a wildlife biologist studying two species of birds: robins and blue jays. To classify a new bird, you could take two fundamentally different approaches:
Approach 1: Study everything about each species—their size distributions, color patterns, beak shapes, migration habits, preferred habitats. You become an expert on what makes a robin a robin and what makes a blue jay a blue jay. When a new bird appears, you ask: "Which species is more likely to have produced a bird like this?"
Approach 2: Simply collect examples of correctly labeled birds and find the boundary that best separates them. You never deeply understand either species—you just learn where to draw the line.
This fundamental choice defines one of the deepest divides in machine learning: generative versus discriminative modeling. In this page, we dive deep into the first approach—generative models—and develop the mathematical machinery that makes them work.
By the end of this page, you will understand: (1) The mathematical formulation of generative classifiers, (2) How Bayes' theorem transforms learned distributions into classification decisions, (3) The modeling assumptions required and their implications, (4) The key insight that generative models capture the data generation process, and (5) Why this approach provides unique capabilities like handling missing data and generating synthetic samples.
At its core, a generative model attempts to understand and model the process by which data is generated. Rather than asking "What label should this data point have?", a generative model asks "How did this data point come to exist in the first place?"
This philosophical difference has profound practical implications. Let's formalize this intuition mathematically.
Think of generative models as telling a story about data creation. For classification: first, nature "picks" a class (e.g., spam or not spam), then it generates features according to that class's characteristic patterns. Generative models attempt to reverse-engineer this story.
A generative classifier models the joint probability distribution $P(X, Y)$ over features $X$ and labels $Y$. This joint distribution captures everything about the relationship between inputs and outputs—how features and labels co-occur in the data.
The joint distribution can be factored in two equivalent ways:
$$P(X, Y) = P(X|Y) \cdot P(Y) = P(Y|X) \cdot P(X)$$
Generative models leverage the first factorization: they model the class-conditional distribution $P(X|Y)$ (how features are distributed within each class) and the prior $P(Y)$ (how common each class is).
Once we have these two components, we can compute anything we want—including the posterior $P(Y|X)$ needed for classification—via Bayes' theorem.
| Component | Notation | Meaning | How It's Obtained |
|---|---|---|---|
| Class Prior | $P(Y = k)$ | How common is class $k$ in the population? | Estimated from training data frequencies |
| Class-Conditional | $P(X \mid Y = k)$ | What do examples from class $k$ look like? | Modeled explicitly (e.g., Gaussian, multinomial) |
| Evidence | $P(X)$ | How likely is this feature vector overall? | Computed as $\sum_k P(X \mid Y=k) P(Y=k)$ |
| Posterior | $P(Y = k \mid X)$ | Given features $X$, what's the probability of class $k$? | Derived via Bayes' theorem |
The mathematical heart of generative classification is Bayes' theorem, which allows us to invert the conditional probability:
$$P(Y = k | X) = \frac{P(X | Y = k) \cdot P(Y = k)}{P(X)}$$
Let's carefully dissect each component and understand its role in classification.
Since $P(X)$ is constant across all classes, for classification we only need to compute the unnormalized posterior: $P(Y = k | X) \propto P(X | Y = k) \cdot P(Y = k)$. We choose the class that maximizes this product. This avoids computing the potentially expensive normalization constant.
For a $K$-class classification problem, the generative classifier assigns a new input $X$ to the class that maximizes the posterior probability:
$$\hat{y} = \arg\max_{k \in \{1, \ldots, K\}} P(Y = k | X) = \arg\max_{k} P(X | Y = k) \cdot P(Y = k)$$
This is known as the Maximum A Posteriori (MAP) decision rule. It naturally incorporates prior knowledge—if one class is much more common, we need stronger evidence to classify into the rarer class.
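To make the decision rule concrete, here is a tiny numeric sketch; the likelihood and prior values are made up purely for illustration:

```python
import numpy as np

# Hypothetical values for a 3-class problem (illustrative only):
# class-conditional likelihoods P(x | Y=k) evaluated at one input x
likelihoods = np.array([0.02, 0.10, 0.05])
# class priors P(Y=k)
priors = np.array([0.70, 0.20, 0.10])

# Unnormalized posteriors: P(x | Y=k) * P(Y=k)
unnormalized = likelihoods * priors          # [0.014, 0.020, 0.005]

# MAP decision: argmax over unnormalized posteriors (P(x) cancels out)
y_hat = np.argmax(unnormalized)              # class index 1

# Dividing by P(x) = sum_k P(x|Y=k) P(Y=k) recovers proper probabilities
posteriors = unnormalized / unnormalized.sum()
print(y_hat, posteriors)
```

The full classifier below packages this same computation, likelihood times prior followed by normalization in log space, behind a reusable interface.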
```python
import numpy as np
from typing import Dict, List


class GenerativeClassifier:
    """
    Abstract base class for generative classifiers.

    Generative classifiers model P(X|Y) and P(Y), then use Bayes'
    theorem to compute P(Y|X) for classification.
    """

    def __init__(self, classes: List[int]):
        self.classes = classes
        self.priors: Dict[int, float] = {}

    def fit(self, X: np.ndarray, y: np.ndarray) -> 'GenerativeClassifier':
        """
        Learn the model parameters from training data.

        1. Estimate class priors P(Y=k) from class frequencies
        2. Estimate class-conditional distributions P(X|Y=k)
        """
        n_samples = len(y)

        # Step 1: Estimate priors from class frequencies
        for k in self.classes:
            n_k = np.sum(y == k)
            self.priors[k] = n_k / n_samples

        # Step 2: Estimate class-conditional distributions
        self._fit_class_conditionals(X, y)

        return self

    def _fit_class_conditionals(self, X: np.ndarray, y: np.ndarray) -> None:
        """Subclasses implement this to model P(X|Y=k)."""
        raise NotImplementedError

    def _log_likelihood(self, X: np.ndarray, k: int) -> np.ndarray:
        """Compute log P(X|Y=k) for all samples. Subclasses implement this."""
        raise NotImplementedError

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """
        Compute posterior probabilities P(Y=k|X) for all classes.

        Uses Bayes' theorem with log-space computation for numerical stability.
        """
        n_samples = X.shape[0]
        n_classes = len(self.classes)

        # Compute log unnormalized posteriors: log P(X|Y=k) + log P(Y=k)
        log_posteriors = np.zeros((n_samples, n_classes))
        for idx, k in enumerate(self.classes):
            log_likelihood = self._log_likelihood(X, k)  # log P(X|Y=k)
            log_prior = np.log(self.priors[k])           # log P(Y=k)
            log_posteriors[:, idx] = log_likelihood + log_prior

        # Normalize to get proper probabilities using log-sum-exp trick
        log_normalizer = self._logsumexp(log_posteriors, axis=1, keepdims=True)
        log_probs = log_posteriors - log_normalizer
        return np.exp(log_probs)

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict class labels using MAP decision rule."""
        posteriors = self.predict_proba(X)
        class_indices = np.argmax(posteriors, axis=1)
        return np.array([self.classes[idx] for idx in class_indices])

    @staticmethod
    def _logsumexp(a: np.ndarray, axis: int = None, keepdims: bool = False) -> np.ndarray:
        """Numerically stable log-sum-exp for probability normalization."""
        a_max = np.max(a, axis=axis, keepdims=True)
        result = a_max + np.log(np.sum(np.exp(a - a_max), axis=axis, keepdims=True))
        if not keepdims:
            result = np.squeeze(result, axis=axis)
        return result
```

The power and challenge of generative models lie in modeling $P(X | Y = k)$—the distribution of features within each class. This is where modeling assumptions become critical.
In general, if $X$ is a $d$-dimensional feature vector, $P(X | Y = k)$ is a $d$-dimensional probability distribution. Without restrictions, this distribution could be arbitrarily complex, requiring exponentially many parameters to specify.
Consider a simple case: each feature is binary, and we have $d = 20$ features. The full joint distribution $P(X | Y = k)$ over all possible feature combinations would have $2^{20} - 1 \approx 1$ million free parameters per class. This is practically impossible to estimate reliably from typical training set sizes.
This fundamental problem leads to the core insight of tractable generative modeling:
There's no free lunch in generative modeling. We must impose structural assumptions on P(X|Y) to make estimation tractable. These assumptions reduce model complexity but may not match reality. The art lies in choosing assumptions that balance expressiveness with learnability.
Different generative classifiers make different assumptions about $P(X|Y=k)$:
| Model | Assumption about $P(X \mid Y=k)$ | Parameters per Class | Best For |
|---|---|---|---|
| Naive Bayes (Gaussian) | Features independent, each Gaussian: $\prod_j N(x_j; \mu_{kj}, \sigma_{kj}^2)$ | $2d$ (means + variances) | Continuous data, quick baseline |
| Naive Bayes (Multinomial) | Multinomial over word counts | $V$ (vocabulary size) | Text classification |
| LDA (Linear Discriminant) | Multivariate Gaussian with shared covariance | $d + d(d+1)/2$ shared | Linear boundaries, limited data |
| QDA (Quadratic Discriminant) | Multivariate Gaussian with class-specific covariance | $d + d(d+1)/2$ per class | Quadratic boundaries, more data |
| Gaussian Mixture Model | Mixture of Gaussians within each class | $(d + d(d+1)/2 + 1) \times M$ where $M$ = components | Multi-modal class distributions |
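To make the tractability argument in the table concrete, here is a small back-of-the-envelope comparison, a sketch using the 20-binary-feature example from above:

```python
d = 20  # number of binary features

# Unrestricted joint distribution over all feature combinations:
# one probability per combination, minus one for the sum-to-one constraint
full_joint_params = 2 ** d - 1      # 1,048,575 per class

# Naive Bayes: one Bernoulli parameter P(x_j = 1 | Y = k) per feature
naive_bayes_params = d              # 20 per class

print(full_joint_params, naive_bayes_params)
```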
The most fundamental continuous generative model assumes that features within each class follow a multivariate Gaussian distribution:
$$P(X | Y = k) = \mathcal{N}(X; \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}(X - \mu_k)^T \Sigma_k^{-1} (X - \mu_k)\right)$$
Where:
- $\mu_k \in \mathbb{R}^d$ is the mean vector of class $k$,
- $\Sigma_k$ is the $d \times d$ covariance matrix of class $k$ (with determinant $|\Sigma_k|$), and
- $d$ is the number of features.
The exponent $(X - \mu_k)^T \Sigma_k^{-1} (X - \mu_k)$ is the Mahalanobis distance squared—a measure of how far $X$ is from the class center, accounting for the shape of the distribution.
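As a small sketch of how this quadratic form is typically evaluated in NumPy for many points at once (the parameter values below are assumed for illustration):

```python
import numpy as np

# Illustrative class parameters (assumed values)
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

X = np.array([[1.0, 1.0],
              [3.0, -1.0]])           # two query points

precision = np.linalg.inv(Sigma)      # Σ⁻¹
diff = X - mu
# Row-wise quadratic form (x - μ)ᵀ Σ⁻¹ (x - μ) for every sample at once
mahalanobis_sq = np.sum(diff @ precision * diff, axis=1)
print(mahalanobis_sq)
```

The same vectorized trick appears in the `_log_likelihood` method of the full implementation below.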
```python
import numpy as np
from typing import Dict, List


class GaussianGenerativeClassifier:
    """
    Generative classifier with multivariate Gaussian class-conditionals.

    This implements the full QDA model where each class has its own
    mean vector and covariance matrix.
    """

    def __init__(self, classes: List[int], reg_param: float = 1e-6):
        """
        Args:
            classes: List of class labels
            reg_param: Regularization added to covariance diagonal for stability
        """
        self.classes = classes
        self.reg_param = reg_param
        self.priors: Dict[int, float] = {}
        self.means: Dict[int, np.ndarray] = {}
        self.covariances: Dict[int, np.ndarray] = {}

    def fit(self, X: np.ndarray, y: np.ndarray) -> 'GaussianGenerativeClassifier':
        """
        Estimate model parameters via Maximum Likelihood.

        For Gaussian models:
        - Mean MLE: sample mean of class examples
        - Covariance MLE: sample covariance of class examples
        """
        n_samples, n_features = X.shape

        for k in self.classes:
            # Extract samples belonging to class k
            X_k = X[y == k]
            n_k = len(X_k)

            # Prior: relative frequency
            self.priors[k] = n_k / n_samples

            # Mean: MLE is sample mean
            self.means[k] = np.mean(X_k, axis=0)

            # Covariance: MLE is sample covariance (with regularization)
            X_centered = X_k - self.means[k]
            self.covariances[k] = (X_centered.T @ X_centered) / n_k
            # Add regularization to diagonal for numerical stability
            self.covariances[k] += self.reg_param * np.eye(n_features)

        return self

    def _log_likelihood(self, X: np.ndarray, k: int) -> np.ndarray:
        """
        Compute log P(X|Y=k) for multivariate Gaussian.

        log N(x; μ, Σ) = -d/2 log(2π) - 1/2 log|Σ| - 1/2 (x-μ)ᵀΣ⁻¹(x-μ)
        """
        n_samples, n_features = X.shape
        mean = self.means[k]
        cov = self.covariances[k]

        # Compute log determinant
        sign, logdet = np.linalg.slogdet(cov)

        # Compute precision matrix (inverse of covariance)
        precision = np.linalg.inv(cov)

        # Compute Mahalanobis distances
        X_centered = X - mean
        mahalanobis_sq = np.sum(X_centered @ precision * X_centered, axis=1)

        # Log-likelihood
        log_likelihood = (
            -0.5 * n_features * np.log(2 * np.pi)
            - 0.5 * logdet
            - 0.5 * mahalanobis_sq
        )
        return log_likelihood

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """Compute posterior probabilities via Bayes' theorem."""
        n_samples = X.shape[0]
        n_classes = len(self.classes)

        log_posteriors = np.zeros((n_samples, n_classes))
        for idx, k in enumerate(self.classes):
            log_likelihood = self._log_likelihood(X, k)
            log_prior = np.log(self.priors[k])
            log_posteriors[:, idx] = log_likelihood + log_prior

        # Normalize using log-sum-exp
        log_normalizer = np.max(log_posteriors, axis=1, keepdims=True)
        log_posteriors_shifted = log_posteriors - log_normalizer
        posteriors = np.exp(log_posteriors_shifted)
        posteriors /= posteriors.sum(axis=1, keepdims=True)
        return posteriors

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict class labels using MAP decision rule."""
        posteriors = self.predict_proba(X)
        class_indices = np.argmax(posteriors, axis=1)
        return np.array([self.classes[idx] for idx in class_indices])

    def sample(self, n_samples: int, class_label: int = None) -> np.ndarray:
        """
        Generate synthetic samples from the learned model.

        This is a unique capability of generative models!
        """
        if class_label is not None:
            # Sample from specific class
            mean = self.means[class_label]
            cov = self.covariances[class_label]
            return np.random.multivariate_normal(mean, cov, n_samples)
        else:
            # Sample according to prior class distribution
            samples = []
            for k in self.classes:
                n_k = int(n_samples * self.priors[k])
                if n_k > 0:
                    class_samples = np.random.multivariate_normal(
                        self.means[k], self.covariances[k], n_k
                    )
                    samples.append(class_samples)
            return np.vstack(samples) if samples else np.array([])
```

One of the most elegant aspects of generative models is that they tell a story about how data comes to exist. This story has a specific structure that mirrors our factorization $P(X,Y) = P(X|Y) \cdot P(Y)$:
First, nature "chooses" a class according to the prior $P(Y)$. In our spam filter example, each incoming email has some baseline probability of being spam.
Then, nature "generates" features according to $P(X|Y)$. Given that an email is spam, it will contain certain words with certain frequencies; given it's ham (not spam), different word patterns emerge.
We observe only the features $X$, and must infer the hidden class $Y$ that produced them.
This is a latent variable perspective: the class label is the hidden (latent) variable that explains the observed features. Generative modeling is essentially the art of inverting this generative process.
This generative story can be represented as a probabilistic graphical model (specifically, a Bayesian network). The graph structure encodes conditional independence assumptions:
```
Y
|
v
X = (X₁, X₂, ..., X_d)
```
The class $Y$ is the parent of all features. This means that knowing $Y$ "explains" the features—their joint distribution factors according to the graph structure.
For Naive Bayes, we make the additional assumption that features are conditionally independent given the class:
```
       Y
     / | \
    v  v  v
   X₁  X₂ ... X_d
```
This graphical structure encodes: $P(X_1, \ldots, X_d | Y) = \prod_{j=1}^{d} P(X_j | Y)$
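A minimal sketch of how this factorization translates into code, written as a Gaussian Naive Bayes subclass of the `GenerativeClassifier` base class defined earlier (the `var_smoothing` floor on per-feature variances is an added assumption for numerical stability, not part of the model):

```python
import numpy as np
from typing import List


class GaussianNaiveBayes(GenerativeClassifier):
    """Naive Bayes with per-feature Gaussian class-conditionals."""

    def __init__(self, classes: List[int], var_smoothing: float = 1e-9):
        super().__init__(classes)
        self.var_smoothing = var_smoothing
        self.feature_means = {}   # {class k: (d,) array of μ_kj}
        self.feature_vars = {}    # {class k: (d,) array of σ²_kj}

    def _fit_class_conditionals(self, X: np.ndarray, y: np.ndarray) -> None:
        # Under conditional independence, each feature only needs its own
        # per-class mean and variance (2d parameters per class).
        for k in self.classes:
            X_k = X[y == k]
            self.feature_means[k] = X_k.mean(axis=0)
            self.feature_vars[k] = X_k.var(axis=0) + self.var_smoothing

    def _log_likelihood(self, X: np.ndarray, k: int) -> np.ndarray:
        # Conditional independence: log P(x|Y=k) = Σ_j log N(x_j; μ_kj, σ²_kj)
        mean, var = self.feature_means[k], self.feature_vars[k]
        log_densities = (
            -0.5 * np.log(2 * np.pi * var)
            - 0.5 * (X - mean) ** 2 / var
        )
        return log_densities.sum(axis=1)
```

The product over features in the factorization becomes a sum over per-feature log densities, which is exactly what the final `sum(axis=1)` computes.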
Because generative models capture the full data generation process, they can do more than just classify: (1) Generate samples by running the generative story forward, (2) Handle missing data by marginalizing over unknown features, (3) Detect outliers by identifying points with low P(X), (4) Semi-supervised learning by leveraging unlabeled data to better estimate P(X).
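A brief usage sketch of two of these capabilities, assuming the `GaussianGenerativeClassifier` defined earlier and some synthetic two-class data generated here purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class training data (assumed, for illustration)
X0 = rng.multivariate_normal([0, 0], np.eye(2), size=300)
X1 = rng.multivariate_normal([3, 3], np.eye(2), size=100)
X = np.vstack([X0, X1])
y = np.array([0] * 300 + [1] * 100)

model = GaussianGenerativeClassifier(classes=[0, 1]).fit(X, y)

# (1) Generate samples by running the generative story forward
fake = model.sample(10, class_label=1)
print(fake.shape)

# (3) Detect outliers by scoring log P(x) = log Σ_k P(x|Y=k) P(Y=k)
#     (uses the model's internal log-likelihood method directly)
X_test = np.array([[0.5, 0.5], [10.0, -10.0]])
log_joint = np.column_stack([
    model._log_likelihood(X_test, k) + np.log(model.priors[k])
    for k in model.classes
])
log_px = np.logaddexp(log_joint[:, 0], log_joint[:, 1])
print(log_px)  # the second point receives a far lower log P(x)
```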
Once we've chosen a parametric form for $P(X|Y)$ and $P(Y)$, we need to estimate the parameters from data. The dominant approach is Maximum Likelihood Estimation (MLE).
Given a training set $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\}$ of i.i.d. samples, the likelihood is:
$$L(\theta) = \prod_{i=1}^{n} P(X = x^{(i)}, Y = y^{(i)} | \theta) = \prod_{i=1}^{n} P(X = x^{(i)} | Y = y^{(i)}, \theta) \cdot P(Y = y^{(i)} | \theta)$$
The MLE finds parameters $\hat{\theta}$ that maximize this likelihood (or equivalently, the log-likelihood).
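Taking logs makes the structure of this optimization clear: the class-conditional parameters and the prior parameters appear in separate sums, so each piece can be maximized independently.

$$\log L(\theta) = \sum_{i=1}^{n} \log P(x^{(i)} \mid y^{(i)}, \theta) + \sum_{i=1}^{n} \log P(y^{(i)} \mid \theta)$$

This decoupling is why the MLE recipes below can treat the priors and each class's conditional distribution as separate estimation problems.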
The MLE for class priors is simply the empirical frequency:
$$\hat{P}(Y = k) = \frac{n_k}{n}$$
where $n_k$ is the number of training examples with label $k$, and $n$ is the total number of examples.
This is intuitive: if 30% of training emails are spam, our best estimate for the prior probability of spam is 0.3.
For Gaussian class-conditionals, the MLEs are:
$$\hat{\mu}_k = \frac{1}{n_k} \sum_{i: y^{(i)} = k} x^{(i)}$$
$$\hat{\Sigma}_k = \frac{1}{n_k} \sum_{i: y^{(i)} = k} (x^{(i)} - \hat{\mu}_k)(x^{(i)} - \hat{\mu}_k)^T$$
These are just the sample mean and sample covariance of the examples belonging to each class.
When classes have few training examples, MLEs can be unreliable. Covariance matrices may be singular (non-invertible), and probability estimates may be extreme (0 or 1). Common remedies include: (1) regularization (shrinkage), (2) parameter sharing across classes (as in LDA), (3) Bayesian estimation with priors on parameters, and (4) Laplace smoothing for discrete distributions.
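A minimal sketch of remedy (1), shrinking each class covariance toward a scaled identity target; the shrinkage intensity `alpha` is an assumed hyperparameter, not something prescribed above:

```python
import numpy as np

def shrink_covariance(S: np.ndarray, alpha: float = 0.1) -> np.ndarray:
    """Blend the sample covariance S with a scaled identity target.

    alpha = 0 returns the raw MLE; alpha = 1 returns a spherical covariance
    with the same average variance. Intermediate values trade a little bias
    for a better-conditioned, always-invertible estimate.
    """
    d = S.shape[0]
    target = (np.trace(S) / d) * np.eye(d)
    return (1 - alpha) * S + alpha * target

# Example: a nearly singular 2x2 sample covariance
S = np.array([[1.0, 0.999],
              [0.999, 1.0]])
print(np.linalg.cond(S), np.linalg.cond(shrink_covariance(S)))
```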
| Model | Parameters | MLE Formula |
|---|---|---|
| Prior | $\pi_k = P(Y=k)$ | $\hat{\pi}_k = n_k / n$ |
| Gaussian mean | $\mu_k$ | $\hat{\mu}_k = (1/n_k) \sum_{i:y^{(i)}=k} x^{(i)}$ |
| Gaussian covariance | $\Sigma_k$ | $\hat{\Sigma}_k = (1/n_k) \sum_{i:y^{(i)}=k} (x^{(i)}-\hat{\mu}_k)(x^{(i)}-\hat{\mu}_k)^T$ |
| Multinomial (word) | $\theta_{kw}$ | $(\text{count of word } w \text{ in class } k) / (\text{total words in class } k)$ |
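For the multinomial row, a brief sketch of the smoothed estimate, using Laplace smoothing with pseudo-count 1 as mentioned in the remedies above (the toy counts are illustrative only):

```python
import numpy as np

def multinomial_mle(word_counts: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Estimate θ_kw = P(word w | class k) from a (K, V) count matrix.

    alpha = 0 gives the raw MLE (counts / total words in class);
    alpha = 1 is Laplace smoothing, which prevents unseen words
    from receiving probability exactly zero.
    """
    smoothed = word_counts + alpha
    return smoothed / smoothed.sum(axis=1, keepdims=True)

# Toy counts: 2 classes, vocabulary of 4 words (illustrative numbers)
counts = np.array([[10, 0, 3, 7],
                   [ 1, 8, 8, 3]])
print(multinomial_mle(counts))
```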
Let's consolidate everything into a complete picture of generative classification, from raw data to predictions.
Let's trace through a concrete example. Suppose we have two classes with the following learned parameters:
Class 0 (circles):
Class 1 (triangles):
For a new point $x = [3.5, 3.5]$:
Note how the prior for class 0 partially offsets its lower likelihood, but class 1 still wins due to its much better fit to the data.
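The exact parameter values for this example are not reproduced here, so the sketch below uses assumed placeholder parameters chosen to be consistent with the outcome described above (class 0 more common a priori, class 1 a much better fit at $x = [3.5, 3.5]$):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Assumed illustrative parameters (placeholders, not the original values)
priors = {0: 0.6, 1: 0.4}
means = {0: np.array([1.0, 1.0]), 1: np.array([3.0, 3.0])}
covs = {0: np.eye(2), 1: np.eye(2)}

x = np.array([3.5, 3.5])

# Unnormalized posteriors P(x | Y=k) * P(Y=k)
scores = {
    k: multivariate_normal.pdf(x, mean=means[k], cov=covs[k]) * priors[k]
    for k in (0, 1)
}
total = sum(scores.values())
posteriors = {k: s / total for k, s in scores.items()}
print(posteriors)   # class 1 dominates despite its smaller prior
```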
Because generative models explicitly model $P(X|Y)$ and $P(Y)$, they unlock capabilities that discriminative models cannot easily provide.
These unique capabilities come with a cost: generative models require stronger assumptions about the data distribution. If P(X|Y) is badly misspecified, generative models can perform worse than discriminative alternatives that make minimal assumptions. This tradeoff is central to the generative vs discriminative debate.
We've now established the complete theoretical foundation of generative classification. Let's consolidate the key insights:
What's next:
Now that we understand the generative paradigm, we'll contrast it with the discriminative approach in the next page. Discriminative models take a fundamentally different philosophy: they model $P(Y|X)$ directly without ever modeling how data is generated. This leads to different tradeoffs in accuracy, sample efficiency, and flexibility.
You now understand the fundamental theory of generative classifiers: how they model the joint distribution, use Bayes' theorem for inference, and what unique capabilities this approach provides. Next, we'll explore the contrasting discriminative paradigm.