Recall our wildlife biologist from the previous page. We discussed two approaches to classifying birds: deeply studying each species versus simply finding the boundary that separates them. We've explored the first approach—generative modeling. Now we turn to the second: discriminative modeling.
The discriminative philosophy is elegantly pragmatic: If your goal is classification, why learn more than you need? Rather than modeling the entire data generation process $P(X,Y)$, discriminative models focus exclusively on what matters for prediction: the conditional distribution $P(Y|X)$.
This seemingly simple shift in perspective has profound consequences for how these models learn, what assumptions they make, and when they excel.
By the end of this page, you will understand: (1) The mathematical formulation of discriminative classifiers, (2) Why modeling P(Y|X) directly requires fewer assumptions, (3) Key discriminative algorithms and their loss functions, (4) The decision boundary perspective on classification, and (5) How discriminative models handle complex, high-dimensional feature spaces.
At its heart, a discriminative model asks a fundamentally different question than a generative model:
| Generative | Discriminative |
|---|---|
| "How is data from each class generated?" | "What distinguishes one class from another?" |
| Models the joint $P(X,Y) = P(X \mid Y)\,P(Y)$ | Models the posterior $P(Y \mid X)$ directly |
| Learns full joint distribution | Learns only the decision boundaries |
The discriminative approach is inherently task-focused. If classification is the goal, learning how spam emails are written (the generative approach) gives us more information than we need. We only need to know: given this email's features, is it spam or not?
Think of discriminative models as learning to discriminate between classes without fully understanding either one. A security guard doesn't need to know the life story of every employee—they just need to distinguish employees from intruders. Similarly, discriminative classifiers learn the minimal distinctions needed for accurate classification.
A discriminative classifier directly models the posterior probability $P(Y|X)$, typically through a parametric function:
$$P(Y = k | X; \theta) = f_k(X; \theta)$$
where $f_k$ is some function (often involving an exponential family or neural network) parameterized by $\theta$.
Critically, discriminative models do not require:
- A model of the input distribution $P(X)$
- Class-conditional densities $P(X \mid Y)$
- Distributional assumptions about the features (e.g., Gaussianity or conditional independence)
This reduced modeling burden is both a strength and a limitation, as we'll explore.
| Model | Form of $P(Y=1 \mid X)$ | Loss Function | Decision Boundary |
|---|---|---|---|
| Logistic Regression | $\sigma(w^T X + b)$ | Binary cross-entropy | Linear hyperplane |
| Softmax Regression | $\frac{\exp(w_k^T X)}{\sum_j \exp(w_j^T X)}$ | Categorical cross-entropy | Multiple linear hyperplanes |
| SVM (probabilistic) | Platt scaling on margin | Hinge loss + regularization | Maximum-margin hyperplane |
| Neural Network | $\text{softmax}(f_{\text{NN}}(X; \theta))$ | Cross-entropy | Complex nonlinear manifolds |
| Conditional Random Field | $\frac{1}{Z(X)}\exp(\sum_i \phi_i(X, Y))$ | Negative log-likelihood | Depends on feature functions |
Logistic regression is the canonical discriminative classifier. It models the posterior probability of the positive class as:
$$P(Y = 1 | X) = \sigma(w^T X + b) = \frac{1}{1 + \exp(-(w^T X + b))}$$
where $\sigma(\cdot)$ is the sigmoid (logistic) function, $w \in \mathbb{R}^d$ is the weight vector, and $b$ is the bias term.
Key properties of this formulation:
- **Direct probability modeling:** The output is a probability in $(0, 1)$, and because training maximizes the conditional likelihood, these probabilities are typically well calibrated.
- **Linear log-odds:** The model is linear in log-odds space (a short derivation follows this list): $$\log \frac{P(Y=1|X)}{P(Y=0|X)} = w^T X + b$$
- **Maximum likelihood training:** Parameters are found by minimizing the negative log-likelihood (the binary cross-entropy loss).
- **No assumptions on $P(X)$:** Unlike generative models, we don't need to assume the features are Gaussian, independent, or drawn from any particular distribution.
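The linear log-odds property follows directly from the sigmoid form. Writing $z = w^T X + b$:

$$\frac{P(Y=1 \mid X)}{P(Y=0 \mid X)} = \frac{\sigma(z)}{1 - \sigma(z)} = \frac{1/(1+e^{-z})}{e^{-z}/(1+e^{-z})} = e^{z}, \qquad \text{so} \qquad \log \frac{P(Y=1 \mid X)}{P(Y=0 \mid X)} = w^T X + b.$$

The implementation below puts these pieces together: a from-scratch logistic regression, and its multi-class softmax extension, trained by gradient descent on the cross-entropy loss.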
```python
import numpy as np
from typing import Optional


class LogisticRegressionDiscriminative:
    """
    Logistic regression as a discriminative classifier.

    Models P(Y|X) directly using the logistic function,
    trained by minimizing binary cross-entropy loss.
    """

    def __init__(self, learning_rate: float = 0.1,
                 n_iterations: int = 1000,
                 reg_lambda: float = 0.01):
        """
        Args:
            learning_rate: Step size for gradient descent
            n_iterations: Maximum training iterations
            reg_lambda: L2 regularization strength
        """
        self.lr = learning_rate
        self.n_iterations = n_iterations
        self.reg_lambda = reg_lambda
        self.weights: Optional[np.ndarray] = None
        self.bias: float = 0.0

    def _sigmoid(self, z: np.ndarray) -> np.ndarray:
        """Numerically stable sigmoid function."""
        # Clip to avoid overflow in exp
        z = np.clip(z, -500, 500)
        return 1 / (1 + np.exp(-z))

    def _compute_loss(self, y_true: np.ndarray, y_pred: np.ndarray) -> float:
        """
        Binary cross-entropy loss with L2 regularization.

        L = -1/n * Σ[y*log(p) + (1-y)*log(1-p)] + λ/2 * ||w||²
        """
        # Clip predictions to avoid log(0)
        eps = 1e-15
        y_pred = np.clip(y_pred, eps, 1 - eps)

        # Binary cross-entropy
        bce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

        # L2 regularization
        l2_reg = (self.reg_lambda / 2) * np.sum(self.weights ** 2)
        return bce + l2_reg

    def fit(self, X: np.ndarray, y: np.ndarray) -> 'LogisticRegressionDiscriminative':
        """
        Train logistic regression via gradient descent.

        The gradient of the loss with respect to weights is:
            ∂L/∂w = 1/n * Xᵀ(p - y) + λw
        """
        n_samples, n_features = X.shape

        # Initialize weights
        self.weights = np.zeros(n_features)
        self.bias = 0.0

        for iteration in range(self.n_iterations):
            # Forward pass: compute predictions
            linear = X @ self.weights + self.bias
            predictions = self._sigmoid(linear)

            # Compute gradients
            error = predictions - y
            grad_weights = (1 / n_samples) * (X.T @ error) + self.reg_lambda * self.weights
            grad_bias = (1 / n_samples) * np.sum(error)

            # Update parameters
            self.weights -= self.lr * grad_weights
            self.bias -= self.lr * grad_bias

            # Optional: print loss periodically
            if iteration % 100 == 0:
                loss = self._compute_loss(y, predictions)
                # print(f"Iteration {iteration}, Loss: {loss:.4f}")

        return self

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """Return P(Y=1|X) for each sample."""
        linear = X @ self.weights + self.bias
        prob_class_1 = self._sigmoid(linear)
        return np.column_stack([1 - prob_class_1, prob_class_1])

    def predict(self, X: np.ndarray, threshold: float = 0.5) -> np.ndarray:
        """Predict class labels using probability threshold."""
        proba = self.predict_proba(X)[:, 1]
        return (proba >= threshold).astype(int)

    def decision_function(self, X: np.ndarray) -> np.ndarray:
        """Return the raw linear scores (log-odds)."""
        return X @ self.weights + self.bias


class SoftmaxRegressionDiscriminative:
    """
    Softmax regression for multi-class discriminative classification.

    Models P(Y=k|X) = softmax(Wx + b) for K classes.
    """

    def __init__(self, n_classes: int, learning_rate: float = 0.1,
                 n_iterations: int = 1000, reg_lambda: float = 0.01):
        self.n_classes = n_classes
        self.lr = learning_rate
        self.n_iterations = n_iterations
        self.reg_lambda = reg_lambda
        self.weights: Optional[np.ndarray] = None  # Shape: (n_features, n_classes)
        self.bias: Optional[np.ndarray] = None     # Shape: (n_classes,)

    def _softmax(self, z: np.ndarray) -> np.ndarray:
        """Numerically stable softmax."""
        z_shifted = z - np.max(z, axis=1, keepdims=True)
        exp_z = np.exp(z_shifted)
        return exp_z / np.sum(exp_z, axis=1, keepdims=True)

    def _one_hot(self, y: np.ndarray) -> np.ndarray:
        """Convert class labels to one-hot encoding."""
        n_samples = len(y)
        one_hot = np.zeros((n_samples, self.n_classes))
        one_hot[np.arange(n_samples), y] = 1
        return one_hot

    def fit(self, X: np.ndarray, y: np.ndarray) -> 'SoftmaxRegressionDiscriminative':
        """Train via gradient descent on categorical cross-entropy."""
        n_samples, n_features = X.shape

        # Initialize weights
        self.weights = np.zeros((n_features, self.n_classes))
        self.bias = np.zeros(self.n_classes)

        # One-hot encode labels
        y_one_hot = self._one_hot(y)

        for iteration in range(self.n_iterations):
            # Forward pass
            logits = X @ self.weights + self.bias
            probabilities = self._softmax(logits)

            # Compute gradients
            error = probabilities - y_one_hot  # Shape: (n_samples, n_classes)
            grad_weights = (1 / n_samples) * (X.T @ error) + self.reg_lambda * self.weights
            grad_bias = (1 / n_samples) * np.sum(error, axis=0)

            # Update parameters
            self.weights -= self.lr * grad_weights
            self.bias -= self.lr * grad_bias

        return self

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """Return P(Y=k|X) for all classes."""
        logits = X @ self.weights + self.bias
        return self._softmax(logits)

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict class labels."""
        proba = self.predict_proba(X)
        return np.argmax(proba, axis=1)
```

A powerful way to understand discriminative models is through the lens of decision boundaries. The decision boundary is the hypersurface in feature space where the model's predicted class changes: where the posteriors for different classes are equal.
For binary logistic regression, the decision boundary occurs where:
$$P(Y=1|X) = P(Y=0|X) = 0.5$$
This happens when $w^T X + b = 0$, which defines a hyperplane in feature space. Points on one side are classified as class 1; points on the other side as class 0.
The geometry is elegant:
- The weight vector $w$ is normal (perpendicular) to the decision hyperplane and points toward the class-1 side.
- The bias $b$ shifts the hyperplane away from the origin.
- The signed distance from a point $x$ to the boundary is $(w^T x + b) / \|w\|$; the farther a point lies from the boundary, the more confident the prediction (posterior further from 0.5).
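As a quick sanity check of this geometry, here is a minimal sketch that reuses the `LogisticRegressionDiscriminative` class defined above on a synthetic two-blob dataset (the data and hyperparameters are illustrative assumptions): it confirms that the predicted label depends only on the sign of the raw score, and reports the signed distances to the hyperplane.

```python
import numpy as np

# Synthetic 2-D data: two Gaussian blobs (chosen purely for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 1.0, size=(50, 2)),
               rng.normal([-2, -2], 1.0, size=(50, 2))])
y = np.array([1] * 50 + [0] * 50)

clf = LogisticRegressionDiscriminative(learning_rate=0.5, n_iterations=2000).fit(X, y)

scores = clf.decision_function(X)                 # raw log-odds w^T x + b
distances = scores / np.linalg.norm(clf.weights)  # signed distance to the hyperplane

# The label depends only on which side of the hyperplane a point falls on.
assert np.array_equal((scores >= 0).astype(int), clf.predict(X))
print("training accuracy:", np.mean(clf.predict(X) == y))
print("mean |distance| for class 1 vs class 0:",
      np.abs(distances[y == 1]).mean(), np.abs(distances[y == 0]).mean())
```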
Real-world problems rarely admit clean linear separation. Discriminative models achieve nonlinear boundaries through three main mechanisms:
1. Feature Engineering / Basis Expansion: Transform inputs via polynomial features, radial basis functions, etc.: $$\phi(X) = [1, x_1, x_2, x_1^2, x_1 x_2, x_2^2, \ldots]$$
A linear model in $\phi(X)$ space then traces a nonlinear boundary in the original $X$ space (see the sketch after this list).
2. Kernelization (Implicit Feature Maps): SVMs use the kernel trick to implicitly work in very high (even infinite) dimensional feature spaces without explicit computation.
3. Neural Networks: Deep networks learn hierarchical nonlinear transformations, creating arbitrarily complex decision boundaries through composition of simple nonlinear functions.
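To make mechanism 1 concrete, here is a minimal sketch (reusing the `LogisticRegressionDiscriminative` class from above, with a hypothetical `quadratic_features` helper and synthetic circular data) showing that a plain linear boundary fails on a circularly separated dataset, while the same model on quadratic features succeeds:

```python
import numpy as np

def quadratic_features(X: np.ndarray) -> np.ndarray:
    """Basis expansion phi(x) = [x1, x2, x1^2, x1*x2, x2^2] for 2-D inputs."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1 ** 2, x1 * x2, x2 ** 2])

# Synthetic data: class 1 inside the unit circle, class 0 outside.
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(400, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0).astype(int)

linear_clf = LogisticRegressionDiscriminative(learning_rate=0.3, n_iterations=5000).fit(X, y)
poly_clf = LogisticRegressionDiscriminative(learning_rate=0.3, n_iterations=5000).fit(quadratic_features(X), y)

# The raw-feature model is stuck near the majority-class rate; the expanded model
# can approximate the circular boundary, which is linear in phi(X) space.
print("accuracy with raw features:      ", np.mean(linear_clf.predict(X) == y))
print("accuracy with quadratic features:", np.mean(poly_clf.predict(quadratic_features(X)) == y))
```

The table below summarizes how these mechanisms trade expressiveness against training cost.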
| Model | Decision Boundary Type | Expressiveness | Training Complexity |
|---|---|---|---|
| Logistic Regression | Linear hyperplane | Limited (linear) | O(ndk) - fast |
| Polynomial Logistic | Polynomial surfaces | Moderate | O(nd^p k) - polynomial features |
| Kernel SVM (RBF) | Smooth nonlinear manifolds | High (universal approximator) | O(n²d) to O(n³) - expensive |
| Neural Network | Arbitrary complex surfaces | Very high | O(ndk) per epoch - many epochs |
| Decision Tree | Axis-aligned rectangular regions | High but constrained geometry | O(nd log n) - fast |
Discriminative models are typically trained by minimizing a loss function that measures how well the model's predictions match the true labels. This optimization perspective is foundational to understanding discriminative learning.
Given training data $\{(x^{(i)}, y^{(i)})\}_{i=1}^n$ and a loss function $L(\hat{y}, y)$, we seek parameters $\theta$ that minimize the average loss:
$$\hat{\theta} = \arg\min_\theta \frac{1}{n} \sum_{i=1}^{n} L(f(x^{(i)}; \theta), y^{(i)}) + \lambda R(\theta)$$
where $R(\theta)$ is a regularization term (e.g., $||\theta||^2$ for L2 regularization).
The true objective in classification is to minimize 0-1 loss (number of errors). However, 0-1 loss is non-convex and non-differentiable. Cross-entropy, hinge, and other losses are 'surrogate losses'—differentiable upper bounds on 0-1 loss that are tractable to optimize. Minimizing the surrogate approximately minimizes the true objective.
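A small numeric sketch of this idea, using the margin convention $m = \tilde{y}\, f(x)$ with $\tilde{y} \in \{-1, +1\}$ (a relabeling of the $\{0, 1\}$ targets used elsewhere on this page), tabulates the 0-1 loss against two common surrogates:

```python
import numpy as np

# Margin m = y * f(x) with y in {-1, +1}: positive means correctly classified.
margins = np.linspace(-2.0, 2.0, 9)

zero_one = (margins <= 0).astype(float)      # the true objective: a step, flat almost everywhere
logistic = np.log2(1.0 + np.exp(-margins))   # base-2 log so it equals 1 at m = 0
hinge = np.maximum(0.0, 1.0 - margins)       # SVM surrogate

print(" margin   0-1   logistic   hinge")
for m, z, lg, h in zip(margins, zero_one, logistic, hinge):
    print(f"{m:+7.2f}   {z:3.0f}   {lg:8.3f}   {h:5.3f}")
# Both surrogates upper-bound the 0-1 loss and are convex in the margin,
# which is what makes gradient-based training tractable.
```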
Once we have a differentiable loss, standard optimization applies:
| Algorithm | Update Rule | Properties |
|---|---|---|
| Gradient Descent | $\theta \leftarrow \theta - \eta \nabla L(\theta)$ | Full batch, guaranteed convergence for convex |
| Stochastic GD | $\theta \leftarrow \theta - \eta \nabla L_i(\theta)$ | Single sample, noisy but fast |
| Mini-batch SGD | $\theta \leftarrow \theta - \eta (1/B) \sum_{i \in \text{batch}} \nabla L_i$ | Balance of both |
| Adam | Adaptive learning rates + momentum | Robust default for deep learning |
| L-BFGS | Quasi-Newton, approximate Hessian | Fast for convex, moderate dimensions |
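As a concrete reference point for the mini-batch row, here is a minimal sketch of mini-batch SGD on the regularized logistic loss used throughout this page (the function name and default values are illustrative assumptions):

```python
import numpy as np

def minibatch_sgd_logistic(X: np.ndarray, y: np.ndarray, lr: float = 0.1,
                           batch_size: int = 32, n_epochs: int = 50,
                           reg_lambda: float = 0.01, seed: int = 0):
    """Mini-batch SGD on the L2-regularized binary cross-entropy loss (a sketch)."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w, b = np.zeros(n_features), 0.0
    for _ in range(n_epochs):
        order = rng.permutation(n_samples)            # reshuffle each epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            p = 1.0 / (1.0 + np.exp(-np.clip(Xb @ w + b, -500, 500)))
            error = p - yb                            # gradient of BCE w.r.t. the logits
            w -= lr * ((Xb.T @ error) / len(idx) + reg_lambda * w)
            b -= lr * error.mean()
    return w, b
```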
For logistic regression with cross-entropy loss, the optimization problem is convex, so gradient-based methods converge to a global optimum (and with L2 regularization that optimum is unique). This is a significant practical advantage over more complex models.
Understanding the limitations of discriminative models is as important as understanding their strengths. By focusing exclusively on $P(Y|X)$, discriminative models deliberately ignore certain aspects of the data.
Consider what different modeling approaches capture:
$$P(X, Y) \;\Longrightarrow\; \bigl(P(X \mid Y),\; P(Y)\bigr) \;\Longrightarrow\; P(Y \mid X)$$
Each arrow reads "determines": the quantities on the left pin down those on the right, but not conversely.
Generative models that learn the joint distribution $P(X,Y)$ can always derive $P(Y|X)$ via Bayes' theorem. But the reverse is not true—$P(Y|X)$ alone cannot recover the full joint distribution.
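Concretely, Bayes' theorem recovers the posterior from the generative pieces,

$$P(Y = k \mid X) = \frac{P(X \mid Y = k)\, P(Y = k)}{\sum_{j} P(X \mid Y = j)\, P(Y = j)},$$

but no analogous formula reconstructs $P(X \mid Y)$ or $P(X)$ from $P(Y \mid X)$ alone.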
Discriminative models have fewer assumptions about P(X|Y), but they still make assumptions! Logistic regression assumes linear log-odds. SVMs assume a good kernel exists. Neural networks assume the architecture is appropriate. The 'fewer assumptions' advantage is relative, not absolute.
A key advantage of discriminative models is their flexibility in controlling model capacity—the ability to learn increasingly complex decision boundaries through various mechanisms.
Model capacity directly relates to the bias-variance tradeoff:
| Low Capacity | High Capacity |
|---|---|
| Simple boundaries (e.g., linear) | Complex boundaries (e.g., deep networks) |
| High bias, low variance | Low bias, high variance |
| May underfit | May overfit |
| Good with limited data | Needs more data |
Discriminative models offer fine-grained control over this tradeoff through architecture choices, regularization, and feature engineering.
1. Regularization: Add penalty terms to prevent weights from becoming too large: L2 (ridge) shrinks all weights proportionally, L1 (lasso) pushes small weights exactly to zero and yields sparse solutions, and elastic net combines the two.
2. Architecture Constraints: Limit capacity directly, e.g., by restricting the degree of a polynomial feature expansion or the number and width of a network's hidden layers.
3. Early Stopping: Monitor validation loss and stop training before overfitting occurs (see the sketch after this list).
4. Ensemble Methods: Combine multiple discriminative models to reduce variance (bagging) or bias (boosting).
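For mechanism 3, here is a minimal sketch of patience-based early stopping for the logistic model (the function names, the patience rule, and the defaults are illustrative assumptions, not a prescribed recipe):

```python
import numpy as np

def bce_loss(w: np.ndarray, b: float, X: np.ndarray, y: np.ndarray) -> float:
    """Binary cross-entropy of a linear-sigmoid model on a dataset."""
    p = 1.0 / (1.0 + np.exp(-np.clip(X @ w + b, -500, 500)))
    p = np.clip(p, 1e-15, 1 - 1e-15)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def fit_with_early_stopping(X_tr, y_tr, X_val, y_val,
                            lr: float = 0.1, max_epochs: int = 5000, patience: int = 20):
    """Full-batch gradient descent, stopped when validation loss stalls (a sketch)."""
    n, d = X_tr.shape
    w, b = np.zeros(d), 0.0
    best_loss, best_w, best_b, since_best = np.inf, w.copy(), b, 0
    for _ in range(max_epochs):
        p = 1.0 / (1.0 + np.exp(-np.clip(X_tr @ w + b, -500, 500)))
        error = p - y_tr
        w -= lr * (X_tr.T @ error) / n
        b -= lr * error.mean()
        val_loss = bce_loss(w, b, X_val, y_val)
        if val_loss < best_loss:               # new best: remember these parameters
            best_loss, best_w, best_b, since_best = val_loss, w.copy(), b, 0
        else:
            since_best += 1
            if since_best >= patience:         # no improvement for `patience` epochs
                break
    return best_w, best_b
```

Returning to mechanism 1, the block below compares how L2, L1, and elastic-net penalties contribute different terms to the gradient.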
```python
import numpy as np

def regularized_logistic_gradient(X: np.ndarray, y: np.ndarray,
                                  weights: np.ndarray, bias: float,
                                  reg_type: str = 'l2',
                                  reg_lambda: float = 0.01) -> tuple:
    """
    Compute gradients with different regularization types.

    Demonstrates how regularization adds gradient terms that pull
    weights towards zero, controlling model capacity.
    """
    n_samples = len(y)

    # Forward pass
    linear = X @ weights + bias
    predictions = 1 / (1 + np.exp(-np.clip(linear, -500, 500)))

    # Base gradients (from cross-entropy loss)
    error = predictions - y
    grad_w_base = (1 / n_samples) * (X.T @ error)
    grad_b = (1 / n_samples) * np.sum(error)

    # Add regularization gradient
    if reg_type == 'l2':
        # L2: ∂/∂w (λ/2 ||w||²) = λw
        # Effect: Linear shrinkage, all weights reduced proportionally
        reg_grad = reg_lambda * weights
    elif reg_type == 'l1':
        # L1: ∂/∂w (λ ||w||₁) = λ sign(w)
        # Effect: Constant push towards zero, creates sparse solutions
        reg_grad = reg_lambda * np.sign(weights)
    elif reg_type == 'elastic_net':
        # Elastic Net: combines both
        l1_ratio = 0.5  # Balance between L1 and L2
        reg_grad = (reg_lambda * l1_ratio * np.sign(weights)
                    + reg_lambda * (1 - l1_ratio) * weights)
    else:
        reg_grad = 0

    grad_w = grad_w_base + reg_grad
    return grad_w, grad_b


def analyze_regularization_effects(weights: np.ndarray, reg_lambda: float):
    """
    Show how different regularization affects the loss landscape.
    """
    print("Regularization Analysis:")
    print(f"Original weights: {weights}")
    print(f"L2 norm: ||w||² = {np.sum(weights**2):.4f}")
    print(f"L1 norm: ||w||₁ = {np.sum(np.abs(weights)):.4f}")
    print()

    # L2 regularization penalty
    l2_penalty = (reg_lambda / 2) * np.sum(weights**2)
    print(f"L2 penalty (λ={reg_lambda}): {l2_penalty:.4f}")

    # L1 regularization penalty
    l1_penalty = reg_lambda * np.sum(np.abs(weights))
    print(f"L1 penalty (λ={reg_lambda}): {l1_penalty:.4f}")

    # L1 tends to zero out small weights
    l1_threshold = reg_lambda  # Weights smaller than this tend to become zero
    sparse_count = np.sum(np.abs(weights) < l1_threshold)
    print(f"Weights likely zeroed by L1 (|w| < {l1_threshold}): {sparse_count}")
```

The most powerful modern discriminative classifiers are neural networks. They extend the discriminative framework by learning hierarchical feature representations alongside the classification boundary.
Logistic regression can be viewed as a single-layer neural network: $$P(Y=1|X) = \sigma(w^T X + b)$$
A deep network adds hidden layers that transform the input: $$P(Y=1|X) = \sigma(w^{(L)T} \cdot h^{(L-1)} + b^{(L)})$$
where $h^{(l)} = \text{activation}(W^{(l)} h^{(l-1)} + b^{(l)})$ for layers $l = 1, \ldots, L-1$, with $h^{(0)} = X$.
The key insight: the hidden layers learn a nonlinear transformation $\phi(X)$ that makes the classification problem (approximately) linearly separable in the transformed space.
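A minimal sketch makes this composition explicit (the one-hidden-layer architecture and the random, untrained weights are purely illustrative assumptions): the hidden layer computes $\phi(X)$, and the output layer is logistic regression on top of it.

```python
import numpy as np

rng = np.random.default_rng(42)

def relu(z: np.ndarray) -> np.ndarray:
    return np.maximum(0.0, z)

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

# One hidden layer with 16 units: X -> phi(X) -> P(Y=1|X).
n_features, n_hidden = 2, 16
W1, b1 = rng.normal(0, 1.0, size=(n_features, n_hidden)), np.zeros(n_hidden)
w2, b2 = rng.normal(0, 1.0, size=n_hidden), 0.0

def forward(X: np.ndarray) -> np.ndarray:
    """P(Y=1|X) for a one-hidden-layer discriminative network."""
    phi = relu(X @ W1 + b1)          # learned nonlinear feature map phi(X)
    return sigmoid(phi @ w2 + b2)    # logistic regression in phi(X) space

X = rng.normal(size=(5, n_features))
print(forward(X))                    # five posterior probabilities in (0, 1)
```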
Deep networks don't just learn decision boundaries—they learn useful representations. The hidden layers extract hierarchical features (edges → shapes → objects in vision; words → phrases → semantics in text) that make classification easier. This 'learned feature engineering' is why deep learning dominates complex perception tasks.
The Universal Approximation Theorem states that a neural network with a single hidden layer and sufficient neurons can approximate any continuous function to arbitrary accuracy. With multiple layers, networks can achieve the same approximation with exponentially fewer neurons for many function classes.
This means deep discriminative models can, in principle, learn any decision boundary—given enough data and appropriate training.
Modern large-scale classification (ImageNet-scale vision, language models, speech recognition) is fundamentally discriminative: these systems are trained end-to-end to minimize a cross-entropy loss over labels, which is exactly modeling $P(Y \mid X)$ directly at massive scale.
We've now established the theoretical foundation of discriminative classification. Let's consolidate the key insights:
- Discriminative models learn the posterior $P(Y \mid X)$ directly rather than the joint distribution $P(X, Y)$.
- They make no assumptions about $P(X)$ or $P(X \mid Y)$, though they still assume a functional form for the decision rule.
- Training means minimizing a surrogate loss (cross-entropy, hinge) plus regularization; for linear models this is a convex problem.
- Geometrically, they learn decision boundaries, from linear hyperplanes to the arbitrarily complex surfaces of deep networks.
- Capacity is controlled through regularization, architecture choices, early stopping, and ensembles.
What's next:
Now that we understand both paradigms, we'll directly compare their pros and cons. When does the generative approach's additional modeling effort pay off? When does the discriminative approach's flexibility win? These questions are surprisingly nuanced, and the answers depend on data availability, model specification, and the specific task at hand.
You now understand the fundamental theory of discriminative classifiers: how they model P(Y|X) directly, learn decision boundaries via loss minimization, and scale to deep architectures. Next, we'll compare the two paradigms head-to-head.