Every classification algorithm—from logistic regression to deep neural networks—shares a common aspiration: to predict the correct class label for unseen data with the highest possible accuracy. But what does optimal classification actually mean? Is there a theoretical limit to how good any classifier can be? And if so, what classifier achieves it?
These questions lead us to one of the most profound results in statistical learning theory: the Bayes classifier and its associated Bayes decision rule. This isn't just another algorithm to add to your toolkit—it's the theoretical benchmark against which all classifiers are measured. Understanding it provides deep insight into the fundamental nature of classification and the irreducible limits imposed by inherent data uncertainty.
By the end of this page, you will understand the Bayes decision rule and why it produces optimal classification decisions. You'll learn how to derive the rule from first principles using posterior probabilities, explore its geometric interpretation as partitioning feature space, and understand why this theoretical optimum is the standard by which we judge all practical classifiers.
Before deriving the Bayes decision rule, let's establish the precise mathematical framework for classification. While you've likely encountered classification before, appreciating the Bayes classifier requires rigorous foundations.
The Probabilistic Classification Setup:
Consider a classification problem with:
- a feature space $\mathcal{X} \subseteq \mathbb{R}^d$ of feature vectors $x$,
- a finite label space $\mathcal{Y} = \{1, 2, \ldots, K\}$ of class labels $y$, and
- a joint distribution $P(X, Y)$ from which labeled examples are drawn.
Our goal is to construct a classifier $h: \mathcal{X} \to \mathcal{Y}$ that maps feature vectors to class labels. But what makes one classifier better than another?
The entire framework rests on the assumption that data points $(x, y)$ are drawn independently from a fixed joint distribution $P(X, Y)$. This means the relationship between features and labels is governed by probabilistic rules that don't change over time. While this assumption may not hold perfectly in practice (data distributions can shift), it provides the essential foundation for theoretical analysis.
The Risk Framework:
To compare classifiers, we need a measure of quality. The most natural choice is the probability of misclassification (0-1 loss):
$$L(y, \hat{y}) = \mathbb{1}[y \neq \hat{y}] = \begin{cases} 0 & \text{if } y = \hat{y} \\ 1 & \text{if } y \neq \hat{y} \end{cases}$$
The risk (or expected loss) of a classifier $h$ is:
$$R(h) = \mathbb{E}_{(X,Y) \sim P}[L(Y, h(X))] = P(h(X) \neq Y)$$
This is simply the probability that the classifier makes an error on a randomly drawn data point. Our objective is to find the classifier $h^*$ that minimizes this risk:
$$h^* = \arg\min_h R(h) = \arg\min_h P(h(X) \neq Y)$$
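To make the risk concrete, here is a minimal sketch that estimates $R(h) = P(h(X) \neq Y)$ by Monte Carlo. The two-Gaussian generative model and the threshold classifier `h` are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_xy(n: int):
    """Assumed joint distribution: Y ~ Bernoulli(0.5), X | Y=k ~ N(2k, 1)."""
    y = rng.integers(0, 2, size=n)
    x = rng.normal(loc=2.0 * y, scale=1.0)
    return x, y

def h(x: np.ndarray) -> np.ndarray:
    """An arbitrary (not necessarily optimal) classifier: threshold x at 1.0."""
    return (x >= 1.0).astype(int)

x, y = sample_xy(1_000_000)
risk_estimate = np.mean(h(x) != y)   # empirical estimate of P(h(X) != Y)
print(f"Estimated risk R(h) ~ {risk_estimate:.4f}")
```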
The Key Insight: Pointwise Optimization
Here's the crucial observation that leads to the Bayes classifier. The risk can be written as an expectation over the feature space:
$$R(h) = \mathbb{E}_X\left[\mathbb{E}_{Y|X}[L(Y, h(X))]\right] = \int_{\mathcal{X}} \underbrace{P(h(x) \neq Y | X = x)}_{\text{conditional error at } x} \cdot p(x) \, dx$$
Since $p(x) \geq 0$ for all $x$, minimizing the overall risk is equivalent to minimizing the conditional error at each point $x$ separately. This decomposition is powerful: instead of searching over all possible classifiers globally, we can optimize pointwise, choosing the best prediction for each $x$ independently.
The ability to optimize pointwise comes from the structure of the 0-1 loss. For each point $x$, the classifier must output a single class label. The label that minimizes the probability of error at $x$ can be chosen without considering what the classifier does at other points. This decoupling is what makes the Bayes classifier elegant and tractable to analyze.
Now we derive the optimal classifier from first principles. For a fixed feature vector $x$, we want to find the class label that minimizes the conditional probability of error:
$$h^*(x) = \arg\min_{k \in \mathcal{Y}} P(Y \neq k | X = x)$$
Since probabilities sum to one:
$$P(Y \neq k | X = x) = 1 - P(Y = k | X = x)$$
Minimizing this is equivalent to maximizing the posterior probability:
$$h^*(x) = \arg\max_{k \in \mathcal{Y}} P(Y = k | X = x)$$
This is the Bayes decision rule: classify each point $x$ to the class with the highest posterior probability.
$$h^*(x) = \arg\max_{k \in \{1, 2, \ldots, K\}} P(Y = k | X = x)$$
The optimal classifier assigns each point $x$ to whichever class has the highest posterior probability given the observed features. This classifier is called the Bayes classifier or Bayes optimal classifier.
Intuitive Interpretation:
The Bayes decision rule is remarkably intuitive. Given the evidence $x$, we're betting on the most likely class. If we had to make one prediction for each $x$ and wanted to maximize our expected accuracy, we'd pick the class whose posterior probability exceeds that of every other class.
Consider a medical diagnosis example: suppose a patient's test results $x$ yield posteriors $P(\text{disease A} \mid x) = 0.7$, $P(\text{disease B} \mid x) = 0.2$, and $P(\text{healthy} \mid x) = 0.1$. The Bayes rule diagnoses disease A, the single most probable explanation of the evidence.
By always picking the most probable class, we maximize our accuracy over the long run.
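A minimal sketch of that decision, using the hypothetical posteriors from the example above (the numbers are illustrative, not taken from any real data):

```python
import numpy as np

# Hypothetical posteriors P(Y = k | x) for classes [disease A, disease B, healthy]
posteriors = np.array([0.7, 0.2, 0.1])
labels = ["disease A", "disease B", "healthy"]

# Bayes rule: predict the most probable class given the evidence x
prediction = labels[int(np.argmax(posteriors))]
print(prediction)                  # -> disease A
print(1.0 - posteriors.max())      # conditional error at this x: 0.3
```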
Formal Statement and Notation:
Let's formalize this with consistent notation. For each class $k$, define the prior $\pi_k = P(Y = k)$, the class-conditional density $p_k(x) = p(x \mid Y = k)$, and the posterior $\eta_k(x) = P(Y = k \mid X = x)$; the table below summarizes these quantities.
By Bayes' theorem:
$$\eta_k(x) = P(Y = k | X = x) = \frac{p(x | Y = k) \cdot P(Y = k)}{p(x)} = \frac{\pi_k \cdot p_k(x)}{\sum_{j=1}^K \pi_j \cdot p_j(x)}$$
The Bayes classifier becomes:
$$h^*(x) = \arg\max_k \eta_k(x) = \arg\max_k \frac{\pi_k \cdot p_k(x)}{p(x)} = \arg\max_k \left[ \pi_k \cdot p_k(x) \right]$$
Note that the denominator $p(x)$ is constant across classes and can be dropped from the argmax.
| Quantity | Notation | Formula | Interpretation |
|---|---|---|---|
| Prior | $\pi_k$ | $P(Y = k)$ | Base rate of class $k$ in population |
| Class-conditional | $p_k(x)$ | $p(x \mid Y = k)$ | Feature distribution within class $k$ |
| Posterior | $\eta_k(x)$ | $P(Y = k \mid X = x)$ | Updated probability after observing $x$ |
| Marginal | $p(x)$ | $\sum_k \pi_k p_k(x)$ | Overall density of observing $x$ |
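To tie the table together, here is a minimal sketch that computes the posteriors $\eta_k(x)$ from priors and class-conditional densities. The three 1-D Gaussian class-conditionals and the prior values are assumptions made only for illustration:

```python
import numpy as np
from scipy.stats import norm

# Assumed priors pi_k and class-conditional densities p_k(x) (illustrative values)
priors = np.array([0.3, 0.5, 0.2])
class_conditionals = [norm(-2, 1), norm(0, 1), norm(3, 1)]

def posteriors(x: float) -> np.ndarray:
    """eta_k(x) = pi_k * p_k(x) / sum_j pi_j * p_j(x)."""
    joint = priors * np.array([d.pdf(x) for d in class_conditionals])
    return joint / joint.sum()    # dividing by the marginal p(x)

x = 0.5
eta = posteriors(x)
print(eta, eta.sum())                       # a probability vector summing to 1
print("Bayes prediction:", np.argmax(eta))  # same argmax as pi_k * p_k(x) alone
```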
We've claimed that the Bayes classifier is optimal, but let's prove this rigorously and understand exactly what we mean by optimality.
Theorem (Optimality of the Bayes Classifier):
Let $h^*$ be the Bayes classifier and let $h$ be any other classifier. Then:
$$R(h^*) \leq R(h)$$
That is, the Bayes classifier achieves the lowest possible risk (error rate) among all classifiers.
Proof:
For any classifier $h$, the risk can be decomposed as:
$$R(h) = \int_{\mathcal{X}} P(Y \neq h(x) | X = x) \cdot p(x) \, dx$$
At each point $x$, the conditional error is:
$$P(Y \neq h(x) | X = x) = 1 - P(Y = h(x) | X = x) = 1 - \eta_{h(x)}(x)$$
The Bayes classifier chooses $h^*(x) = \arg\max_k \eta_k(x)$, so for any other choice $h(x)$:
$$\eta_{h^*(x)}(x) \geq \eta_{h(x)}(x)$$
Therefore:
$$1 - \eta_{h^*(x)}(x) \leq 1 - \eta_{h(x)}(x)$$
Integrating over $x$:
$$R(h^*) = \int (1 - \eta_{h^*(x)}(x)) \, p(x) \, dx \leq \int (1 - \eta_{h(x)}(x)) \, p(x) \, dx = R(h) \quad \square$$
The Bayes classifier is optimal with respect to 0-1 loss and assuming we know the true posterior probabilities. Different loss functions lead to different optimal classifiers. And in practice, we never know the true posteriors—we must estimate them from data, which introduces error. The Bayes classifier is a theoretical ideal, not a practical algorithm.
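As a rough numerical check of the theorem (not a substitute for the proof), the sketch below simulates an assumed two-Gaussian problem where the Bayes rule has a known closed form and compares its empirical error against a deliberately suboptimal threshold:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed data-generating process: Y ~ Bernoulli(0.5), X | Y=k ~ N(2k, 1)
n = 1_000_000
y = rng.integers(0, 2, size=n)
x = rng.normal(loc=2.0 * y, scale=1.0)

# With equal priors and equal variances, the Bayes rule thresholds at the midpoint 1.0
bayes_pred = (x >= 1.0).astype(int)
other_pred = (x >= 1.8).astype(int)   # any other threshold defines a competing classifier h

print("Bayes rule error:", np.mean(bayes_pred != y))   # ~0.159 (the Bayes error here)
print("Other rule error:", np.mean(other_pred != y))   # larger, up to Monte Carlo noise
```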
Understanding the Optimality:
The proof reveals something profound: no classifier can beat the Bayes classifier, regardless of how sophisticated or computationally expensive it is. Deep neural networks, ensemble methods, kernel machines—all are fundamentally limited by the Bayes error rate.
This might seem discouraging, but it's actually liberating. It tells us that residual error is not necessarily a modeling failure: once a classifier's risk approaches the Bayes risk, the remaining mistakes reflect genuine ambiguity in the data, and progress means closing the gap between $R(h)$ and $R(h^*)$ rather than chasing zero error.
The Bayes classifier is theoretically realizable only when we have complete knowledge of the joint distribution $P(X, Y)$. Since this information is never available in practice, we use approximations—and that's where all of machine learning comes in.
Relating to Information Theory:
The optimality of the Bayes classifier has deep connections to information theory. The posterior $\eta_k(x)$ represents all the information that $x$ provides about $Y$. Any classifier that doesn't use the posterior optimally is, in some sense, throwing away information.
The Bayes decision rule is essentially asking: given everything the features tell us about the class, what's our best guess? By choosing the most probable class, we're making the decision that will be correct most often.
This connects to the concept of sufficient statistics: the posterior $\eta(x)$ is a sufficient statistic for deciding the class label. Any transformation of $x$ that doesn't change the posterior doesn't change the optimal decision.
The Bayes decision rule has an elegant geometric interpretation. It partitions the feature space $\mathcal{X}$ into $K$ decision regions, one for each class:
$$\mathcal{R}_k = \{x \in \mathcal{X} : h^*(x) = k\} = \{x : \eta_k(x) \geq \eta_j(x) \text{ for all } j \neq k\}$$
Each decision region $\mathcal{R}_k$ contains all points where class $k$ has the highest posterior probability. The boundaries between regions—where two or more classes have equal posterior probabilities—form the Bayes decision boundaries.
Decision Boundary Equations:
The boundary between classes $i$ and $j$ is the set where:
$$\eta_i(x) = \eta_j(x)$$
Using Bayes' theorem (and dropping the common denominator):
$$\pi_i \cdot p_i(x) = \pi_j \cdot p_j(x)$$
Taking logs:
$$\log \pi_i + \log p_i(x) = \log \pi_j + \log p_j(x)$$
Rearranging:
$$\log \frac{p_i(x)}{p_j(x)} = \log \frac{\pi_j}{\pi_i}$$
This is the log-likelihood ratio equals the log prior odds. The shape of this boundary depends entirely on the class-conditional distributions $p_i(x)$ and $p_j(x)$.
| Class-Conditional Distribution | Boundary Type | Example Classifier |
|---|---|---|
| Gaussian with equal covariances | Linear (hyperplane) | Linear Discriminant Analysis (LDA) |
| Gaussian with different covariances | Quadratic (conic section) | Quadratic Discriminant Analysis (QDA) |
| Mixture of Gaussians | Complex, nonlinear | GMM-based classifiers |
| Arbitrary distributions | Can be arbitrarily complex | Kernel-based methods, Neural networks |
Visualizing the Bayes Classifier:
Consider binary classification in 2D with Gaussian class-conditionals: class 0 has density $\mathcal{N}(\mu_0, \Sigma_0)$, class 1 has density $\mathcal{N}(\mu_1, \Sigma_1)$, and the priors are $\pi_0$ and $\pi_1$.
If $\Sigma_0 = \Sigma_1 = I$ (equal covariances), the decision boundary is a straight line halfway between the means (adjusted for prior differences). The Bayes classifier assigns points to whichever Gaussian center is closer (in Mahalanobis distance).
If $\Sigma_0 \neq \Sigma_1$, the boundary becomes a quadratic curve—a hyperbola, ellipse, or parabola depending on the covariance matrices.
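To see why equal covariances give a linear boundary while unequal ones do not, expand the log-likelihood ratio for Gaussian class-conditionals $p_k(x) = \mathcal{N}(x; \mu_k, \Sigma)$ with a shared covariance $\Sigma$ (a quick derivation under that assumed model):

$$\log \frac{p_1(x)}{p_0(x)} = -\tfrac{1}{2}(x - \mu_1)^T \Sigma^{-1} (x - \mu_1) + \tfrac{1}{2}(x - \mu_0)^T \Sigma^{-1} (x - \mu_0) = (\mu_1 - \mu_0)^T \Sigma^{-1} x - \tfrac{1}{2}\left(\mu_1^T \Sigma^{-1} \mu_1 - \mu_0^T \Sigma^{-1} \mu_0\right)$$

The quadratic terms $x^T \Sigma^{-1} x$ cancel, so the log-likelihood ratio is linear in $x$, and setting it equal to the log prior odds $\log(\pi_0/\pi_1)$ defines a hyperplane. With different covariances the quadratic terms no longer cancel, which is exactly why the boundary becomes a conic.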
The Bayes decision boundary can be arbitrarily complex depending on the true data distribution. Two overlapping spirals with opposite class labels would have a spiral-shaped Bayes boundary. This is why flexible models like neural networks can outperform simple models—they can approximate arbitrarily complex decision boundaries.
Binary classification ($K = 2$) is both the most common case and the one that admits the simplest analysis. Let's denote the classes as 0 and 1, with posterior:
$$\eta(x) \triangleq P(Y = 1 | X = x)$$
Then $P(Y = 0 | X = x) = 1 - \eta(x)$, and the Bayes decision rule simplifies to:
$$h^*(x) = \begin{cases} 1 & \text{if } \eta(x) \geq 0.5 \\ 0 & \text{if } \eta(x) < 0.5 \end{cases}$$
Or equivalently:
$$h^*(x) = \mathbb{1}[\eta(x) \geq 0.5]$$
The decision boundary is the set where $\eta(x) = 0.5$, i.e., where both classes are equally likely.
The Threshold Perspective:
Binary classification can be viewed through the lens of thresholding a continuous score. The posterior $\eta(x)$ maps each point to $[0, 1]$, and we threshold at 0.5 to make the decision.
This perspective connects thresholding to score-based classification more broadly: any model that estimates the posterior $\eta(x)$, with logistic regression as the canonical example, can be turned into a classifier by thresholding its output.
The threshold 0.5 is optimal only for 0-1 loss. Different thresholds become optimal when misclassification costs are asymmetric (for example, when a false negative is far more costly than a false positive) or when the evaluation criterion is something other than raw accuracy.
Log-Odds Formulation:
Working with the log-odds (logit) is often more convenient:
$$\text{logit}(\eta(x)) = \log \frac{\eta(x)}{1 - \eta(x)} = \log \frac{P(Y = 1 | x)}{P(Y = 0 | x)}$$
Using Bayes' theorem:
$$\text{logit}(\eta(x)) = \log \frac{\pi_1 \cdot p_1(x)}{\pi_0 \cdot p_0(x)} = \underbrace{\log \frac{\pi_1}{\pi_0}}_{\text{log prior odds}} + \underbrace{\log \frac{p_1(x)}{p_0(x)}}_{\text{log-likelihood ratio}}$$
The decision boundary is where this equals zero:
$$\log \frac{p_1(x)}{p_0(x)} = \log \frac{\pi_0}{\pi_1}$$
This formulation is central to logistic regression and log-linear models.
The prior odds $\pi_1 / \pi_0$ shift the decision boundary. In the extreme case where class 1 is very rare ($\pi_1 \ll \pi_0$), the boundary shifts toward class 1, requiring stronger evidence (higher likelihood ratio) to classify as class 1. This is a built-in adjustment for class imbalance—rare classes need more evidence to trigger a positive classification.
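A minimal numeric sketch of that shift, using invented priors: with $\pi_1 = 0.01$, the likelihood ratio must exceed $\pi_0 / \pi_1 = 99$ before the Bayes rule outputs class 1.

```python
def predict_from_likelihood_ratio(likelihood_ratio: float, pi_1: float) -> int:
    """Binary Bayes rule in likelihood-ratio form:
    predict 1 iff p_1(x)/p_0(x) > pi_0/pi_1 (i.e. log LR > log prior odds)."""
    pi_0 = 1.0 - pi_1
    return int(likelihood_ratio > pi_0 / pi_1)

# Rare positive class: pi_1 = 0.01, so the evidence threshold is 0.99 / 0.01 = 99
for lr in [5.0, 50.0, 200.0]:
    print(lr, "->", predict_from_likelihood_ratio(lr, pi_1=0.01))
# Only the likelihood ratio of 200 is strong enough to yield class 1
```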
The Bayes decision rule can be expressed in several equivalent forms, each offering different insights and computational advantages.
Formulation 1: Maximum A Posteriori (MAP)
$$h^*(x) = \arg\max_k P(Y = k | X = x) = \arg\max_k \eta_k(x)$$
This is the most direct expression: choose the class with maximum posterior probability.
Formulation 2: Maximum Posterior-Proportional Score
$$h^*(x) = \arg\max_k \left[ \pi_k \cdot p_k(x) \right]$$
Since the denominator $p(x)$ is constant across classes, we can drop it from the optimization.
Formulation 3: Maximum Log-Score
$$h^*(x) = \arg\max_k \left[ \log \pi_k + \log p_k(x) \right]$$
Taking logs doesn't change the argmax (log is monotonic) and turns products into sums, which is numerically more stable.
```python
import numpy as np
from typing import Callable, List

def bayes_decision_rule_log_score(
    x: np.ndarray,
    priors: np.ndarray,           # Shape: (K,) - prior probabilities
    log_likelihoods: np.ndarray   # Shape: (K,) - log p(x | Y=k) for each class
) -> int:
    """
    Bayes decision rule using the log-score formulation.

    For numerical stability, we work in log-space:
    - Avoid underflow from multiplying small probabilities
    - Addition is cheaper than multiplication

    Parameters:
        x: Feature vector (for logging/debugging)
        priors: Prior probabilities for each class
        log_likelihoods: Log class-conditional probabilities for x

    Returns:
        Class label with maximum posterior probability
    """
    # Log-score: log(prior) + log(likelihood)
    log_scores = np.log(priors + 1e-300) + log_likelihoods

    # MAP decision: argmax of log-scores
    return int(np.argmax(log_scores))


def bayes_classifier(
    X: np.ndarray,                                # Shape: (n, d)
    priors: np.ndarray,                           # Shape: (K,)
    class_conditional_log_pdfs: List[Callable]    # List of K functions
) -> np.ndarray:
    """
    Apply the Bayes decision rule to a dataset.

    Parameters:
        X: Feature matrix
        priors: Prior probabilities
        class_conditional_log_pdfs: Functions that compute log p(x | Y=k)

    Returns:
        Array of predicted class labels
    """
    n = X.shape[0]
    predictions = np.zeros(n, dtype=int)

    for i in range(n):
        x = X[i]
        log_likelihoods = np.array([
            log_pdf(x) for log_pdf in class_conditional_log_pdfs
        ])
        predictions[i] = bayes_decision_rule_log_score(x, priors, log_likelihoods)

    return predictions


# Example: Gaussian class-conditionals
from scipy.stats import multivariate_normal

def make_gaussian_log_pdf(mean: np.ndarray, cov: np.ndarray):
    """Create a log-pdf function for a multivariate Gaussian."""
    dist = multivariate_normal(mean=mean, cov=cov)
    return lambda x: dist.logpdf(x)


# Example usage
if __name__ == "__main__":
    # Two Gaussian classes in 2D
    priors = np.array([0.4, 0.6])   # Class 1 more common
    mean_0 = np.array([0, 0])
    mean_1 = np.array([3, 0])
    cov = np.eye(2)                 # Shared covariance

    log_pdfs = [
        make_gaussian_log_pdf(mean_0, cov),
        make_gaussian_log_pdf(mean_1, cov)
    ]

    # Classify some test points
    test_points = np.array([
        [0, 0],     # Near class 0 mean
        [3, 0],     # Near class 1 mean
        [1.5, 0],   # Midpoint between the means
        [1.5, 2],   # Equidistant from both means, but priors differ
    ])

    predictions = bayes_classifier(test_points, priors, log_pdfs)
    print("Predictions:", predictions)
    # Output: [0, 1, 1, 1] (midpoint goes to class 1 due to higher prior)
```

Formulation 4: Minimum Expected Loss
For general loss functions $L(y, \hat{y})$, the optimal decision at $x$ is:
$$h^*(x) = \arg\min_k \mathbb{E}_{Y|X=x}[L(Y, k)] = \arg\min_k \sum_{j=1}^K L(j, k) \cdot \eta_j(x)$$
For 0-1 loss, $L(j, k) = \mathbb{1}[j \neq k]$, so this reduces to:
$$h^*(x) = \arg\min_k \sum_{j \neq k} \eta_j(x) = \arg\min_k (1 - \eta_k(x)) = \arg\max_k \eta_k(x)$$
This recovers our earlier result, but the formulation also generalizes to asymmetric costs and other loss functions.
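To illustrate the general formulation, here is a small sketch with an invented asymmetric cost matrix, showing how the minimum-expected-loss decision can differ from the plain argmax of the posterior:

```python
import numpy as np

# Hypothetical posteriors eta_j(x) for two classes [0, 1]
eta = np.array([0.7, 0.3])

# Cost matrix L[j, k] = loss of predicting k when the true class is j (illustrative values):
# predicting 0 when the truth is 1 costs 10x more than the reverse mistake
L = np.array([[0.0, 1.0],
              [10.0, 0.0]])

expected_loss = eta @ L   # expected_loss[k] = sum_j L[j, k] * eta_j(x)
print(expected_loss)                                          # [3.0, 0.7]
print("0-1 loss decision:", np.argmax(eta))                   # -> 0
print("Cost-sensitive decision:", np.argmin(expected_loss))   # -> 1
```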
There's a beautiful connection between the Bayes classifier and regression that provides additional insight.
The Regression Function:
In regression, we seek the function that minimizes mean squared error:
$$f^*(x) = \arg\min_f \mathbb{E}[(Y - f(X))^2 | X = x] = \mathbb{E}[Y | X = x]$$
The optimal regression function is the conditional expectation (conditional mean).
Binary Classification as Regression:
For binary classification with labels $\{0, 1\}$, the regression function is:
$$f^*(x) = \mathbb{E}[Y | X = x] = 1 \cdot P(Y = 1 | x) + 0 \cdot P(Y = 0 | x) = \eta(x)$$
The posterior probability is exactly the regression function! The Bayes classifier then thresholds this regression function at 0.5.
This connection explains why logistic regression is effective for classification. It directly models the regression function $\eta(x) = P(Y = 1 | X = x)$ using the logistic (sigmoid) function. By estimating the posterior, it implicitly learns the Bayes classifier's decision rule.
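The sketch below makes that connection concrete for the shared-covariance Gaussian case, where the posterior $\eta(x)$ is exactly a sigmoid of a linear function of $x$. The means, covariance, and priors are assumptions for illustration; this is a closed-form check of the functional form logistic regression uses, not a fitted model:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Assumed generative model: two Gaussian classes sharing one covariance
pi_0, pi_1 = 0.4, 0.6
mu_0, mu_1 = np.array([0.0, 0.0]), np.array([3.0, 0.0])
Sigma = np.eye(2)
Sigma_inv = np.linalg.inv(Sigma)

p_0 = multivariate_normal(mu_0, Sigma)
p_1 = multivariate_normal(mu_1, Sigma)

def eta_direct(x: np.ndarray) -> float:
    """eta(x) = pi_1 p_1(x) / (pi_0 p_0(x) + pi_1 p_1(x))."""
    num = pi_1 * p_1.pdf(x)
    return num / (pi_0 * p_0.pdf(x) + num)

# The same posterior as a sigmoid of a linear score w.x + b (logistic form)
w = Sigma_inv @ (mu_1 - mu_0)
b = np.log(pi_1 / pi_0) - 0.5 * (mu_1 @ Sigma_inv @ mu_1 - mu_0 @ Sigma_inv @ mu_0)

def eta_sigmoid(x: np.ndarray) -> float:
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

x = np.array([1.2, 0.7])
print(eta_direct(x), eta_sigmoid(x))   # the two computations agree
```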
Multi-Class Extension:
For multi-class problems, consider the vector-valued regression function:
$$\boldsymbol{\eta}(x) = (\eta_1(x), \eta_2(x), \ldots, \eta_K(x))^T$$
where $\eta_k(x) = P(Y = k | X = x)$. This vector lies on the probability simplex (entries non-negative, sum to 1).
The Bayes classifier takes the argmax of this vector. Methods like softmax regression directly model this vector-valued function.
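A brief sketch of the vector view, with made-up per-class log-scores: the softmax of $\log \pi_k + \log p_k(x)$ is exactly the posterior vector $\boldsymbol{\eta}(x)$, which lives on the probability simplex, and the Bayes prediction is its argmax.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical per-class log-scores log(pi_k) + log(p_k(x)) for K = 4 classes
log_scores = np.array([-3.2, -1.1, -2.0, -4.5])

eta = softmax(log_scores)    # posterior vector on the probability simplex
print(eta, eta.sum())                        # non-negative entries summing to 1
print("Bayes prediction:", np.argmax(eta))   # same argmax as the raw log-scores
```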
The Pointwise View:
This perspective emphasizes that classification is fundamentally about learning a function—the posterior probability function $\eta(x)$. Once we have $\eta(x)$, classification is trivial (just take argmax). The challenge of machine learning is estimating $\eta(x)$ accurately from finite data.
We've established the theoretical foundation for optimal classification through the Bayes decision rule. Let's consolidate the key insights:
- The Bayes decision rule assigns each point $x$ to the class with the highest posterior probability, $h^*(x) = \arg\max_k \eta_k(x)$.
- Equivalently, it maximizes $\pi_k \cdot p_k(x)$ (or its logarithm), since the marginal $p(x)$ is constant across classes.
- No classifier can achieve lower 0-1 risk, which makes the Bayes classifier the benchmark against which all practical methods are judged.
- Geometrically, the rule partitions feature space into decision regions whose boundaries are determined by the priors and class-conditional distributions.
What's Next:
The Bayes classifier is the theoretical gold standard, but it requires knowing the true posterior probabilities—which we never have in practice. In the next page, we'll explore the optimal classifier concept more deeply, including the Bayes error rate and the fundamental limits it imposes on any learning algorithm.
You now understand the Bayes decision rule and why it produces optimal classification decisions. This theoretical framework will guide our exploration of practical classifiers—each of which can be understood as attempting to approximate the Bayes classifier using limited data.