Every classification algorithm—from logistic regression to deep neural networks—shares a common aspiration: to predict the correct class label for unseen data with the highest possible accuracy. But what does optimal classification actually mean? Is there a theoretical limit to how good any classifier can be? And if so, what classifier achieves it?
These questions lead us to one of the most profound results in statistical learning theory: the Bayes classifier and its associated Bayes decision rule. This isn't just another algorithm to add to your toolkit—it's the theoretical benchmark against which all classifiers are measured. Understanding it provides deep insight into the fundamental nature of classification and the irreducible limits imposed by inherent data uncertainty.
By the end of this page, you will understand the Bayes decision rule and why it produces optimal classification decisions. You'll learn how to derive the rule from first principles using posterior probabilities, explore its geometric interpretation as partitioning feature space, and understand why this theoretical optimum is the standard by which we judge all practical classifiers.
Before deriving the Bayes decision rule, let's establish the precise mathematical framework for classification. While you've likely encountered classification before, appreciating the Bayes classifier requires rigorous foundations.
The Probabilistic Classification Setup:
Consider a classification problem with:
- a feature space $\mathcal{X} \subseteq \mathbb{R}^d$ of feature vectors $x$,
- a finite label space $\mathcal{Y} = \{1, 2, \ldots, K\}$ of class labels $y$, and
- a joint distribution $P(X, Y)$ from which labeled examples are drawn.
Our goal is to construct a classifier $h: \mathcal{X} \to \mathcal{Y}$ that maps feature vectors to class labels. But what makes one classifier better than another?
The entire framework rests on the assumption that data points $(x, y)$ are drawn independently from a fixed joint distribution $P(X, Y)$. This means the relationship between features and labels is governed by probabilistic rules that don't change over time. While this assumption may not hold perfectly in practice (data distributions can shift), it provides the essential foundation for theoretical analysis.
The Risk Framework:
To compare classifiers, we need a measure of quality. The most natural choice is the probability of misclassification (0-1 loss):
$$L(y, \hat{y}) = \mathbb{1}[y \neq \hat{y}] = \begin{cases} 0 & \text{if } y = \hat{y} \\ 1 & \text{if } y \neq \hat{y} \end{cases}$$
The risk (or expected loss) of a classifier $h$ is:
$$R(h) = \mathbb{E}_{(X,Y) \sim P}[L(Y, h(X))] = P(h(X) \neq Y)$$
This is simply the probability that the classifier makes an error on a randomly drawn data point. Our objective is to find the classifier $h^*$ that minimizes this risk:
$$h^* = \arg\min_h R(h) = \arg\min_h P(h(X) \neq Y)$$
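To make the risk concrete, here is a minimal sketch that estimates $R(h) = P(h(X) \neq Y)$ by Monte Carlo. The two-Gaussian generative model and the threshold classifier `h` are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_xy(n: int):
    """Assumed joint distribution: Y ~ Bernoulli(0.5), X | Y=k ~ N(2k, 1)."""
    y = rng.integers(0, 2, size=n)
    x = rng.normal(loc=2.0 * y, scale=1.0)
    return x, y

def h(x: np.ndarray) -> np.ndarray:
    """An arbitrary (not necessarily optimal) classifier: threshold x at 1.0."""
    return (x >= 1.0).astype(int)

x, y = sample_xy(1_000_000)
risk_estimate = np.mean(h(x) != y)   # empirical estimate of P(h(X) != Y)
print(f"Estimated risk R(h) ~ {risk_estimate:.4f}")
```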
The Key Insight: Pointwise Optimization
Here's the crucial observation that leads to the Bayes classifier. The risk can be written as an expectation over the feature space:
$$R(h) = \mathbb{E}_X\left[\mathbb{E}_{Y|X}[L(Y, h(X))]\right] = \int_{\mathcal{X}} \underbrace{P(h(x) \neq Y | X = x)}_{\text{conditional error at } x} \cdot p(x) \, dx$$
Since $p(x) \geq 0$ for all $x$, minimizing the overall risk is equivalent to minimizing the conditional error at each point $x$ separately. This decomposition is powerful: instead of searching over all possible classifiers globally, we can optimize pointwise, choosing the best prediction for each $x$ independently.
The ability to optimize pointwise comes from the structure of the 0-1 loss. For each point $x$, the classifier must output a single class label. The label that minimizes the probability of error at $x$ can be chosen without considering what the classifier does at other points. This decoupling is what makes the Bayes classifier elegant and tractable to analyze.
Now we derive the optimal classifier from first principles. For a fixed feature vector $x$, we want to find the class label that minimizes the conditional probability of error:
$$h^*(x) = \arg\min_{k \in \mathcal{Y}} P(Y \neq k | X = x)$$
Since probabilities sum to one:
$$P(Y \neq k | X = x) = 1 - P(Y = k | X = x)$$
Minimizing this is equivalent to maximizing the posterior probability:
$$h^*(x) = \arg\max_{k \in \mathcal{Y}} P(Y = k | X = x)$$
This is the Bayes decision rule: classify each point $x$ to the class with the highest posterior probability.
$$h^*(x) = \arg\max_{k \in \{1, 2, \ldots, K\}} P(Y = k | X = x)$$
The optimal classifier assigns each point $x$ to whichever class has the highest posterior probability given the observed features. This classifier is called the Bayes classifier or Bayes optimal classifier.
Intuitive Interpretation:
The Bayes decision rule is remarkably intuitive. Given the evidence $x$, we're betting on the most likely class. If we had to make one prediction for each $x$ and wanted to maximize our expected accuracy, we'd pick the class whose posterior probability exceeds that of every other class.
Consider a medical diagnosis example: suppose a patient's test results $x$ yield posteriors $P(\text{disease A} \mid x) = 0.7$, $P(\text{disease B} \mid x) = 0.2$, and $P(\text{healthy} \mid x) = 0.1$. The Bayes rule diagnoses disease A, the single most probable explanation of the evidence.
By always picking the most probable class, we maximize our accuracy over the long run.
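A minimal sketch of that decision, using the hypothetical posteriors from the example above (the numbers are illustrative, not taken from any real data):

```python
import numpy as np

# Hypothetical posteriors P(Y = k | x) for classes [disease A, disease B, healthy]
posteriors = np.array([0.7, 0.2, 0.1])
labels = ["disease A", "disease B", "healthy"]

# Bayes rule: predict the most probable class given the evidence x
prediction = labels[int(np.argmax(posteriors))]
print(prediction)                  # -> disease A
print(1.0 - posteriors.max())      # conditional error at this x: 0.3
```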
Formal Statement and Notation:
Let's formalize this with consistent notation. For each class $k$, define the prior $\pi_k = P(Y = k)$, the class-conditional density $p_k(x) = p(x \mid Y = k)$, and the posterior $\eta_k(x) = P(Y = k \mid X = x)$; the table below summarizes these quantities.
By Bayes' theorem:
$$\eta_k(x) = P(Y = k | X = x) = \frac{p(x | Y = k) \cdot P(Y = k)}{p(x)} = \frac{\pi_k \cdot p_k(x)}{\sum_{j=1}^K \pi_j \cdot p_j(x)}$$
The Bayes classifier becomes:
$$h^*(x) = \arg\max_k \eta_k(x) = \arg\max_k \frac{\pi_k \cdot p_k(x)}{p(x)} = \arg\max_k \left[ \pi_k \cdot p_k(x) \right]$$
Note that the denominator $p(x)$ is constant across classes and can be dropped from the argmax.
| Quantity | Notation | Formula | Interpretation |
|---|---|---|---|
| Prior | $\pi_k$ | $P(Y = k)$ | Base rate of class $k$ in population |
| Class-conditional | $p_k(x)$ | $p(x \mid Y = k)$ | Feature distribution within class $k$ |
| Posterior | $\eta_k(x)$ | $P(Y = k \mid X = x)$ | Updated probability after observing $x$ |
| Marginal | $p(x)$ | $\sum_k \pi_k p_k(x)$ | Overall density of observing $x$ |
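To tie the table together, here is a minimal sketch that computes the posteriors $\eta_k(x)$ from priors and class-conditional densities. The three 1-D Gaussian class-conditionals and the prior values are assumptions made only for illustration:

```python
import numpy as np
from scipy.stats import norm

# Assumed priors pi_k and class-conditional densities p_k(x) (illustrative values)
priors = np.array([0.3, 0.5, 0.2])
class_conditionals = [norm(-2, 1), norm(0, 1), norm(3, 1)]

def posteriors(x: float) -> np.ndarray:
    """eta_k(x) = pi_k * p_k(x) / sum_j pi_j * p_j(x)."""
    joint = priors * np.array([d.pdf(x) for d in class_conditionals])
    return joint / joint.sum()    # dividing by the marginal p(x)

x = 0.5
eta = posteriors(x)
print(eta, eta.sum())                       # a probability vector summing to 1
print("Bayes prediction:", np.argmax(eta))  # same argmax as pi_k * p_k(x) alone
```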
We've claimed that the Bayes classifier is optimal, but let's prove this rigorously and understand exactly what we mean by optimality.
Theorem (Optimality of the Bayes Classifier):
Let $h^*$ be the Bayes classifier and let $h$ be any other classifier. Then:
$$R(h^*) \leq R(h)$$
That is, the Bayes classifier achieves the lowest possible risk (error rate) among all classifiers.
Proof:
For any classifier $h$, the risk can be decomposed as:
$$R(h) = \int_{\mathcal{X}} P(Y \neq h(x) | X = x) \cdot p(x) \, dx$$
At each point $x$, the conditional error is:
$$P(Y \neq h(x) | X = x) = 1 - P(Y = h(x) | X = x) = 1 - \eta_{h(x)}(x)$$
The Bayes classifier chooses $h^*(x) = \arg\max_k \eta_k(x)$, so for any other choice $h(x)$:
$$\eta_{h^*(x)}(x) \geq \eta_{h(x)}(x)$$
Therefore:
$$1 - \eta_{h^*(x)}(x) \leq 1 - \eta_{h(x)}(x)$$
Integrating over $x$:
$$R(h^*) = \int (1 - \eta_{h^*(x)}(x)) \, p(x) \, dx \leq \int (1 - \eta_{h(x)}(x)) \, p(x) \, dx = R(h) \quad \square$$
The Bayes classifier is optimal with respect to 0-1 loss and assuming we know the true posterior probabilities. Different loss functions lead to different optimal classifiers. And in practice, we never know the true posteriors—we must estimate them from data, which introduces error. The Bayes classifier is a theoretical ideal, not a practical algorithm.
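As a rough numerical check of the theorem (not a substitute for the proof), the sketch below simulates an assumed two-Gaussian problem where the Bayes rule has a known closed form and compares its empirical error against a deliberately suboptimal threshold:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed data-generating process: Y ~ Bernoulli(0.5), X | Y=k ~ N(2k, 1)
n = 1_000_000
y = rng.integers(0, 2, size=n)
x = rng.normal(loc=2.0 * y, scale=1.0)

# With equal priors and equal variances, the Bayes rule thresholds at the midpoint 1.0
bayes_pred = (x >= 1.0).astype(int)
other_pred = (x >= 1.8).astype(int)   # any other threshold defines a competing classifier h

print("Bayes rule error:", np.mean(bayes_pred != y))   # ~0.159 (the Bayes error here)
print("Other rule error:", np.mean(other_pred != y))   # larger, up to Monte Carlo noise
```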
Understanding the Optimality:
The proof reveals something profound: no classifier can beat the Bayes classifier, regardless of how sophisticated or computationally expensive it is. Deep neural networks, ensemble methods, kernel machines—all are fundamentally limited by the Bayes error rate.
This might seem discouraging, but it's actually liberating. It tells us that residual error is not necessarily a modeling failure: once a classifier's risk approaches the Bayes risk, the remaining mistakes reflect genuine ambiguity in the data, and progress means closing the gap between $R(h)$ and $R(h^*)$ rather than chasing zero error.
The Bayes classifier is theoretically realizable only when we have complete knowledge of the joint distribution $P(X, Y)$. Since this information is never available in practice, we use approximations—and that's where all of machine learning comes in.
Relating to Information Theory:
The optimality of the Bayes classifier has deep connections to information theory. The posterior $\eta_k(x)$ represents all the information that $x$ provides about $Y$. Any classifier that doesn't use the posterior optimally is, in some sense, throwing away information.
The Bayes decision rule is essentially asking: given everything the features tell us about the class, what's our best guess? By choosing the most probable class, we're making the decision that will be correct most often.
This connects to the concept of sufficient statistics: the posterior $\eta(x)$ is a sufficient statistic for deciding the class label. Any transformation of $x$ that doesn't change the posterior doesn't change the optimal decision.
The Bayes decision rule has an elegant geometric interpretation. It partitions the feature space $\mathcal{X}$ into $K$ decision regions, one for each class:
$$\mathcal{R}_k = \{x \in \mathcal{X} : h^*(x) = k\} = \{x : \eta_k(x) \geq \eta_j(x) \text{ for all } j \neq k\}$$
Each decision region $\mathcal{R}_k$ contains all points where class $k$ has the highest posterior probability. The boundaries between regions—where two or more classes have equal posterior probabilities—form the Bayes decision boundaries.
Decision Boundary Equations:
The boundary between classes $i$ and $j$ is the set where:
$$\eta_i(x) = \eta_j(x)$$
Using Bayes' theorem (and dropping the common denominator):
$$\pi_i \cdot p_i(x) = \pi_j \cdot p_j(x)$$
Taking logs:
$$\log \pi_i + \log p_i(x) = \log \pi_j + \log p_j(x)$$
Rearranging:
$$\log \frac{p_i(x)}{p_j(x)} = \log \frac{\pi_j}{\pi_i}$$
This is the log-likelihood ratio equals the log prior odds. The shape of this boundary depends entirely on the class-conditional distributions $p_i(x)$ and $p_j(x)$.
| Class-Conditional Distribution | Boundary Type | Example Classifier |
|---|---|---|
| Gaussian with equal covariances | Linear (hyperplane) | Linear Discriminant Analysis (LDA) |
| Gaussian with different covariances | Quadratic (conic section) | Quadratic Discriminant Analysis (QDA) |
| Mixture of Gaussians | Complex, nonlinear | GMM-based classifiers |
| Arbitrary distributions | Can be arbitrarily complex | Kernel-based methods, Neural networks |
Visualizing the Bayes Classifier:
Consider binary classification in 2D with Gaussian class-conditionals: class 0 has density $\mathcal{N}(\mu_0, \Sigma_0)$, class 1 has density $\mathcal{N}(\mu_1, \Sigma_1)$, and the priors are $\pi_0$ and $\pi_1$.
If $\Sigma_0 = \Sigma_1 = I$ (equal covariances), the decision boundary is a straight line halfway between the means (adjusted for prior differences). The Bayes classifier assigns points to whichever Gaussian center is closer (in Mahalanobis distance).
If $\Sigma_0 \neq \Sigma_1$, the boundary becomes a quadratic curve—a hyperbola, ellipse, or parabola depending on the covariance matrices.
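To see why equal covariances give a linear boundary while unequal ones do not, expand the log-likelihood ratio for Gaussian class-conditionals $p_k(x) = \mathcal{N}(x; \mu_k, \Sigma)$ with a shared covariance $\Sigma$ (a quick derivation under that assumed model):

$$\log \frac{p_1(x)}{p_0(x)} = -\tfrac{1}{2}(x - \mu_1)^T \Sigma^{-1} (x - \mu_1) + \tfrac{1}{2}(x - \mu_0)^T \Sigma^{-1} (x - \mu_0) = (\mu_1 - \mu_0)^T \Sigma^{-1} x - \tfrac{1}{2}\left(\mu_1^T \Sigma^{-1} \mu_1 - \mu_0^T \Sigma^{-1} \mu_0\right)$$

The quadratic terms $x^T \Sigma^{-1} x$ cancel, so the log-likelihood ratio is linear in $x$, and setting it equal to the log prior odds $\log(\pi_0/\pi_1)$ defines a hyperplane. With different covariances the quadratic terms no longer cancel, which is exactly why the boundary becomes a conic.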
The Bayes decision boundary can be arbitrarily complex depending on the true data distribution. Two overlapping spirals with opposite class labels would have a spiral-shaped Bayes boundary. This is why flexible models like neural networks can outperform simple models—they can approximate arbitrarily complex decision boundaries.
Binary classification ($K = 2$) is both the most common case and the one that admits the simplest analysis. Let's denote the classes as 0 and 1, with posterior:
$$\eta(x) \triangleq P(Y = 1 | X = x)$$
Then $P(Y = 0 | X = x) = 1 - \eta(x)$, and the Bayes decision rule simplifies to:
$$h^*(x) = \begin{cases} 1 & \text{if } \eta(x) \geq 0.5 \\ 0 & \text{if } \eta(x) < 0.5 \end{cases}$$
Or equivalently:
$$h^*(x) = \mathbb{1}[\eta(x) \geq 0.5]$$
The decision boundary is the set where $\eta(x) = 0.5$, i.e., where both classes are equally likely.
The Threshold Perspective:
Binary classification can be viewed through the lens of thresholding a continuous score. The posterior $\eta(x)$ maps each point to $[0, 1]$, and we threshold at 0.5 to make the decision.
This perspective connects thresholding to score-based classification more broadly: any model that estimates the posterior $\eta(x)$, with logistic regression as the canonical example, can be turned into a classifier by thresholding its output.
The threshold 0.5 is optimal only for 0-1 loss. Different thresholds become optimal when misclassification costs are asymmetric (for example, when a false negative is far more costly than a false positive) or when the evaluation criterion is something other than raw accuracy.
Log-Odds Formulation:
Working with the log-odds (logit) is often more convenient:
$$\text{logit}(\eta(x)) = \log \frac{\eta(x)}{1 - \eta(x)} = \log \frac{P(Y = 1 | x)}{P(Y = 0 | x)}$$
Using Bayes' theorem:
$$\text{logit}(\eta(x)) = \log \frac{\pi_1 \cdot p_1(x)}{\pi_0 \cdot p_0(x)} = \underbrace{\log \frac{\pi_1}{\pi_0}}_{\text{log prior odds}} + \underbrace{\log \frac{p_1(x)}{p_0(x)}}_{\text{log-likelihood ratio}}$$
The decision boundary is where this equals zero:
$$\log \frac{p_1(x)}{p_0(x)} = \log \frac{\pi_0}{\pi_1}$$
This formulation is central to logistic regression and log-linear models.
The prior odds $\pi_1 / \pi_0$ shift the decision boundary. In the extreme case where class 1 is very rare ($\pi_1 \ll \pi_0$), the boundary shifts toward class 1, requiring stronger evidence (higher likelihood ratio) to classify as class 1. This is a built-in adjustment for class imbalance—rare classes need more evidence to trigger a positive classification.
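A minimal numeric sketch of that shift, using invented priors: with $\pi_1 = 0.01$, the likelihood ratio must exceed $\pi_0 / \pi_1 = 99$ before the Bayes rule outputs class 1.

```python
def predict_from_likelihood_ratio(likelihood_ratio: float, pi_1: float) -> int:
    """Binary Bayes rule in likelihood-ratio form:
    predict 1 iff p_1(x)/p_0(x) > pi_0/pi_1 (i.e. log LR > log prior odds)."""
    pi_0 = 1.0 - pi_1
    return int(likelihood_ratio > pi_0 / pi_1)

# Rare positive class: pi_1 = 0.01, so the evidence threshold is 0.99 / 0.01 = 99
for lr in [5.0, 50.0, 200.0]:
    print(lr, "->", predict_from_likelihood_ratio(lr, pi_1=0.01))
# Only the likelihood ratio of 200 is strong enough to yield class 1
```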
The Bayes decision rule can be expressed in several equivalent forms, each offering different insights and computational advantages.
Formulation 1: Maximum A Posteriori (MAP)
$$h^*(x) = \arg\max_k P(Y = k | X = x) = \arg\max_k \eta_k(x)$$
This is the most direct expression: choose the class with maximum posterior probability.
Formulation 2: Maximum Posterior-Proportional Score
$$h^*(x) = \arg\max_k \left[ \pi_k \cdot p_k(x) \right]$$
Since the denominator $p(x)$ is constant across classes, we can drop it from the optimization.
Formulation 3: Maximum Log-Score
$$h^*(x) = \arg\max_k \left[ \log \pi_k + \log p_k(x) \right]$$
Taking logs doesn't change the argmax (log is monotonic) and turns products into sums, which is numerically more stable.
```python
import numpy as np
from typing import Callable, List

def bayes_decision_rule_log_score(
    x: np.ndarray,
    priors: np.ndarray,           # Shape: (K,) - prior probabilities
    log_likelihoods: np.ndarray   # Shape: (K,) - log p(x | Y=k) for each class
) -> int:
    """
    Bayes decision rule using the log-score formulation.

    For numerical stability, we work in log-space:
    - Avoid underflow from multiplying small probabilities
    - Addition is cheaper than multiplication

    Parameters:
        x: Feature vector (for logging/debugging)
        priors: Prior probabilities for each class
        log_likelihoods: Log class-conditional probabilities for x

    Returns:
        Class label with maximum posterior probability
    """
    # Log-score: log(prior) + log(likelihood)
    log_scores = np.log(priors + 1e-300) + log_likelihoods

    # MAP decision: argmax of log-scores
    return int(np.argmax(log_scores))


def bayes_classifier(
    X: np.ndarray,                                # Shape: (n, d)
    priors: np.ndarray,                           # Shape: (K,)
    class_conditional_log_pdfs: List[Callable]    # List of K functions
) -> np.ndarray:
    """
    Apply the Bayes decision rule to a dataset.

    Parameters:
        X: Feature matrix
        priors: Prior probabilities
        class_conditional_log_pdfs: Functions that compute log p(x | Y=k)

    Returns:
        Array of predicted class labels
    """
    n = X.shape[0]
    predictions = np.zeros(n, dtype=int)

    for i in range(n):
        x = X[i]
        log_likelihoods = np.array([
            log_pdf(x) for log_pdf in class_conditional_log_pdfs
        ])
        predictions[i] = bayes_decision_rule_log_score(x, priors, log_likelihoods)

    return predictions


# Example: Gaussian class-conditionals
from scipy.stats import multivariate_normal

def make_gaussian_log_pdf(mean: np.ndarray, cov: np.ndarray):
    """Create a log-pdf function for a multivariate Gaussian."""
    dist = multivariate_normal(mean=mean, cov=cov)
    return lambda x: dist.logpdf(x)


# Example usage
if __name__ == "__main__":
    # Two Gaussian classes in 2D
    priors = np.array([0.4, 0.6])   # Class 1 more common
    mean_0 = np.array([0, 0])
    mean_1 = np.array([3, 0])
    cov = np.eye(2)                 # Shared covariance

    log_pdfs = [
        make_gaussian_log_pdf(mean_0, cov),
        make_gaussian_log_pdf(mean_1, cov)
    ]

    # Classify some test points
    test_points = np.array([
        [0, 0],     # Near class 0 mean
        [3, 0],     # Near class 1 mean
        [1.5, 0],   # Midpoint between the means
        [1.5, 2],   # Equidistant from both means, but priors differ
    ])

    predictions = bayes_classifier(test_points, priors, log_pdfs)
    print("Predictions:", predictions)
    # Output: [0, 1, 1, 1] (midpoint goes to class 1 due to higher prior)
```

Formulation 4: Minimum Expected Loss
For general loss functions $L(y, \hat{y})$, the optimal decision at $x$ is:
$$h^*(x) = \arg\min_k \mathbb{E}_{Y|X=x}[L(Y, k)] = \arg\min_k \sum_{j=1}^K L(j, k) \cdot \eta_j(x)$$
For 0-1 loss, $L(j, k) = \mathbb{1}[j \neq k]$, so this reduces to:
$$h^*(x) = \arg\min_k \sum_{j \neq k} \eta_j(x) = \arg\min_k (1 - \eta_k(x)) = \arg\max_k \eta_k(x)$$
This recovers our earlier result, but the formulation also generalizes to asymmetric costs and other loss functions.
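To illustrate the general formulation, here is a small sketch with an invented asymmetric cost matrix, showing how the minimum-expected-loss decision can differ from the plain argmax of the posterior:

```python
import numpy as np

# Hypothetical posteriors eta_j(x) for two classes [0, 1]
eta = np.array([0.7, 0.3])

# Cost matrix L[j, k] = loss of predicting k when the true class is j (illustrative values):
# predicting 0 when the truth is 1 costs 10x more than the reverse mistake
L = np.array([[0.0, 1.0],
              [10.0, 0.0]])

expected_loss = eta @ L   # expected_loss[k] = sum_j L[j, k] * eta_j(x)
print(expected_loss)                                          # [3.0, 0.7]
print("0-1 loss decision:", np.argmax(eta))                   # -> 0
print("Cost-sensitive decision:", np.argmin(expected_loss))   # -> 1
```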
There's a beautiful connection between the Bayes classifier and regression that provides additional insight.
The Regression Function:
In regression, we seek the function that minimizes mean squared error:
$$f^*(x) = \arg\min_f \mathbb{E}[(Y - f(X))^2 | X = x] = \mathbb{E}[Y | X = x]$$
The optimal regression function is the conditional expectation (conditional mean).
Binary Classification as Regression:
For binary classification with labels $\{0, 1\}$, the regression function is:
$$f^*(x) = \mathbb{E}[Y | X = x] = 1 \cdot P(Y = 1 | x) + 0 \cdot P(Y = 0 | x) = \eta(x)$$
The posterior probability is exactly the regression function! The Bayes classifier then thresholds this regression function at 0.5.
This connection explains why logistic regression is effective for classification. It directly models the regression function $\eta(x) = P(Y = 1 | X = x)$ using the logistic (sigmoid) function. By estimating the posterior, it implicitly learns the Bayes classifier's decision rule.
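The sketch below makes that connection concrete for the shared-covariance Gaussian case, where the posterior $\eta(x)$ is exactly a sigmoid of a linear function of $x$. The means, covariance, and priors are assumptions for illustration; this is a closed-form check of the functional form logistic regression uses, not a fitted model:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Assumed generative model: two Gaussian classes sharing one covariance
pi_0, pi_1 = 0.4, 0.6
mu_0, mu_1 = np.array([0.0, 0.0]), np.array([3.0, 0.0])
Sigma = np.eye(2)
Sigma_inv = np.linalg.inv(Sigma)

p_0 = multivariate_normal(mu_0, Sigma)
p_1 = multivariate_normal(mu_1, Sigma)

def eta_direct(x: np.ndarray) -> float:
    """eta(x) = pi_1 p_1(x) / (pi_0 p_0(x) + pi_1 p_1(x))."""
    num = pi_1 * p_1.pdf(x)
    return num / (pi_0 * p_0.pdf(x) + num)

# The same posterior as a sigmoid of a linear score w.x + b (logistic form)
w = Sigma_inv @ (mu_1 - mu_0)
b = np.log(pi_1 / pi_0) - 0.5 * (mu_1 @ Sigma_inv @ mu_1 - mu_0 @ Sigma_inv @ mu_0)

def eta_sigmoid(x: np.ndarray) -> float:
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

x = np.array([1.2, 0.7])
print(eta_direct(x), eta_sigmoid(x))   # the two computations agree
```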
Multi-Class Extension:
For multi-class problems, consider the vector-valued regression function:
$$\boldsymbol{\eta}(x) = (\eta_1(x), \eta_2(x), \ldots, \eta_K(x))^T$$
where $\eta_k(x) = P(Y = k | X = x)$. This vector lies on the probability simplex (entries non-negative, sum to 1).
The Bayes classifier takes the argmax of this vector. Methods like softmax regression directly model this vector-valued function.
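A brief sketch of the vector view, with made-up per-class log-scores: the softmax of $\log \pi_k + \log p_k(x)$ is exactly the posterior vector $\boldsymbol{\eta}(x)$, which lives on the probability simplex, and the Bayes prediction is its argmax.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical per-class log-scores log(pi_k) + log(p_k(x)) for K = 4 classes
log_scores = np.array([-3.2, -1.1, -2.0, -4.5])

eta = softmax(log_scores)    # posterior vector on the probability simplex
print(eta, eta.sum())                        # non-negative entries summing to 1
print("Bayes prediction:", np.argmax(eta))   # same argmax as the raw log-scores
```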
The Pointwise View:
This perspective emphasizes that classification is fundamentally about learning a function—the posterior probability function $\eta(x)$. Once we have $\eta(x)$, classification is trivial (just take argmax). The challenge of machine learning is estimating $\eta(x)$ accurately from finite data.
We've established the theoretical foundation for optimal classification through the Bayes decision rule. Let's consolidate the key insights:
- The Bayes decision rule assigns each point $x$ to the class with the highest posterior probability, $h^*(x) = \arg\max_k \eta_k(x)$.
- Equivalently, it maximizes $\pi_k \cdot p_k(x)$ (or its logarithm), since the marginal $p(x)$ is constant across classes.
- No classifier can achieve lower 0-1 risk, which makes the Bayes classifier the benchmark against which all practical methods are judged.
- Geometrically, the rule partitions feature space into decision regions whose boundaries are determined by the priors and class-conditional distributions.
What's Next:
The Bayes classifier is the theoretical gold standard, but it requires knowing the true posterior probabilities—which we never have in practice. In the next page, we'll explore the optimal classifier concept more deeply, including the Bayes error rate and the fundamental limits it imposes on any learning algorithm.
You now understand the Bayes decision rule and why it produces optimal classification decisions. This theoretical framework will guide our exploration of practical classifiers—each of which can be understood as attempting to approximate the Bayes classifier using limited data.