In the previous page, we derived the Bayes decision rule and claimed it produces an optimal classifier. But what exactly does "optimal" mean in machine learning? The term risks being vague or marketing-speak unless we pin it down precisely.
This page dives deep into the formal definition of optimality, exploring why the Bayes classifier achieves it, what assumptions are required, and how this concept guides the design of practical classifiers. We'll also see how optimal can coexist with imperfect—understanding that even the best possible classifier can still make errors.
By the end of this page, you will understand the precise mathematical meaning of classifier optimality, why the Bayes classifier is the unique minimizer of expected loss, the role of the loss function in defining optimality, and how these theoretical concepts inform practical algorithm design.
To make "optimality" precise, we need a mathematical framework for comparing classifiers. This is the statistical decision theory framework, which formalizes learning as risk minimization.
Core Components:
- A feature space $\mathcal{X}$ and a label space $\mathcal{Y}$.
- An unknown joint distribution $P$ over pairs $(X, Y)$.
- A loss function $L(y, \hat{y})$ giving the cost of predicting $\hat{y}$ when the true label is $y$.
- A classifier $h: \mathcal{X} \to \mathcal{Y}$, evaluated by its expected loss, called the risk.
The risk of classifier $h$ is defined as:
$$R(h) = \mathbb{E}_{(X,Y) \sim P}[L(Y, h(X))]$$
A classifier $h^*$ is optimal if it minimizes the risk over all possible classifiers:
$$R(h^*) = \inf_{h: \mathcal{X} \to \mathcal{Y}} R(h)$$
We write infimum (inf) rather than minimum (min) because the minimum might not be attained in some pathological cases. However, for 0-1 loss and reasonable distributions, the Bayes classifier does achieve the minimum. The distinction matters in more exotic settings.
The Unrestricted Hypothesis Space:
A crucial point is that we're optimizing over all measurable functions $h: \mathcal{X} \to \mathcal{Y}$. This is an infinite, uncountably large space. We're not restricting ourselves to linear classifiers, decision trees, or neural networks—we're considering every possible way to map features to labels.
The remarkable fact is that despite this vast search space, the optimal solution has a simple, explicit form: the Bayes classifier.
$$h^*(x) = \arg\max_k P(Y = k | X = x)$$
This is possible because the problem decomposes pointwise: the optimal prediction at each $x$ can be made independently of other points.
Why Pointwise Decomposition Works:
Let's revisit this crucial insight. The risk can be written as:
$$R(h) = \int_{\mathcal{X}} \mathbb{E}[L(Y, h(X)) | X = x] \cdot p(x) \, dx$$
For each fixed $x$, the integrand $\mathbb{E}[L(Y, h(x)) | X = x]$ depends only on the choice of $h(x)$ at that point. Since $p(x) \geq 0$, minimizing the overall integral is equivalent to minimizing the integrand at each $x$ separately.
This is a "separate optimization" result: the global minimum is achieved by optimizing locally everywhere. Such decomposition is rare in optimization and is a special property of the risk functional with additive loss.
The most natural loss function for classification is 0-1 loss (also called misclassification loss):
$$L_{0-1}(y, \hat{y}) = \mathbb{1}[y \neq \hat{y}] = \begin{cases} 0 & \text{if } y = \hat{y} \\ 1 & \text{if } y \neq \hat{y} \end{cases}$$
This loss treats all errors equally: every mistake costs 1, every correct prediction costs 0. The corresponding risk is simply the probability of error:
$$R_{0-1}(h) = P(h(X) \neq Y)$$
Minimizing this risk is equivalent to maximizing accuracy.
Deriving the Optimal Classifier:
For a fixed point $x$, the conditional risk is:
$$\mathbb{E}[L_{0-1}(Y, h(x)) | X = x] = P(Y \neq h(x) | X = x) = \sum_{k \neq h(x)} P(Y = k | X = x)$$
This is minimized when we choose $h(x)$ to be the class with the largest posterior:
$$h^*(x) = \arg\max_k P(Y = k | X = x)$$
Because: $$P(Y \neq h^*(x) | X = x) = 1 - \max_k P(Y = k | X = x) \leq 1 - P(Y = k | X = x) = P(Y \neq k | X = x)$$
for any other choice $k$. This proves the Bayes classifier minimizes 0-1 risk pointwise, hence globally.
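To make the pointwise argument concrete, here is a minimal numeric sketch; the posterior values at the fixed point $x$ are invented for illustration:

```python
import numpy as np

# A hypothetical posterior at one fixed point x, for K = 3 classes (made-up values)
posterior = np.array([0.5, 0.3, 0.2])   # P(Y = k | X = x); sums to 1

# Conditional 0-1 risk of predicting class k is P(Y != k | X = x) = 1 - posterior[k]
conditional_risk = 1 - posterior

# The MAP prediction (largest posterior) attains the smallest conditional risk
map_prediction = int(np.argmax(posterior))
print(map_prediction)        # 0
print(conditional_risk)      # [0.5 0.7 0.8]
assert map_prediction == int(np.argmin(conditional_risk))
```

Whatever the posterior values are, `argmax` of the posterior and `argmin` of the conditional risk always coincide, which is exactly the pointwise claim above.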
For 0-1 loss, the Bayes classifier $h^*(x) = \arg\max_k P(Y = k | X = x)$ is the unique optimal classifier (up to decisions on measure-zero sets where multiple classes tie for maximum posterior).
Uniqueness:
The Bayes classifier is essentially unique. More precisely: any risk minimizer must agree with $h^*$ wherever a single class strictly maximizes the posterior; minimizers can differ only on the set of points where two or more classes tie for the maximum. For continuous distributions, such ties occur with probability zero, so the Bayes classifier is unique almost everywhere.
While 0-1 loss is the most common, different applications call for different loss functions. The general framework still applies: the optimal classifier minimizes expected loss, but the form of the optimal decision changes.
The General Optimal Decision Rule:
For loss $L(y, \hat{y})$, the optimal prediction at $x$ is:
$$h^*(x) = \arg\min_k \sum_{j=1}^K L(j, k) \cdot P(Y = j | X = x) = \arg\min_k \sum_{j=1}^K L(j, k) \cdot \eta_j(x)$$
This is the prediction that minimizes expected loss given the observed features, where $\eta_j(x) = P(Y = j | X = x)$ denotes the posterior probability of class $j$.
| Loss Function | Formula | Optimal Decision | Use Case |
|---|---|---|---|
| 0-1 Loss | $L(y, \hat{y}) = \mathbb{1}[y \neq \hat{y}]$ | $\arg\max_k \eta_k(x)$ (MAP) | Accuracy maximization |
| Asymmetric Binary | $L(1, 0) = c_{FN}$, $L(0, 1) = c_{FP}$ | Classify 1 if $\eta(x) > \frac{c_{FP}}{c_{FP} + c_{FN}}$ | Medical diagnosis, fraud detection |
| Cost-Sensitive | $L(j, k) = C_{jk}$ (cost matrix) | $\arg\min_k \sum_j C_{jk} \eta_j(x)$ | Business decisions with costs |
| Reject Option | Allow 'abstain' with cost $d$ | Abstain if $\max_k \eta_k(x) < 1 - d$ | High-stakes decisions |
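The cost-sensitive rule in the table can be sketched in a few lines. The cost matrix and posterior below are made-up numbers, chosen so that the cost-sensitive decision differs from the MAP decision:

```python
import numpy as np

# Hypothetical 3-class cost matrix (made-up numbers): C[j, k] is the cost of
# predicting class k when the true class is j; correct predictions cost 0.
C = np.array([
    [0.0,  1.0, 3.0],
    [2.0,  0.0, 1.0],
    [10.0, 2.0, 0.0],   # mistaking true class 2 for class 0 is very costly
])

def optimal_decision(posterior: np.ndarray, cost: np.ndarray) -> int:
    """argmin_k sum_j cost[j, k] * posterior[j], the general rule above."""
    expected_cost = posterior @ cost  # entry k = sum_j posterior[j] * cost[j, k]
    return int(np.argmin(expected_cost))

posterior = np.array([0.45, 0.35, 0.20])
print(int(np.argmax(posterior)))       # 0: the MAP decision
print(optimal_decision(posterior, C))  # 1: the cost-sensitive decision differs
```

Here the MAP decision is class 0 (posterior 0.45), but the expected costs are 2.70, 0.85, and 1.70, so the cost-sensitive rule prefers class 1.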
Example: Asymmetric Binary Classification
Consider medical diagnosis where:
- A false negative (missing the disease) costs $c_{FN}$.
- A false positive (a false alarm) costs $c_{FP}$, typically much less than $c_{FN}$.
Writing $\eta(x) = P(Y = 1 | X = x)$ for the posterior probability of disease, the optimal decision rule becomes:
$$h^*(x) = \begin{cases} 1 \text{ (disease)} & \text{if } \eta(x) \cdot c_{FN} > (1 - \eta(x)) \cdot c_{FP} \\ 0 \text{ (healthy)} & \text{otherwise} \end{cases}$$
Simplifying: $$h^*(x) = 1 \iff \eta(x) > \frac{c_{FP}}{c_{FP} + c_{FN}} = \tau^*$$
With $c_{FN} = 10$ and $c_{FP} = 1$, we get $\tau^* = \frac{1}{11} \approx 0.09$. We classify as "disease" even when the posterior probability is just 10%, because missing the disease is so costly.
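A minimal sketch of this thresholding, using the costs from the example:

```python
# Threshold for the asymmetric loss above, with the costs from the example.
c_fn, c_fp = 10.0, 1.0          # cost of a false negative / false positive
tau = c_fp / (c_fp + c_fn)      # optimal threshold tau* = 1/11

def decide(eta: float) -> int:
    """1 = disease, 0 = healthy: compare the expected losses of the two actions."""
    return 1 if eta * c_fn > (1 - eta) * c_fp else 0

print(round(tau, 4))   # 0.0909
print(decide(0.10))    # 1: a 10% posterior already triggers "disease"
print(decide(0.05))    # 0: below the threshold, predict "healthy"
```

Note that `decide` compares expected losses directly; algebraically this is identical to checking `eta > tau`.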
Notice that the optimal classifier for asymmetric loss still uses the posterior $\eta(x)$—it just thresholds it differently. The posterior remains the fundamental quantity; loss functions only determine how to interpret it. This is why probabilistic classifiers are valuable: they provide the posterior, allowing flexible post-hoc decisions.
The Bayes classifier is sometimes called an oracle classifier because it requires knowledge we never have in practice: the true posterior probabilities $P(Y = k | X = x)$.
Why We Never Know the True Posterior:
The joint distribution $P(X, Y)$ is unknown. It's the very thing we're trying to learn about.
We only have samples, not the distribution itself. From $n$ data points, we must infer properties of the underlying distribution.
The feature space is continuous. For a specific $x$, there may be zero or one training example at exactly that location, making direct estimation impossible.
The posterior varies smoothly but complexly. Its shape depends on the (unknown) class-conditional distributions and priors.
The Fundamental Task of Machine Learning:
Viewed through this lens, supervised learning is the task of approximating the Bayes classifier from finite data. Every classification algorithm can be understood as proposing a specific method for estimating the posterior (or the decision boundary directly): logistic regression fits a parametric model of the posterior, k-nearest neighbors averages labels in a local neighborhood, and decision trees estimate it within learned partitions of the feature space.
The Bayes classifier tells us what optimal looks like, but it cannot be directly computed. It's a theoretical construct, like knowing that the shortest path exists between two cities without having a map. The value is in guiding and benchmarking practical algorithms.
The Approximation Gap:
For any practical classifier $\hat{h}$ built from data:
$$R(\hat{h}) = R^* + \underbrace{(R(\hat{h}) - R^*)}_{\text{excess risk}}$$
where $R^* = R(h^*)$ is the Bayes risk. The excess risk measures how far our practical classifier is from optimal. It has two sources: approximation error (the best classifier in our chosen model class may differ from the Bayes classifier) and estimation error (with finite data, we may not find the best classifier even within that class).
Understanding these errors is central to statistical learning theory.
In practice, we don't search over all possible classifiers. We restrict to a hypothesis class $\mathcal{H}$—a subset of all possible functions from $\mathcal{X}$ to $\mathcal{Y}$.
Examples:
- Linear classifiers: functions of the form $h(x) = \text{sign}(w^\top x + b)$.
- Decision trees of bounded depth.
- Neural networks with a fixed architecture.
Within a hypothesis class, we can define a restricted Bayes risk:
$$R^*_{\mathcal{H}} = \inf_{h \in \mathcal{H}} R(h)$$
A classifier that achieves this is optimal within $\mathcal{H}$, but may be far from the true Bayes classifier.
The Approximation-Estimation Trade-off:
The choice of hypothesis class creates a fundamental trade-off:
Large $\mathcal{H}$ (e.g., deep neural networks): low approximation error, because the class contains functions close to the Bayes classifier, but high estimation error, because selecting the best member from finite data is harder and overfitting is more likely.
Small $\mathcal{H}$ (e.g., logistic regression): low estimation error, because a few parameters can be estimated reliably from modest data, but potentially high approximation error if the Bayes decision boundary lies outside the class.
The optimal hypothesis class depends on the data distribution and sample size.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

def generate_nonlinear_data(n_samples: int = 1000, noise: float = 0.1):
    """
    Generate data where the Bayes-optimal boundary is circular.
    Linear classifiers will have approximation error.
    """
    np.random.seed(42)
    # Features uniformly in [-2, 2] x [-2, 2]
    X = np.random.uniform(-2, 2, (n_samples, 2))
    # True boundary: circle of radius 1
    # P(Y=1 | X=x) = sigmoid((1 - ||x||) / noise)
    # Points inside the circle are more likely class 1, outside class 0
    distances = np.sqrt(X[:, 0]**2 + X[:, 1]**2)
    probs = 1 / (1 + np.exp((distances - 1) / noise))
    y = (np.random.random(n_samples) < probs).astype(int)
    return X, y, probs

# Generate data
X, y, true_probs = generate_nonlinear_data(n_samples=2000)

# Compare different hypothesis classes
models = {
    "Logistic Regression (Linear)": LogisticRegression(),
    "Decision Tree (depth 3)": DecisionTreeClassifier(max_depth=3),
    "Decision Tree (depth 10)": DecisionTreeClassifier(max_depth=10),
    "Neural Network (small)": MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000),
    "Neural Network (large)": MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=1000),
}

print("Cross-validated Accuracy (approximation of 1 - Risk):")
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f"{name:35s}: {scores.mean():.4f} ± {scores.std():.4f}")

# Compute approximate Bayes accuracy (oracle that knows true probs)
bayes_predictions = (true_probs >= 0.5).astype(int)
bayes_accuracy = np.mean(bayes_predictions == y)
print(f"{'Bayes Classifier (oracle)':35s}: {bayes_accuracy:.4f}")

# Output (example):
# Logistic Regression (Linear)       : 0.7520 ± 0.0150
# Decision Tree (depth 3)            : 0.8210 ± 0.0120
# Decision Tree (depth 10)           : 0.8640 ± 0.0180
# Neural Network (small)             : 0.8550 ± 0.0140
# Neural Network (large)             : 0.8720 ± 0.0110
# Bayes Classifier (oracle)          : 0.8850
```

In this example, the true boundary is circular. Logistic regression (linear boundary) has high approximation error. More flexible models (deep trees, neural networks) can better approximate the circular boundary, approaching the Bayes accuracy. But even the oracle can't achieve 100% due to the inherent noise in labels.
A central question in learning theory: Can we approach the Bayes classifier as we collect more data? This leads to the concept of consistency.
Definition: Universal Consistency
A classification method is universally consistent if, for any data distribution $P$:
$$R(\hat{h}_n) \to R^* \text{ as } n \to \infty$$
where $\hat{h}_n$ is the classifier learned from $n$ training examples and $R^*$ is the Bayes risk.
Universal consistency means the method will eventually learn the optimal classifier given enough data, regardless of the (unknown) underlying distribution.
Which Methods Are Universally Consistent?
Some classical results:
K-NN with $k \to \infty$ and $k/n \to 0$: Universally consistent. As $n$ grows, we use more neighbors (reducing variance) but still a vanishing fraction (maintaining locality).
Decision trees with proper pruning: Universally consistent under certain growing schemes.
Neural networks with growing architecture: Can be universally consistent with proper regularization.
Fixed-capacity models: Generally NOT universally consistent. A linear classifier can never learn a nonlinear Bayes boundary, no matter how much data.
Naive Bayes: NOT universally consistent in general, because the conditional independence assumption may be wrong.
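As an illustrative (not rigorous) check of the k-NN condition, here is a toy 1-D simulation with a known posterior, so the Bayes risk can be computed exactly; the distribution and the choice $k = \sqrt{n}$ are for illustration only:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy setup: X ~ Uniform(0, 1), P(Y = 1 | x) = x.
# The Bayes rule predicts 1 iff x > 1/2, and the Bayes risk is E[min(x, 1-x)] = 0.25.
rng = np.random.default_rng(0)
grid = np.linspace(0, 1, 2001).reshape(-1, 1)

def knn_risk(n: int) -> float:
    """Train k-NN on n samples and evaluate its exact risk on a fine grid."""
    X = rng.uniform(0, 1, (n, 1))
    y = (rng.uniform(0, 1, n) < X[:, 0]).astype(int)
    k = max(1, int(np.sqrt(n)))   # k -> infinity while k/n -> 0
    pred = KNeighborsClassifier(n_neighbors=k).fit(X, y).predict(grid)
    # Exact conditional error at each grid point: 1 - x when predicting 1, else x
    errors = np.where(pred == 1, 1 - grid[:, 0], grid[:, 0])
    return float(errors.mean())

for n in [100, 1000, 10000]:
    print(n, round(knn_risk(n), 3))   # the risk drifts down toward 0.25
```

Because the risk is evaluated against the true conditional error rather than noisy test labels, the gap to 0.25 is exactly the excess risk of the trained classifier.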
Consistency is an asymptotic property—it says nothing about performance with finite data. A universally consistent method might need billions of samples to approach the Bayes risk. In practice, we care about finite-sample performance, which depends on the data distribution's complexity and our choice of hypothesis class.
Rates of Convergence:
Beyond consistency, we can ask how fast the excess risk decreases. For well-specified models and smooth distributions:
$$R(\hat{h}_n) - R^* = O(n^{-\alpha})$$
where $\alpha$ depends on the smoothness of the boundary and the method's complexity. Typical rates range from $O(n^{-1/2})$ (parametric) to $O(n^{-1/d})$ (nonparametric in $d$ dimensions, showing the curse of dimensionality).
These rates quantify the practical value of additional data and inform sample size requirements.
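A quick back-of-the-envelope consequence of these rates, under the idealized assumption that the excess risk scales exactly as $n^{-\alpha}$:

```python
# If R(h_n) - R* = c * n^(-alpha), then halving the excess risk
# requires multiplying the sample size by 2^(1/alpha).
for alpha, label in [(1/2, "parametric (alpha = 1/2)"),
                     (1/10, "nonparametric, d = 10 (alpha = 1/d)")]:
    factor = 2 ** (1 / alpha)
    print(f"{label}: {factor:.0f}x more samples to halve the excess risk")
# parametric: 4x; nonparametric with d = 10: 1024x (curse of dimensionality)
```

The same formula shows why nonparametric methods in high dimensions can need astronomically more data for the same improvement.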
Given that we can never compute the Bayes classifier, why study it? Its value lies in several practical applications:
1. Benchmarking and Lower Bounds
The Bayes error provides a lower bound on achievable error. If we know or can estimate the Bayes error for a problem, we can:
- Judge how close a trained model is to the best achievable performance.
- Set realistic accuracy targets before investing in more tuning or data collection.
- Recognize when remaining errors are irreducible rather than fixable.
2. Model Debugging
If a classifier's error greatly exceeds a reasonable estimate of the Bayes error, something is wrong—perhaps the features aren't informative, the model is too simple, or there's a bug in the training pipeline.
3. Feature Evaluation
Features that reduce the Bayes error are "worth collecting." The Bayes error with an additional feature set $X'$ is:
$$R^*(X, X') \leq R^*(X)$$
Equality holds when $X'$ provides no additional information about $Y$ given $X$. This guides feature engineering.
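As a concrete check of this inequality, here is a toy discrete distribution (all numbers invented for illustration) in which adding the extra feature $X'$ strictly lowers the Bayes error:

```python
import itertools

# Toy joint distribution (made-up numbers): Y in {0, 1} with P(Y = 1) = 0.5;
# X and X' are noisy copies of Y, conditionally independent given Y.
def p_joint(y: int, x: int, xp: int) -> float:
    p_x = 0.8 if x == y else 0.2    # X agrees with Y 80% of the time
    p_xp = 0.9 if xp == y else 0.1  # X' agrees with Y 90% of the time
    return 0.5 * p_x * p_xp

def bayes_error(use_xp: bool) -> float:
    """R* = sum over feature cells of p(cell) - max_y p(y, cell)."""
    err = 0.0
    for cell in itertools.product([0, 1], repeat=2 if use_xp else 1):
        per_y = []
        for y in [0, 1]:
            if use_xp:
                per_y.append(p_joint(y, cell[0], cell[1]))
            else:
                # X' unobserved: marginalize it out
                per_y.append(sum(p_joint(y, cell[0], xp) for xp in [0, 1]))
        err += sum(per_y) - max(per_y)  # mass the MAP class fails to capture
    return err

print(round(bayes_error(use_xp=False), 6))  # 0.2: Bayes error with X alone
print(round(bayes_error(use_xp=True), 6))   # 0.1: adding X' strictly helps here
```

With less informative choices of $X'$, the two errors can coincide, which is the equality case described above.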
4. Algorithm Design Principles
The Bayes classifier suggests design principles:
- Estimate posterior probabilities rather than only hard labels, since the posterior supports any downstream loss function.
- Separate probability estimation from decision-making, so thresholds can be adjusted as costs change.
- Calibrate predicted probabilities so that thresholding them behaves as the theory prescribes.
5. Theoretical Analysis
Many theoretical results (generalization bounds, convergence rates) are stated in terms of excess risk over the Bayes risk. The Bayes classifier provides the reference point.
We've explored the deep meaning of optimality in classification and the Bayes classifier's role as the theoretical gold standard. Let's consolidate:
- Optimality means minimizing expected loss (risk) over all measurable classifiers, not just over a restricted model family.
- For 0-1 loss, the Bayes classifier $h^*(x) = \arg\max_k P(Y = k | X = x)$ is the essentially unique risk minimizer.
- Other loss functions change the decision rule, but the posterior remains the fundamental quantity.
- The Bayes classifier is an oracle: practical learning approximates it from finite data, and the excess risk measures the gap.
What's Next:
The optimal classifier achieves the best possible accuracy, but even it makes errors when classes overlap in feature space. In the next page, we'll explore the Bayes error rate—the irreducible error that no classifier can surpass—and understand what it tells us about the fundamental difficulty of a classification problem.
You now understand what makes the Bayes classifier optimal: it minimizes expected loss over all possible classifiers by maximizing posterior probabilities. This theoretical ideal guides the design and evaluation of practical machine learning algorithms.