Even the best possible classifier makes mistakes. This counterintuitive fact lies at the heart of probabilistic classification: when classes overlap in feature space, some errors are unavoidable. No algorithm, no amount of data, and no amount of computational power can reduce the error below a certain threshold.
This threshold is the Bayes error rate—the error achieved by the Bayes classifier. Understanding this fundamental limit is crucial for setting realistic expectations, diagnosing model performance, and knowing when to seek better features versus better algorithms.
By the end of this page, you will understand the mathematical definition of the Bayes error rate, visualize it geometrically as overlap in class distributions, explore methods to estimate it from data, and appreciate its implications for machine learning practice and the value of feature engineering.
The Bayes error rate (also called Bayes risk or Bayes optimal error) is the expected error of the Bayes classifier:
$$R^* = R(h^*) = \mathbb{E}_{(X,Y)}[\mathbb{1}[h^*(X) \neq Y]] = P(h^*(X) \neq Y)$$
where $h^*(x) = \arg\max_k P(Y = k | X = x)$ is the Bayes classifier.
Alternatively, using the posterior probabilities:
$$R^* = \mathbb{E}_X\left[1 - \max_k P(Y = k \mid X)\right] = \mathbb{E}_X\left[\min_k P(Y \neq k \mid X)\right]$$
Key Properties of the Bayes Error Rate:
1. It's Irreducible: no classifier, however sophisticated and however much data it sees, can achieve expected error below $R^*$.
2. It's Non-Negative: $R^* \geq 0$, with equality only when the classes are perfectly separable.
3. It's Bounded Above: $R^* \leq 1 - 1/K$ for $K$ classes, the error rate of uniform random guessing.
4. It Depends Only on the Data Distribution: $R^*$ is a property of $P(X, Y)$ and the chosen features, not of any algorithm or sample size.
Why can't we achieve zero error? Because the same feature vector $x$ can arise from different classes. Medical symptoms don't uniquely determine diseases; customer features don't uniquely predict purchases. This inherent ambiguity is captured by the posterior probability being less than 1.
The Bayes Error Formula (Binary Case):
For binary classification with posterior $\eta(x) = P(Y = 1 | X = x)$:
$$R^* = \mathbb{E}_X[\min(\eta(X), 1 - \eta(X))]$$
This formula shows that the error at each point is the probability of the minority class. When $\eta(x) = 0.7$, the Bayes classifier predicts class 1, but is wrong 30% of the time for such points. The Bayes error is the average of these pointwise error rates.
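To make the formula concrete, here is a minimal Monte Carlo sketch under an assumed synthetic setup (two unit-variance Gaussians with equal priors; the specific numbers are illustrative, not taken from the discussion above): sample $X$ from its marginal, compute the posterior $\eta(x)$, and average the pointwise error $\min(\eta(x), 1 - \eta(x))$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Assumed setup: X | Y=0 ~ N(0, 1), X | Y=1 ~ N(2, 1), equal priors.
mu0, mu1, sigma = 0.0, 2.0, 1.0

# Sample X from its marginal distribution: draw a class, then draw X from that class.
y = rng.integers(0, 2, size=200_000)
x = rng.normal(np.where(y == 0, mu0, mu1), sigma)

# Posterior eta(x) = P(Y=1 | X=x) via Bayes' theorem (equal priors cancel).
p0, p1 = norm.pdf(x, mu0, sigma), norm.pdf(x, mu1, sigma)
eta = p1 / (p0 + p1)

# Bayes error: expected pointwise probability of the minority class.
print(f"Monte Carlo R*: {np.mean(np.minimum(eta, 1 - eta)):.4f}")  # ~0.159 for this setup
```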
The Bayes error has a beautiful geometric interpretation in terms of class distribution overlap.
Class-Conditional Perspective:
Using class-conditional densities and Bayes' theorem:
$$P(Y = k | X = x) = \frac{\pi_k \cdot p_k(x)}{\sum_j \pi_j \cdot p_j(x)}$$
The Bayes error occurs where the "wrong" class has significant probability. For binary classification:
$$R^* = \int_{\mathcal{X}} \min(\pi_0 p_0(x), \pi_1 p_1(x)) \, dx$$
This is the probability mass in the overlap region where one class's weighted density intrudes on the other's.
Visualizing the Overlap:
Imagine two Gaussian distributions in 1D with equal priors, say $p_0 = \mathcal{N}(0, 1)$ for class 0 and $p_1 = \mathcal{N}(3, 1)$ for class 1.
The weighted densities are $0.5 \cdot p_0(x)$ and $0.5 \cdot p_1(x)$. The decision boundary is at $x = 1.5$ (the midpoint of the means). The Bayes error is the area under the lower of the two weighted curves, i.e., the probability mass assigned to the "wrong" class.
As the means move closer, overlap increases, and Bayes error rises. As they separate, overlap decreases, and Bayes error falls. When distributions are identical, Bayes error is 0.5 (random guessing).
| Mean Separation ($d$) | Bayes Error | Interpretation |
|---|---|---|
| $d = 0$ | 0.500 | Identical distributions, random guessing |
| $d = 1\sigma$ | 0.309 | Considerable overlap |
| $d = 2\sigma$ | 0.159 | Moderate overlap |
| $d = 3\sigma$ | 0.067 | Small overlap |
| $d = 4\sigma$ | 0.023 | Minimal overlap |
| $d \to \infty$ | 0.000 | Perfect separation |
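As a sanity check, the overlap integral above can be evaluated numerically. Here is a minimal sketch under the same assumptions as the table (equal priors, unit-variance Gaussians); it reproduces the tabulated values.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def overlap_bayes_error(d, sigma=1.0, pi0=0.5):
    """Numerically integrate min(pi0*p0(x), pi1*p1(x)) for 1D Gaussians with means 0 and d."""
    pi1 = 1 - pi0
    integrand = lambda x: min(pi0 * norm.pdf(x, 0, sigma), pi1 * norm.pdf(x, d, sigma))
    # The integrand has a kink at the decision boundary (midpoint for equal priors);
    # passing it as a breakpoint helps the quadrature.
    value, _ = quad(integrand, -10 * sigma, d + 10 * sigma, points=[d / 2])
    return value

for d in [0, 1, 2, 3, 4]:
    print(f"Separation {d}σ: R* ≈ {overlap_bayes_error(d):.3f}")
# Expected: 0.500, 0.309, 0.159, 0.067, 0.023
```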
Multi-Class Geometry:
For $K$ classes, the feature space is partitioned into $K$ decision regions. The Bayes error is the sum of probability masses where each class intrudes on another's region:
$$R^* = \sum_{k=1}^K \pi_k \int_{\bigcup_{j \neq k} \mathcal{R}_j} p_k(x) \, dx$$
This is the total "misplaced" probability—class $k$ points that fall in region $j$'s territory.
Adding informative features reduces class overlap in the expanded feature space. A feature that perfectly separates classes would make the overlap (and Bayes error) zero. This is why feature engineering is so powerful—it attacks the irreducible error itself, not just the approximation.
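A quick numerical illustration of this point, using the shared-covariance Gaussian closed form derived in the next section (the specific means are a hypothetical example): adding a second, informative feature increases the Mahalanobis separation and shrinks the Bayes error.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical two-class problem, identity covariance, equal priors.
# Feature 1 alone: class means differ by 1   -> Mahalanobis distance Delta = 1.
# Adding feature 2 whose class means differ by 2 -> Delta = sqrt(1**2 + 2**2).
delta_1d = 1.0
delta_2d = np.sqrt(1.0**2 + 2.0**2)

print(f"Bayes error, feature 1 only : {norm.cdf(-delta_1d / 2):.3f}")  # ~0.309
print(f"Bayes error, both features  : {norm.cdf(-delta_2d / 2):.3f}")  # ~0.132
```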
In special cases, the Bayes error can be computed exactly. These closed-form solutions provide valuable benchmarks and intuition.
Case 1: Univariate Gaussians with Equal Variance
For binary classification with class-conditional densities $\mathcal{N}(\mu_0, \sigma^2)$ and $\mathcal{N}(\mu_1, \sigma^2)$ (shared variance $\sigma^2$) and priors $\pi_0$ and $\pi_1$:
The decision boundary is at: $$x^* = \frac{\mu_0 + \mu_1}{2} + \frac{\sigma^2}{\mu_1 - \mu_0} \log \frac{\pi_0}{\pi_1}$$
The Bayes error (assuming $\mu_0 < \mu_1$) is: $$R^* = \pi_0 \cdot \left(1 - \Phi\left(\frac{x^* - \mu_0}{\sigma}\right)\right) + \pi_1 \cdot \Phi\left(\frac{x^* - \mu_1}{\sigma}\right)$$
where $\Phi$ is the standard normal CDF.
Case 2: Equal Priors Simplification
With equal priors ($\pi_0 = \pi_1 = 0.5$) and equal variances, the boundary is the midpoint $x^* = (\mu_0 + \mu_1)/2$, and:
$$R^* = \Phi\left(-\frac{|\mu_1 - \mu_0|}{2\sigma}\right) = \Phi\left(-\frac{d}{2}\right)$$
where $d = |\mu_1 - \mu_0|/\sigma$ is the standardized distance between means.
Case 3: Multivariate Gaussians (LDA Setting)
For $d$-dimensional Gaussians with shared covariance $\Sigma$: $$R^* = \Phi\left(-\frac{\Delta}{2}\right)$$
where $\Delta = \sqrt{(\mu_1 - \mu_0)^T \Sigma^{-1} (\mu_1 - \mu_0)}$ is the Mahalanobis distance.
This formula shows that Bayes error depends on the statistical distance between classes, not just Euclidean distance.
```python
import numpy as np
from scipy.stats import norm
from scipy.linalg import inv


def bayes_error_1d_gaussian(
    mu0: float,
    mu1: float,
    sigma: float,
    pi0: float = 0.5
) -> float:
    """
    Compute the exact Bayes error for 1D Gaussian classes with equal variance.

    Parameters:
        mu0, mu1: Class means
        sigma: Shared standard deviation
        pi0: Prior probability of class 0 (pi1 = 1 - pi0)

    Returns:
        Bayes error rate
    """
    pi1 = 1 - pi0

    if mu1 == mu0:
        # Identical class-conditionals: the feature carries no information,
        # so the Bayes classifier always predicts the majority class.
        return min(pi0, pi1)

    # Decision boundary
    x_star = (mu0 + mu1) / 2 + (sigma**2 / (mu1 - mu0)) * np.log(pi0 / pi1)

    # If mu1 > mu0: class 0 is predicted for x < x_star, class 1 for x > x_star.
    #   Error for class 0: P(X > x_star | Y=0) = 1 - Phi((x_star - mu0)/sigma)
    #   Error for class 1: P(X < x_star | Y=1) = Phi((x_star - mu1)/sigma)
    z0 = (x_star - mu0) / sigma
    z1 = (x_star - mu1) / sigma

    if mu1 > mu0:
        return pi0 * (1 - norm.cdf(z0)) + pi1 * norm.cdf(z1)
    else:
        return pi0 * norm.cdf(z0) + pi1 * (1 - norm.cdf(z1))


def bayes_error_gaussian_equal_cov(
    mu0: np.ndarray,
    mu1: np.ndarray,
    Sigma: np.ndarray,
    pi0: float = 0.5
) -> float:
    """
    Compute the exact Bayes error for multivariate Gaussians with shared covariance.

    Parameters:
        mu0, mu1: Class mean vectors (d,)
        Sigma: Shared covariance matrix (d, d)
        pi0: Prior probability of class 0

    Returns:
        Bayes error rate
    """
    # Mahalanobis distance between the class means
    diff = mu1 - mu0
    delta = np.sqrt(diff.T @ inv(Sigma) @ diff)

    if abs(pi0 - 0.5) < 1e-10:
        # Equal priors: R* = Phi(-Delta / 2)
        return float(norm.cdf(-delta / 2))
    else:
        # Unequal priors: projecting onto the discriminant direction reduces the problem
        # to 1D with means 0 and delta; the boundary shifts by the log prior ratio.
        pi1 = 1 - pi0
        t = delta / 2 + np.log(pi0 / pi1) / delta
        return float(pi0 * (1 - norm.cdf(t)) + pi1 * norm.cdf(t - delta))


# Examples
print("=== Bayes Error Examples ===\n")

# 1D case with different separations
for d in [0, 1, 2, 3, 4]:
    error = bayes_error_1d_gaussian(0, d, 1.0, 0.5)
    print(f"1D Gaussians (separation = {d}σ): R* = {error:.4f}")

print()

# 2D case
mu0_2d = np.array([0.0, 0.0])
mu1_2d = np.array([2.0, 0.0])
Sigma_2d = np.eye(2)
error_2d = bayes_error_gaussian_equal_cov(mu0_2d, mu1_2d, Sigma_2d)
print(f"2D Gaussians (Mahalanobis = 2): R* = {error_2d:.4f}")

# Effect of correlation
Sigma_corr = np.array([[1, 0.8], [0.8, 1]])
error_corr = bayes_error_gaussian_equal_cov(mu0_2d, mu1_2d, Sigma_corr)
print(f"2D Correlated (ρ=0.8): R* = {error_corr:.4f}")
```

When analytic computation isn't possible, we can derive bounds on the Bayes error using information-theoretic and distribution-based arguments.
Upper Bounds:
1. Prior-Based Bound $$R^* \leq 1 - \max_k \pi_k$$ (for binary classification this equals $\min_k \pi_k$)
A classifier that always predicts the most common class achieves this error. The Bayes classifier must do at least as well.
2. Random Guessing Bound $$R^* \leq 1 - \frac{1}{K}$$
Uniform random guessing achieves this for $K$ classes. For binary: $R^* \leq 0.5$.
3. Total Variation Bound For binary classification with equal priors: $$R^* \leq \frac{1 - \text{TV}(P_0, P_1)}{2}$$ where $\text{TV}(P_0, P_1) = \frac{1}{2}\int |p_0(x) - p_1(x)| \, dx$ is the total variation distance. With equal priors this holds with equality.
Lower Bounds:
1. Entropy-Based Bound $$R^* \geq H_b^{-1}\left(H(Y|X)\right)$$
where $H(Y|X)$ is the conditional entropy and $H_b^{-1}$ is the inverse binary entropy function. This bound connects Bayes error to information theory.
2. Hellinger Distance Bound For equal priors: $$R^* \geq \frac{1 - \sqrt{1 - \left(1 - H(P_0, P_1)^2\right)^2}}{2}$$
where $H(P_0, P_1) = \sqrt{1 - \int \sqrt{p_0(x) p_1(x)}\, dx}$ is the Hellinger distance; the quantity $1 - H^2 = \int \sqrt{p_0(x) p_1(x)}\, dx$ is the Bhattacharyya coefficient $BC$.
3. KL Divergence Bound (Bretagnolle-Huber) For equal priors: $$R^* \geq \frac{1}{4} \exp\left(-D_{KL}(P_0 \| P_1)\right)$$
where $D_{KL}$ is the Kullback-Leibler divergence.
| Distance Measure | Bound Type | Formula | Tightness |
|---|---|---|---|
| Total Variation | Upper | $R^* \leq (1 - \text{TV})/2$ | Exact for equal priors |
| Hellinger | Lower | $R^* \geq (1 - \sqrt{1 - BC^2})/2$ | Often loose |
| KL Divergence | Lower | $R^* \geq e^{-D_{KL}}/4$ | Typically loose |
| Bhattacharyya | Upper | $R^* \leq \sqrt{\pi_0 \pi_1} \cdot BC$ | Commonly used in practice |
These bounds can be loose, especially for complex distributions. They're most useful for: (1) quick feasibility assessments, (2) comparing problems without full computation, and (3) theoretical analysis. For precise benchmarking, estimation methods are preferred.
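To see how tight these bounds are in one case, here is a minimal sketch assuming two unit-variance Gaussians two standard deviations apart with equal priors (an illustrative choice); the distances are computed by numerical integration and compared to the exact Bayes error $\Phi(-1) \approx 0.159$.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Assumed example: P0 = N(0, 1), P1 = N(2, 1), equal priors. Exact R* = Phi(-1).
p0 = lambda x: norm.pdf(x, 0, 1)
p1 = lambda x: norm.pdf(x, 2, 1)

tv, _ = quad(lambda x: 0.5 * abs(p0(x) - p1(x)), -10, 12)       # total variation distance
bc, _ = quad(lambda x: np.sqrt(p0(x) * p1(x)), -10, 12)         # Bhattacharyya coefficient
kl, _ = quad(lambda x: p0(x) * np.log(p0(x) / p1(x)), -10, 12)  # KL(P0 || P1)

print(f"Exact Bayes error         : {norm.cdf(-1):.4f}")
print(f"TV bound (exact here)     : {(1 - tv) / 2:.4f}")
print(f"Bhattacharyya upper bound : {0.5 * bc:.4f}")
print(f"Hellinger/BC lower bound  : {(1 - np.sqrt(1 - bc**2)) / 2:.4f}")
print(f"Bretagnolle-Huber lower   : {0.25 * np.exp(-kl):.4f}")
```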
In practice, we don't know the true distribution, so we must estimate the Bayes error from data. This is challenging because we're trying to estimate the performance of an unknown optimal classifier.
Method 1: Plug-In Estimation
The plug-in approach estimates the posterior probabilities directly, e.g., by fitting class-conditional density estimates $\hat{p}_k$ and priors $\hat{\pi}_k$, and then averages the pointwise error: $$\hat{R}^*_{\text{plug-in}} = \frac{1}{n} \sum_{i=1}^n \left(1 - \max_k \hat{P}(Y = k \mid X = x_i)\right)$$ Its accuracy depends heavily on the quality of the density estimation.
Method 2: K-NN Based Estimation
The K-NN classifier provides a natural Bayes error estimator. With appropriate $k$:
$$\hat{R}^*_{\text{NN}} = \frac{1}{n} \sum_{i=1}^n \left(1 - \frac{1}{k} \sum_{j \in N_k(x_i)} \mathbb{1}[y_j = \hat{y}_i]\right)$$
where $N_k(x_i)$ is the set of $k$ nearest neighbors of $x_i$ and $\hat{y}_i$ is the K-NN (majority-vote) prediction at $x_i$.
Key result (Cover & Hart, 1967): for 1-NN, $$R^* \leq R_{\text{1-NN}} \leq 2R^*(1 - R^*) \leq 2R^*$$
The 1-NN error is at most twice the Bayes error. As $n \to \infty$ and $k/n \to 0$ with $k \to \infty$, the K-NN error converges to $R^*$.
```python
import numpy as np
from sklearn.neighbors import KernelDensity, KNeighborsClassifier
from sklearn.model_selection import cross_val_score


def estimate_bayes_error_knn(X, y, k_values=(1, 3, 5, 10, 20)):
    """
    Estimate the Bayes error using K-NN classifiers.

    Asymptotically, R* <= R_1NN <= 2 R* (1 - R*), and the K-NN error
    approaches R* as k grows (with k/n -> 0).

    Returns a dict mapping k -> estimated error.
    """
    estimates = {}
    for k in k_values:
        knn = KNeighborsClassifier(n_neighbors=k)
        # Cross-validation avoids the optimism of evaluating on the training data
        scores = cross_val_score(knn, X, y, cv=5, scoring='accuracy')
        estimates[k] = 1 - np.mean(scores)
    return estimates


def estimate_bayes_error_plugin(X, y, bandwidth=0.5):
    """
    Plug-in Bayes error estimation using kernel density estimation.

    1. Estimate class-conditional densities and priors
    2. Compute the posterior at each point
    3. Estimate R* as the average of 1 - max_k eta_k(x)
    """
    classes = np.unique(y)
    n = len(y)

    # Estimate priors and class-conditional densities
    priors = {c: np.mean(y == c) for c in classes}
    kdes = {}
    for c in classes:
        kde = KernelDensity(bandwidth=bandwidth)  # fixed bandwidth; should be tuned in practice
        kde.fit(X[y == c])
        kdes[c] = kde

    # Compute the maximum posterior at each point (in log space for numerical stability)
    posterior_max = np.zeros(n)
    for i in range(n):
        x_i = X[i:i + 1]  # keep 2D shape for score_samples
        log_joint = {c: np.log(priors[c]) + kdes[c].score_samples(x_i)[0] for c in classes}
        max_log = max(log_joint.values())
        # Normalize via log-sum-exp
        log_normalizer = max_log + np.log(sum(np.exp(log_joint[c] - max_log) for c in classes))
        posterior_max[i] = np.exp(max_log - log_normalizer)

    # Bayes error estimate: E[1 - max_k posterior]
    return np.mean(1 - posterior_max)


# Example: estimate the Bayes error for a known distribution
np.random.seed(42)
n_samples = 500

# True Bayes error for these Gaussians is ~0.159 (2σ separation)
mu0, mu1 = np.array([0, 0]), np.array([2, 0])
cov = np.eye(2)

X0 = np.random.multivariate_normal(mu0, cov, n_samples // 2)
X1 = np.random.multivariate_normal(mu1, cov, n_samples // 2)
X = np.vstack([X0, X1])
y = np.array([0] * (n_samples // 2) + [1] * (n_samples // 2))

# Shuffle
shuffle_idx = np.random.permutation(len(y))
X, y = X[shuffle_idx], y[shuffle_idx]

print("K-NN Bayes Error Estimates:")
for k, error in estimate_bayes_error_knn(X, y).items():
    print(f"  k={k:2d}: estimated error = {error:.4f}")

print(f"\nPlug-in estimate: {estimate_bayes_error_plugin(X, y):.4f}")
print("True Bayes error: ~0.159")
```

Method 3: Cross-Validation with Flexible Classifiers
If we use a very flexible classifier (e.g., large random forest, well-tuned neural network) with proper regularization:
$$\hat{R}^* \approx R_{\text{CV}}(\text{flexible classifier})$$
The test error of a highly flexible, properly regularized classifier serves as an upper bound on the Bayes error. This is a practical approach used in deep learning.
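As an illustration, here is a minimal sketch of this idea on synthetic data with a known Bayes error of about 0.159 (the random forest and its settings are illustrative choices, not prescribed by the method); the cross-validated error should land somewhat above that value and can be read as a rough upper bound on $R^*$.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)

# Synthetic two-class Gaussian data, 2σ separation in the first feature (R* ≈ 0.159).
n = 2000
y = rng.integers(0, 2, size=n)
X = rng.normal(0.0, 1.0, size=(n, 2))
X[:, 0] += 2.0 * y  # shift the first feature for class 1

rf = RandomForestClassifier(n_estimators=300, random_state=0)
cv_error = 1 - cross_val_score(rf, X, y, cv=5, scoring="accuracy").mean()
print(f"Cross-validated error of a flexible classifier (≈ upper bound on R*): {cv_error:.3f}")
```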
Method 4: Human Performance Baseline
For tasks where human labels define ground truth, human error rate provides an estimate:
$$R_{\text{human}} \approx R^* + \text{human-specific errors}$$
If humans achieve 95% accuracy on image classification, the Bayes error is at most 5%.
Understanding the Bayes error has profound implications for how we approach machine learning problems.
1. Setting Realistic Expectations
If the Bayes error is 20%, no amount of algorithmic sophistication will achieve 1% error. Knowing the approximate Bayes error helps set realistic targets and allocate resources wisely.
2. Features vs. Algorithms
The Bayes error is determined by the features, not the algorithm. This insight redirects effort: if your model's error sits well above the (estimated) Bayes error, better algorithms, tuning, or more data can close the gap; if it is already close, only new or more informative features can push it lower.
Feature engineering attacks the problem at its root, reducing the irreducible error itself.
3. Data Quality Assessment
Bayes error reveals data quality issues: an unexpectedly high estimate often points to label noise, ambiguous class definitions, or missing discriminative features rather than a weak model.
In many practical applications, improving features yields larger gains than improving algorithms. A simple model with excellent features often beats a complex model with poor features. The Bayes error framework explains why: good features reduce the irreducible error, while better algorithms only reduce the reducible error.
Multi-class classification ($K > 2$) introduces additional complexity to Bayes error analysis.
General Formula:
$$R^* = \mathbb{E}_X\left[1 - \max_{k=1,\ldots,K} P(Y = k \mid X)\right] = 1 - \mathbb{E}_X\left[\max_k \eta_k(X)\right]$$
The error at each point is $1$ minus the largest class probability.
Baseline Comparisons: uniform random guessing errs at rate $(K-1)/K$ and always predicting the most common class errs at $1 - \max_k \pi_k$; the Bayes error can never exceed either baseline.
Confusion Patterns:
In multi-class problems, Bayes error often concentrates at specific class boundaries: a few similar or easily confused class pairs typically account for most of the irreducible error, while other pairs are almost never mistaken for one another.
This motivates hierarchical classification and confusion analysis.
Per-Class Bayes Error:
We can define class-specific error rates: $$R^*_k = P(h^*(X) \neq k \mid Y = k) = \mathbb{E}\left[1 - \mathbb{1}[\arg\max_j \eta_j(X) = k] \mid Y = k\right]$$
This helps identify which classes are inherently hard to classify.
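As an illustration, here is a Monte Carlo sketch of per-class Bayes error for a hypothetical three-class 1D Gaussian problem (equal priors, unit variance, means 0, 1.5, and 4, chosen so that classes 0 and 1 overlap much more than classes 1 and 2).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
means = np.array([0.0, 1.5, 4.0])  # hypothetical class means; unit variance, equal priors
K, n = len(means), 300_000

# Sample (X, Y) from the assumed generative model.
y = rng.integers(0, K, size=n)
x = rng.normal(means[y], 1.0)

# Bayes classifier: argmax_k posterior, which with equal priors is argmax_k p_k(x).
densities = norm.pdf(x[:, None], means[None, :], 1.0)  # shape (n, K)
y_bayes = densities.argmax(axis=1)

print(f"Overall Bayes error (Monte Carlo): {np.mean(y_bayes != y):.3f}")
for k in range(K):
    mask = y == k
    print(f"  R*_{k} = P(h*(X) != {k} | Y = {k}) ≈ {np.mean(y_bayes[mask] != k):.3f}")
```

In this setup the middle class carries most of the irreducible error because it is squeezed between the other two, which is exactly the kind of pattern per-class rates reveal.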
| Property | $K = 2$ | $K > 2$ |
|---|---|---|
| Random guessing error | 0.5 | $(K-1)/K$ |
| Minimum possible | 0 | 0 (if separable) |
| Error formula | $\mathbb{E}[\min(\eta, 1-\eta)]$ | $\mathbb{E}[1 - \max_k \eta_k]$ |
| Confusion structure | Single boundary | $(K-1)K/2$ pairwise boundaries |
| Estimation difficulty | Moderate | Increases with $K$ |
We've explored the Bayes error rate as the fundamental limit on classification accuracy. Let's consolidate the key insights:
- The Bayes error $R^* = \mathbb{E}_X[1 - \max_k P(Y = k \mid X)]$ is the expected error of the Bayes classifier and depends only on the data distribution and the features.
- Geometrically, it is the probability mass in the region where the classes' weighted densities overlap.
- Closed forms exist for Gaussian classes (e.g., $R^* = \Phi(-\Delta/2)$ for shared covariance and equal priors); otherwise we rely on bounds or data-driven estimates (plug-in, K-NN, flexible classifiers, human baselines).
- Better features reduce the irreducible error itself; better algorithms only close the gap to it.
What's Next:
We've established the theoretical framework: the Bayes classifier maximizes posterior probabilities to achieve the Bayes error rate. The fundamental challenge is that computing posteriors requires the unknown distribution. In the next page, we'll explore posterior computation—how to calculate and approximate posteriors in practice, setting the stage for practical generative classifiers like Naive Bayes.
You now understand the Bayes error rate as the fundamental limit on classification performance. This knowledge provides crucial context for evaluating classifiers and making strategic decisions about where to invest effort in machine learning projects.