Gaussian Naive Bayes is elegant in its assumptions: each feature follows a Gaussian distribution within each class. But where do the Gaussians come from? How do we determine the mean $\mu_{jk}$ and variance $\sigma^2_{jk}$ for feature $j$ in class $k$?
The answer lies in parameter estimation—the statistical process of inferring distribution parameters from observed data. Given a training set with labeled examples, we estimate the Gaussian parameters that best explain the observed feature values within each class.
This process is remarkably simple for Gaussians: the maximum likelihood estimates are the sample mean and sample variance. Yet beneath this simplicity lies deep statistical theory that illuminates when these estimates are reliable, when they might fail, and how to handle edge cases.
This page develops the complete theory of parameter estimation for Gaussian Naive Bayes, from first principles to practical implementation.
By the end of this page, you will understand: (1) the principle of maximum likelihood estimation (MLE), (2) derivation of Gaussian MLE for mean and variance, (3) bias correction in variance estimation, (4) handling edge cases like zero variance, (5) variance smoothing and regularization, and (6) complete implementation of Gaussian NB training.
Maximum likelihood estimation (MLE) is the foundational approach to parameter estimation in statistics. The principle is elegantly simple: choose parameters that maximize the probability (or density) of the observed data.
Given: a parametric probability density $f(x | \theta)$ with unknown parameter $\theta$, and observed data $\mathcal{D} = \{x_1, x_2, \ldots, x_n\}$ drawn from it.
The likelihood function is the probability of observing the data as a function of the parameter: $$\mathcal{L}(\theta | \mathcal{D}) = \prod_{i=1}^{n} f(x_i | \theta)$$
Assuming observations are independent and identically distributed (i.i.d.).
The MLE is the value of $\theta$ that maximizes the likelihood: $$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \mathcal{L}(\theta | \mathcal{D})$$
Since $\log$ is monotonically increasing, maximizing likelihood is equivalent to maximizing log-likelihood: $$\ell(\theta | \mathcal{D}) = \log \mathcal{L}(\theta | \mathcal{D}) = \sum_{i=1}^{n} \log f(x_i | \theta)$$
This is computationally preferable (sums instead of products) and analytically convenient (derivatives are simpler).
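To make the computational point concrete, here is a small sketch (with illustrative numbers) showing that the raw likelihood of even a moderately sized sample underflows double-precision floats, while the log-likelihood remains perfectly representable:

```python
import numpy as np
from scipy import stats

# 1000 standard-normal observations: each density is below 0.4, so the
# product of 1000 of them is around 10^-500 and underflows float64 to 0.0.
# The log-likelihood, a sum of log-densities, stays finite.
rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=1000)

densities = stats.norm(0.0, 1.0).pdf(data)
likelihood = np.prod(densities)                        # underflows to 0.0
log_likelihood = np.sum(stats.norm(0.0, 1.0).logpdf(data))  # finite

print(f"Likelihood:     {likelihood}")
print(f"Log-likelihood: {log_likelihood:.2f}")
```

This is why every implementation in this page works in log space.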
MLE asks: "Which parameter value(s) would have made the observed data most likely to occur?"
If we observe many values clustered around 50, the MLE for the mean will be near 50. If values are spread out, the MLE for variance will be large. MLE formalizes this intuition precisely.
MLEs have desirable asymptotic properties: they are consistent (converge to the true value with more data), asymptotically efficient (attain the Cramér-Rao lower bound in the large-sample limit), and asymptotically normal (follow a Gaussian distribution for large samples). These properties justify the widespread use of MLE in machine learning.
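Consistency is easy to check empirically. A minimal sketch (sample sizes and seed are illustrative): the sample mean, which is the Gaussian MLE for $\mu$, drifts toward the true mean as $n$ grows.

```python
import numpy as np

# Empirical check of MLE consistency: as the sample size grows,
# the sample mean (the MLE for μ) converges to the true mean.
rng = np.random.default_rng(42)
true_mu = 5.0

errors = {}
for n in (10, 1_000, 100_000):
    sample = rng.normal(true_mu, 2.0, size=n)
    errors[n] = abs(sample.mean() - true_mu)
    print(f"n = {n:>6}: |mu_hat - mu| = {errors[n]:.4f}")
```

The standard error of the mean is $\sigma / \sqrt{n}$, so each 100-fold increase in $n$ cuts the typical error by a factor of 10.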
```python
import numpy as np
from scipy import stats

def demonstrate_mle_intuition():
    """Show how MLE finds the parameters that maximize data likelihood."""
    print("=" * 60)
    print("MAXIMUM LIKELIHOOD ESTIMATION INTUITION")
    print("=" * 60)

    # Generate data from a known distribution
    np.random.seed(42)
    true_mu = 5.0
    true_sigma = 2.0
    n = 50
    data = np.random.normal(true_mu, true_sigma, n)

    print(f"Generated {n} samples from N({true_mu}, {true_sigma}²)")
    print(f"Sample mean: {data.mean():.4f}")
    print(f"Sample std:  {data.std():.4f}")

    # Compute likelihood for different mu values (fixing sigma at true value)
    print("--- Likelihood as a function of μ (σ fixed at true value) ---")
    mu_range = np.linspace(3, 7, 41)
    log_likelihoods = []
    for mu in mu_range:
        # Log-likelihood = sum of log(PDF) for each observation
        log_lik = np.sum(stats.norm(mu, true_sigma).logpdf(data))
        log_likelihoods.append(log_lik)
    log_likelihoods = np.array(log_likelihoods)

    best_mu = mu_range[np.argmax(log_likelihoods)]
    print("Log-likelihood evaluated at different μ values:")
    for i in range(0, len(mu_range), 8):
        print(f"  μ = {mu_range[i]:.1f}: log L = {log_likelihoods[i]:.2f}")
    print(f"MLE for μ: {best_mu:.2f}")
    print(f"Sample mean: {data.mean():.4f}")
    print(f"True μ: {true_mu}")
    print("→ MLE equals the sample mean (for known σ)")

    # Now compute likelihood over both μ and σ
    print("--- Likelihood as a function of both μ and σ ---")
    mu_grid = np.linspace(4, 6, 21)
    sigma_grid = np.linspace(1, 3, 21)
    log_lik_grid = np.zeros((len(sigma_grid), len(mu_grid)))
    for i, sigma in enumerate(sigma_grid):
        for j, mu in enumerate(mu_grid):
            log_lik_grid[i, j] = np.sum(stats.norm(mu, sigma).logpdf(data))

    # Find the maximum on the grid
    max_idx = np.unravel_index(np.argmax(log_lik_grid), log_lik_grid.shape)
    mle_sigma = sigma_grid[max_idx[0]]
    mle_mu = mu_grid[max_idx[1]]
    print(f"MLE for (μ, σ): ({mle_mu:.2f}, {mle_sigma:.2f})")
    print(f"Sample statistics: ({data.mean():.2f}, {data.std():.2f})")
    print(f"True parameters: ({true_mu}, {true_sigma})")

    # Verify analytically
    print("--- Analytical MLE formulas ---")
    mle_mu_analytical = data.mean()
    mle_var_analytical = data.var()  # MLE variance (biased, n in denominator)
    mle_sigma_analytical = np.sqrt(mle_var_analytical)
    print(f"MLE μ  = sample mean = {mle_mu_analytical:.4f}")
    print(f"MLE σ² = (1/n)Σ(x_i - x̄)² = {mle_var_analytical:.4f}")
    print(f"MLE σ  = {mle_sigma_analytical:.4f}")

    # Compare to the unbiased variance estimator
    unbiased_var = data.var(ddof=1)
    print(f"Unbiased variance (1/(n-1))Σ(x_i - x̄)² = {unbiased_var:.4f}")
    print(f"True σ² = {true_sigma**2}")

demonstrate_mle_intuition()
```

Let us derive the maximum likelihood estimators for the Gaussian distribution parameters rigorously.
Given i.i.d. observations $x_1, x_2, \ldots, x_n$ from $\mathcal{N}(\mu, \sigma^2)$:
$$f(x_i | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$
The log-likelihood is: $$\ell(\mu, \sigma^2) = \sum_{i=1}^{n} \log f(x_i | \mu, \sigma^2)$$
$$= \sum_{i=1}^{n} \left[ -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x_i - \mu)^2}{2\sigma^2} \right]$$
$$= -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$$
Taking the partial derivative and setting to zero: $$\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0$$
$$\sum_{i=1}^{n}x_i - n\mu = 0$$
$$\hat{\mu}_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^{n}x_i = \bar{x}$$
The MLE for the mean is the sample mean.
Taking the partial derivative: $$\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^{n}(x_i - \mu)^2 = 0$$
$$\frac{n}{2\sigma^2} = \frac{1}{2(\sigma^2)^2}\sum_{i=1}^{n}(x_i - \mu)^2$$
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$$
Substituting $\hat{\mu}_{\text{MLE}} = \bar{x}$: $$\hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
The MLE for variance is the sample variance with $n$ in the denominator.
The MLE for variance $\hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum(x_i - \bar{x})^2$ is biased: $E[\hat{\sigma}^2_{\text{MLE}}] = \frac{n-1}{n}\sigma^2$. It systematically underestimates the true variance. The unbiased estimator uses $n-1$: $s^2 = \frac{1}{n-1}\sum(x_i - \bar{x})^2$. For large $n$, the difference is negligible, but for small samples it matters.
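The bias factor $\frac{n-1}{n}$ can be verified by simulation. A short sketch (sample size, variance, and trial count are illustrative): averaging the MLE variance over many repeated small samples recovers $\frac{n-1}{n}\sigma^2$, not $\sigma^2$.

```python
import numpy as np

# Empirical check of the variance bias: with n = 5 samples from N(0, σ² = 4),
# the MLE variance (n in the denominator) averages to (n-1)/n * σ² = 3.2,
# while the ddof=1 estimator averages to the true σ² = 4.
rng = np.random.default_rng(0)
n, sigma2, trials = 5, 4.0, 200_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
mle_var = samples.var(axis=1).mean()                 # biased: ≈ 3.2
unbiased_var = samples.var(axis=1, ddof=1).mean()    # unbiased: ≈ 4.0

print(f"E[MLE variance]      ≈ {mle_var:.3f}  (theory: {(n - 1) / n * sigma2:.1f})")
print(f"E[unbiased variance] ≈ {unbiased_var:.3f}  (theory: {sigma2:.1f})")
```

With $n = 5$ the underestimate is a full 20%; at $n = 1000$ it would be 0.1%, which is why the distinction rarely matters for large classes.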
To confirm these are maxima, we check the second derivatives (Hessian matrix):
$$\frac{\partial^2 \ell}{\partial \mu^2} = -\frac{n}{\sigma^2} < 0$$
$$\frac{\partial^2 \ell}{\partial (\sigma^2)^2} = \frac{n}{2(\sigma^2)^2} - \frac{1}{(\sigma^2)^3}\sum_{i=1}^{n}(x_i - \mu)^2$$
At the MLE, $\sum_{i=1}^{n}(x_i - \hat{\mu})^2 = n\hat{\sigma}^2$, so this second derivative equals $-\frac{n}{2\hat{\sigma}^4} < 0$, confirming a maximum (provided $\hat{\sigma}^2 > 0$, i.e., the observations are not all identical).
| Parameter | MLE Formula | Interpretation |
|---|---|---|
| Mean $\mu$ | $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n}x_i = \bar{x}$ | Sample average |
| Variance $\sigma^2$ | $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$ | Sample variance (biased) |
| Property | Mean ($\hat{\mu}$) | Variance ($\hat{\sigma}^2$) |
|---|---|---|
| Unbiased? | Yes: $E[\hat{\mu}] = \mu$ | No: $E[\hat{\sigma}^2] = \frac{n-1}{n}\sigma^2$ |
| Consistent? | Yes | Yes |
| Variance | $\text{Var}[\hat{\mu}] = \frac{\sigma^2}{n}$ | $\text{Var}[\hat{\sigma}^2] = \frac{2\sigma^4}{n}$ (approx) |
| Efficient? | Yes (achieves Cramér-Rao bound) | Asymptotically |
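The entry $\text{Var}[\hat{\mu}] = \frac{\sigma^2}{n}$ from the table above can also be checked by simulation. A minimal sketch (sizes and seed are illustrative): the variance of the sample mean across many repetitions matches $\sigma^2 / n$.

```python
import numpy as np

# Empirical check of Var[μ̂] = σ²/n: the sample mean of n = 25 draws from
# N(0, σ² = 9) should itself have variance 9/25 = 0.36 across repetitions.
rng = np.random.default_rng(1)
n, sigma2, trials = 25, 9.0, 200_000

means = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n)).mean(axis=1)
print(f"Empirical Var[mu_hat] = {means.var():.4f}  (theory: {sigma2 / n:.2f})")
```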
In Gaussian Naive Bayes, we estimate separate Gaussian parameters for each feature within each class. This section details the complete estimation procedure.
For a classification problem with:

- $K$ classes
- $d$ continuous features
- $n$ labeled training samples
We estimate:

- Class priors $\pi_k$ ($K - 1$ free parameters, since they sum to 1)
- A mean $\mu_{jk}$ for every feature-class pair ($K \cdot d$ parameters)
- A variance $\sigma^2_{jk}$ for every feature-class pair ($K \cdot d$ parameters)
Total: $(K-1) + K \cdot d + K \cdot d = 2Kd + K - 1$ parameters
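The parameter count is easy to encode as a helper (a hypothetical function for illustration, not part of any library):

```python
def gaussian_nb_param_count(K: int, d: int) -> int:
    """Free parameters in Gaussian NB: (K-1) priors + K*d means + K*d variances."""
    return (K - 1) + K * d + K * d  # = 2*K*d + K - 1

# Example: 3 classes, 4 features (like Iris) -> 2*3*4 + 3 - 1 = 26
print(gaussian_nb_param_count(3, 4))  # 26
```

Compare this to a full multivariate Gaussian per class, which would need $K \cdot d(d+3)/2$ likelihood parameters; the naive independence assumption is what keeps the count linear in $d$.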
The MLE for class priors is the class frequency: $$\hat{\pi}_k = \frac{n_k}{n}$$
Where $n_k$ is the number of training samples in class $k$ and $n$ is the total sample count.
$$\hat{\mu}_{jk} = \frac{1}{n_k} \sum_{i: y_i = k} x_{ij}$$
The mean of feature $j$ over all training samples belonging to class $k$.
$$\hat{\sigma}^2_{jk} = \frac{1}{n_k} \sum_{i: y_i = k} (x_{ij} - \hat{\mu}_{jk})^2$$
The variance of feature $j$ over all training samples belonging to class $k$.
```python
import numpy as np

def estimate_gaussian_nb_parameters(X, y):
    """
    Estimate all Gaussian Naive Bayes parameters from training data.

    Parameters
    ----------
    X : np.ndarray of shape (n_samples, n_features)
        Training feature matrix
    y : np.ndarray of shape (n_samples,)
        Training class labels

    Returns
    -------
    params : dict
        Dictionary containing:
        - priors: array of shape (n_classes,)
        - means: array of shape (n_classes, n_features)
        - variances: array of shape (n_classes, n_features)
        - classes: array of unique class labels
    """
    n_samples, n_features = X.shape
    classes = np.unique(y)
    n_classes = len(classes)

    # Initialize parameter arrays
    priors = np.zeros(n_classes)
    means = np.zeros((n_classes, n_features))
    variances = np.zeros((n_classes, n_features))

    print("=" * 60)
    print("GAUSSIAN NAIVE BAYES PARAMETER ESTIMATION")
    print("=" * 60)
    print(f"Dataset: {n_samples} samples, {n_features} features, {n_classes} classes")

    for k, cls in enumerate(classes):
        # Get samples for this class
        class_mask = (y == cls)
        X_class = X[class_mask]
        n_class = len(X_class)

        print(f"--- Class {cls} ---")
        print(f"Number of samples: {n_class}")

        # Prior probability
        priors[k] = n_class / n_samples
        print(f"Prior π_{cls} = {n_class}/{n_samples} = {priors[k]:.4f}")

        # Mean and variance for each feature
        means[k, :] = X_class.mean(axis=0)
        variances[k, :] = X_class.var(axis=0)  # MLE (n in denominator)

        print("Feature statistics:")
        for j in range(min(n_features, 5)):  # Show first 5 features
            print(f"  Feature {j}: μ = {means[k, j]:.4f}, σ² = {variances[k, j]:.4f}")
        if n_features > 5:
            print(f"  ... ({n_features - 5} more features)")

    return {
        'priors': priors,
        'means': means,
        'variances': variances,
        'classes': classes
    }

def verify_estimation_correctness():
    """Verify our estimation matches sklearn's GaussianNB."""
    from sklearn.naive_bayes import GaussianNB

    # Generate synthetic data
    np.random.seed(42)
    n_per_class = 100
    n_features = 4

    X_0 = np.random.randn(n_per_class, n_features) * 2 + 0    # Class 0: centered at origin
    X_1 = np.random.randn(n_per_class, n_features) * 1.5 + 3  # Class 1: shifted
    X_2 = np.random.randn(n_per_class, n_features) * 0.5 - 1  # Class 2: different variance

    X = np.vstack([X_0, X_1, X_2])
    y = np.array([0]*n_per_class + [1]*n_per_class + [2]*n_per_class)

    # Shuffle
    shuffle_idx = np.random.permutation(len(y))
    X, y = X[shuffle_idx], y[shuffle_idx]

    # Our estimation
    print("\n" + "=" * 60)
    print("VERIFICATION: COMPARING TO SKLEARN")
    print("=" * 60)
    params = estimate_gaussian_nb_parameters(X, y)

    # sklearn estimation
    gnb = GaussianNB()
    gnb.fit(X, y)

    print("--- Comparison with sklearn ---")
    print("Priors:")
    print(f"  Ours:    {params['priors']}")
    print(f"  sklearn: {gnb.class_prior_}")
    print(f"  Match: {np.allclose(params['priors'], gnb.class_prior_)}")

    print("Means (first 2 features shown):")
    print(f"  Ours:    {params['means'][:, :2]}")
    print(f"  sklearn: {gnb.theta_[:, :2]}")
    print(f"  Match: {np.allclose(params['means'], gnb.theta_)}")

    print("Variances (first 2 features shown):")
    print(f"  Ours:    {params['variances'][:, :2]}")
    print(f"  sklearn: {gnb.var_[:, :2]}")
    print(f"  Match: {np.allclose(params['variances'], gnb.var_)}")

verify_estimation_correctness()
```

For large datasets, use numpy's vectorized operations: `X[y == k].mean(axis=0)` and `X[y == k].var(axis=0)` compute all feature statistics for a class in one operation. This is much faster than looping over features.
Real-world data presents challenges that require careful handling. Several edge cases can cause problems in Gaussian Naive Bayes parameter estimation.
If a feature has the same value for all samples in a class, its variance estimate is zero: $$\hat{\sigma}^2_{jk} = 0$$
Problem: Division by zero when computing log-likelihood: $$\log f(x_j | k) = -\frac{(x_j - \mu_{jk})^2}{2\sigma^2_{jk}} - \frac{1}{2}\log(2\pi\sigma^2_{jk})$$
Both terms are undefined when $\sigma^2_{jk} = 0$: the first divides by zero and the second takes $\log(0)$.
Solutions:

- Variance smoothing: add a small constant $\epsilon$ to every variance (the approach used in the implementation below and in sklearn).
- Enforce a minimum variance floor, e.g. `np.maximum(variances, eps)`.
- Drop features that are constant within the training data during preprocessing.
With few samples per class, variance estimates are unreliable (high variance of the variance estimate!).
Solutions:

- Pool variance estimates across classes (assume equal variance per feature).
- Shrink class-specific variances toward a pooled estimate.
- Place a prior on the variance and use a Bayesian estimate.
If a class has no training examples, we cannot estimate its parameters.
Solutions:

- Check class coverage before training and fail fast with a clear error.
- Use stratified sampling when splitting data so every class appears in the training set.
- If the class must be kept, its parameters have to come from domain knowledge; they cannot be estimated from zero samples.
```python
import numpy as np

class RobustGaussianNB:
    """Gaussian Naive Bayes with robust handling of edge cases."""

    def __init__(self, var_smoothing=1e-9, min_samples_per_class=1):
        """
        Parameters
        ----------
        var_smoothing : float
            Minimum variance to prevent division by zero. Added to all variances.
        min_samples_per_class : int
            Minimum samples required per class (for warning).
        """
        self.var_smoothing = var_smoothing
        self.min_samples_per_class = min_samples_per_class
        self.classes_ = None
        self.priors_ = None
        self.means_ = None
        self.variances_ = None

    def fit(self, X, y):
        """Fit Gaussian NB with edge case handling."""
        n_samples, n_features = X.shape
        self.classes_ = np.unique(y)
        n_classes = len(self.classes_)

        self.priors_ = np.zeros(n_classes)
        self.means_ = np.zeros((n_classes, n_features))
        self.variances_ = np.zeros((n_classes, n_features))

        print("=" * 60)
        print("ROBUST GAUSSIAN NB: EDGE CASE HANDLING")
        print("=" * 60)

        for k, cls in enumerate(self.classes_):
            class_mask = (y == cls)
            X_class = X[class_mask]
            n_class = len(X_class)

            # Edge case 1: Too few samples
            if n_class < self.min_samples_per_class:
                print(f"⚠ WARNING: Class {cls} has only {n_class} samples!")

            # Prior
            self.priors_[k] = n_class / n_samples

            # Mean (works even with 1 sample)
            self.means_[k, :] = X_class.mean(axis=0)

            # Variance
            raw_variance = X_class.var(axis=0)

            # Edge case 2: Zero variance detection
            zero_var_features = np.sum(raw_variance == 0)
            if zero_var_features > 0:
                print(f"⚠ Class {cls}: {zero_var_features} features have zero variance")
                print(f"  These will be smoothed with ε = {self.var_smoothing}")

            # Apply variance smoothing
            self.variances_[k, :] = raw_variance + self.var_smoothing

        print("✓ Training complete with variance smoothing applied")
        return self

    def _log_likelihood(self, X, class_idx):
        """Compute log-likelihood for samples under a class."""
        means = self.means_[class_idx]
        variances = self.variances_[class_idx]
        # Log PDF of Gaussian: -0.5 * [(x-μ)²/σ² + log(2πσ²)]
        log_pdf = -0.5 * (((X - means) ** 2) / variances + np.log(2 * np.pi * variances))
        return log_pdf.sum(axis=1)

    def predict_log_proba(self, X):
        """Compute log posterior probabilities."""
        n_samples = X.shape[0]
        n_classes = len(self.classes_)
        log_proba = np.zeros((n_samples, n_classes))
        for k in range(n_classes):
            log_proba[:, k] = np.log(self.priors_[k]) + self._log_likelihood(X, k)
        # Normalize with the log-sum-exp trick
        max_log = log_proba.max(axis=1, keepdims=True)
        log_sum = np.log(np.sum(np.exp(log_proba - max_log), axis=1, keepdims=True))
        return log_proba - max_log - log_sum

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_log_proba(X), axis=1)]

def demonstrate_edge_cases():
    """Show edge case handling in action."""
    np.random.seed(42)

    # Create a problematic dataset
    print("--- Creating dataset with edge cases ---")
    X_0 = np.random.randn(50, 5)        # Class 0: normal data
    X_1 = np.random.randn(50, 5)
    X_1[:, 2] = 3.0                     # Class 1: feature 2 is constant (zero variance)
    X_2 = np.random.randn(3, 5) + 5     # Class 2: very few samples

    X = np.vstack([X_0, X_1, X_2])
    y = np.array([0]*50 + [1]*50 + [2]*3)

    print("Class 0: 50 samples, all features variable")
    print("Class 1: 50 samples, feature 2 is constant (zero variance)")
    print("Class 2: 3 samples (small sample size)")

    # Train the robust model
    model = RobustGaussianNB(var_smoothing=1e-9)
    model.fit(X, y)

    print("--- Estimated variances (Class 1, zero-variance feature) ---")
    print("Feature 2 raw variance: 0 (constant)")
    print(f"Feature 2 smoothed variance: {model.variances_[1, 2]:.2e}")

    # Test prediction
    X_test = np.array([
        [0, 0, 3, 0, 0],   # Matches class 1's constant feature
        [5, 5, 5, 5, 5],   # Near class 2
    ])
    predictions = model.predict(X_test)
    log_proba = model.predict_log_proba(X_test)

    print("--- Test predictions ---")
    for i, (pred, lp) in enumerate(zip(predictions, log_proba)):
        print(f"Sample {i}: Predicted class {pred}")
        print(f"  Log-probabilities: {lp}")

demonstrate_edge_cases()
```

sklearn's GaussianNB uses var_smoothing=1e-9 by default: this fraction of the largest feature variance is added to all variances, preventing numerical issues while having negligible effect on normal variances. The smoothing effectively places a small lower bound on feature variability.
Beyond simple smoothing, several variance regularization strategies can improve Gaussian Naive Bayes performance, especially with limited data.
Instead of separate variances per class, share a single variance per feature across all classes:
$$\hat{\sigma}^2_j = \frac{1}{n} \sum_{k=1}^{K} \sum_{i: y_i = k} (x_{ij} - \hat{\mu}_{jk})^2$$
This is equivalent to assuming homoscedasticity (equal variance across classes).
Advantages:
Disadvantages:
Blend class-specific and pooled variances:
$$\tilde{\sigma}^2_{jk} = (1-\lambda) \hat{\sigma}^2_{jk} + \lambda \hat{\sigma}^2_j$$
Where $\lambda \in [0, 1]$ controls the shrinkage strength.
Use an inverse-gamma prior on variance (conjugate prior for Gaussian variance):
$$\sigma^2 \sim \text{Inverse-Gamma}(\alpha_0, \beta_0)$$
The posterior mean is: $$\hat{\sigma}^2_{\text{Bayes}} = \frac{\beta_0 + \frac{1}{2}\sum_i (x_i - \bar{x})^2}{\alpha_0 + \frac{n}{2} - 1}$$
This naturally shrinks small-sample variance estimates toward the prior.
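A minimal sketch of this estimator (the prior hyperparameters $\alpha_0 = 3$, $\beta_0 = 2$ are illustrative assumptions, not recommended values). The posterior mean is a convex combination of the prior mean $\beta_0 / (\alpha_0 - 1)$ and the sample variance, so with few samples it sits between the two, and with many samples it tracks the MLE:

```python
import numpy as np

def bayes_variance(x: np.ndarray, alpha0: float = 3.0, beta0: float = 2.0) -> float:
    """Posterior mean of σ² under an Inverse-Gamma(α₀, β₀) prior (sketch)."""
    n = len(x)
    ss = np.sum((x - x.mean()) ** 2)
    return (beta0 + 0.5 * ss) / (alpha0 + n / 2 - 1)

rng = np.random.default_rng(0)
small = rng.normal(0, 1, size=3)        # tiny sample: strong shrinkage
large = rng.normal(0, 1, size=10_000)   # big sample: prior washes out

print(f"Prior mean of σ²: {2.0 / (3.0 - 1):.3f}")
print(f"n=3:     MLE = {small.var():.3f}, Bayes = {bayes_variance(small):.3f}")
print(f"n=10000: MLE = {large.var():.3f}, Bayes = {bayes_variance(large):.3f}")
```

Algebraically, $\hat{\sigma}^2_{\text{Bayes}} = \frac{(\alpha_0 - 1)\,\hat{\sigma}^2_{\text{prior}} + \frac{n}{2}\,\hat{\sigma}^2_{\text{MLE}}}{(\alpha_0 - 1) + \frac{n}{2}}$, which makes the shrinkage weights explicit.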
| Strategy | Parameters | Best For | Decision Boundary |
|---|---|---|---|
| Class-specific variance | $K \times d$ | Large data, different class spreads | Quadratic (curved) |
| Pooled variance | $d$ | Small data, similar class spreads | Linear |
| Shrinkage | $K \times d + 1$ (λ) | Moderate data, uncertain structure | Between linear and quadratic |
| Bayesian | $K \times d + 2$ (α, β) | Prior knowledge available | Regularized quadratic |
```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class RegularizedGaussianNB(BaseEstimator, ClassifierMixin):
    """Gaussian Naive Bayes with configurable variance regularization."""

    def __init__(self, variance_estimator='class_specific', shrinkage=0.5):
        """
        Parameters
        ----------
        variance_estimator : str
            One of: 'class_specific', 'pooled', 'shrinkage'
        shrinkage : float
            Shrinkage parameter λ for the 'shrinkage' estimator.
            0 = class-specific, 1 = fully pooled.
        """
        self.variance_estimator = variance_estimator
        self.shrinkage = shrinkage

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.classes_ = np.unique(y)
        n_classes = len(self.classes_)

        self.priors_ = np.zeros(n_classes)
        self.means_ = np.zeros((n_classes, n_features))
        class_variances = np.zeros((n_classes, n_features))

        # First pass: compute class-specific means and variances
        for k, cls in enumerate(self.classes_):
            X_class = X[y == cls]
            self.priors_[k] = len(X_class) / n_samples
            self.means_[k] = X_class.mean(axis=0)
            class_variances[k] = X_class.var(axis=0)

        # Pooled variance: squared deviations from each class mean, over all samples
        pooled_variance = np.zeros(n_features)
        for k, cls in enumerate(self.classes_):
            X_class = X[y == cls]
            pooled_variance += np.sum((X_class - self.means_[k])**2, axis=0)
        pooled_variance /= n_samples

        # Apply the chosen regularization strategy
        if self.variance_estimator == 'class_specific':
            self.variances_ = class_variances
        elif self.variance_estimator == 'pooled':
            self.variances_ = np.tile(pooled_variance, (n_classes, 1))
        elif self.variance_estimator == 'shrinkage':
            lam = self.shrinkage
            self.variances_ = (1 - lam) * class_variances + lam * pooled_variance
        else:
            raise ValueError(f"Unknown variance estimator: {self.variance_estimator}")

        # Add small smoothing (floor the variances)
        self.variances_ = np.maximum(self.variances_, 1e-9)
        return self

    def _log_likelihood(self, X):
        """Compute log-likelihood for all classes."""
        n_samples = X.shape[0]
        n_classes = len(self.classes_)
        log_lik = np.zeros((n_samples, n_classes))
        for k in range(n_classes):
            diff = X - self.means_[k]
            log_lik[:, k] = -0.5 * np.sum(
                diff**2 / self.variances_[k] + np.log(2 * np.pi * self.variances_[k]),
                axis=1
            )
        return log_lik

    def predict_proba(self, X):
        log_posterior = self._log_likelihood(X) + np.log(self.priors_)
        # Log-sum-exp for numerical stability
        max_log = log_posterior.max(axis=1, keepdims=True)
        log_sum = max_log + np.log(np.sum(np.exp(log_posterior - max_log),
                                          axis=1, keepdims=True))
        return np.exp(log_posterior - log_sum)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]

def compare_variance_strategies():
    """Compare different variance estimation strategies."""
    from sklearn.model_selection import cross_val_score
    from sklearn.datasets import make_classification

    print("=" * 60)
    print("VARIANCE REGULARIZATION COMPARISON")
    print("=" * 60)

    # Generate dataset
    np.random.seed(42)
    X, y = make_classification(
        n_samples=200, n_features=10, n_informative=5, n_redundant=2,
        n_classes=3, n_clusters_per_class=1, random_state=42
    )
    print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features, 3 classes")

    # Compare strategies
    strategies = [
        ('Class-specific', RegularizedGaussianNB(variance_estimator='class_specific')),
        ('Pooled', RegularizedGaussianNB(variance_estimator='pooled')),
        ('Shrinkage (λ=0.3)', RegularizedGaussianNB(variance_estimator='shrinkage', shrinkage=0.3)),
        ('Shrinkage (λ=0.5)', RegularizedGaussianNB(variance_estimator='shrinkage', shrinkage=0.5)),
        ('Shrinkage (λ=0.7)', RegularizedGaussianNB(variance_estimator='shrinkage', shrinkage=0.7)),
    ]

    print("5-Fold Cross-Validation Accuracy:")
    print("-" * 40)
    for name, model in strategies:
        scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
        print(f"{name:25s}: {scores.mean():.4f} ± {scores.std():.4f}")

compare_variance_strategies()
```

Let us consolidate everything into a complete, production-ready training algorithm for Gaussian Naive Bayes.
Input:
Output:
Procedure:
1. Identify unique classes: classes = unique(y)
2. For each class k in classes:
a. Select samples: X_k = X[y == k]
b. Compute prior: π_k = |X_k| / n
c. Compute mean: μ_k = mean(X_k, axis=0)
d. Compute variance: σ²_k = var(X_k, axis=0)
3. Apply smoothing: σ²_k = σ²_k + ε for all k
4. Return (classes, priors, means, variances)
Time complexity: $O(n \cdot d)$
Space complexity: $O(K \cdot d)$
This is remarkably efficient—linear in both samples and features, making Gaussian NB suitable for large-scale applications.
```python
import numpy as np
from scipy.special import logsumexp

class GaussianNaiveBayes:
    """
    Complete, production-quality Gaussian Naive Bayes implementation.

    This implementation includes:
    - Robust variance smoothing
    - Numerically stable log-space computation
    - Proper handling of edge cases
    - Efficient vectorized operations
    """

    def __init__(self, var_smoothing=1e-9):
        """
        Parameters
        ----------
        var_smoothing : float
            Portion of the largest variance to add to all variances
            for numerical stability.
        """
        self.var_smoothing = var_smoothing
        self.classes_ = None
        self.class_count_ = None
        self.class_prior_ = None
        self.theta_ = None    # Means
        self.var_ = None      # Variances
        self.epsilon_ = None  # Actual smoothing value

    def fit(self, X, y):
        """
        Fit Gaussian Naive Bayes according to X, y.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training vectors
        y : array-like of shape (n_samples,)
            Target values

        Returns
        -------
        self : object
            Fitted estimator
        """
        X = np.asarray(X, dtype=np.float64)
        y = np.asarray(y)
        n_samples, n_features = X.shape

        # Identify classes
        self.classes_ = np.unique(y)
        n_classes = len(self.classes_)

        # Initialize arrays
        self.class_count_ = np.zeros(n_classes, dtype=np.float64)
        self.theta_ = np.zeros((n_classes, n_features), dtype=np.float64)
        self.var_ = np.zeros((n_classes, n_features), dtype=np.float64)

        # Compute statistics for each class
        for k, cls in enumerate(self.classes_):
            X_class = X[y == cls]
            self.class_count_[k] = X_class.shape[0]
            self.theta_[k, :] = X_class.mean(axis=0)
            self.var_[k, :] = X_class.var(axis=0)

        # Compute priors
        self.class_prior_ = self.class_count_ / self.class_count_.sum()

        # Compute variance smoothing (relative to largest variance)
        self.epsilon_ = self.var_smoothing * self.var_.max()
        self.var_ += self.epsilon_
        return self

    def _joint_log_likelihood(self, X):
        """
        Compute joint log-likelihood: log P(y, x) = log P(x|y) + log P(y)

        Returns
        -------
        jll : array of shape (n_samples, n_classes)
        """
        X = np.asarray(X, dtype=np.float64)
        n_samples = X.shape[0]
        n_classes = len(self.classes_)
        jll = np.zeros((n_samples, n_classes))

        for k in range(n_classes):
            # Log-likelihood for class k: sum over features of
            # log N(x_j | μ_jk, σ²_jk) = -0.5 * [(x_j - μ_jk)² / σ²_jk + log(2π σ²_jk)]
            diff = X - self.theta_[k]
            log_lik = -0.5 * np.sum(
                (diff ** 2) / self.var_[k] + np.log(2 * np.pi * self.var_[k]),
                axis=1
            )
            # Add log prior
            jll[:, k] = log_lik + np.log(self.class_prior_[k])
        return jll

    def predict_log_proba(self, X):
        """Return log-probability estimates for test vector X."""
        jll = self._joint_log_likelihood(X)
        # Normalize using log-sum-exp
        log_prob_x = logsumexp(jll, axis=1, keepdims=True)
        return jll - log_prob_x

    def predict_proba(self, X):
        """Return probability estimates for test vector X."""
        return np.exp(self.predict_log_proba(X))

    def predict(self, X):
        """Perform classification on test vectors X."""
        jll = self._joint_log_likelihood(X)
        return self.classes_[np.argmax(jll, axis=1)]

    def score(self, X, y):
        """Return mean accuracy on the given test data and labels."""
        return np.mean(self.predict(X) == y)

def full_demonstration():
    """Complete demonstration of Gaussian NB training and prediction."""
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB as SklearnGNB

    print("=" * 60)
    print("COMPLETE GAUSSIAN NAIVE BAYES DEMONSTRATION")
    print("=" * 60)

    # Load the Iris dataset
    iris = load_iris()
    X, y = iris.data, iris.target
    print("Dataset: Iris")
    print(f"Samples: {X.shape[0]}, Features: {X.shape[1]}, Classes: {len(np.unique(y))}")

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )

    # Train our model
    model = GaussianNaiveBayes()
    model.fit(X_train, y_train)

    # Train sklearn model for comparison
    sklearn_model = SklearnGNB()
    sklearn_model.fit(X_train, y_train)

    print("--- Learned Parameters ---")
    print(f"Classes: {model.classes_}")
    print(f"Class counts: {model.class_count_}")
    print(f"Class priors: {model.class_prior_}")
    print(f"Means (theta): {model.theta_}")
    print(f"Variances (first 2 features): {model.var_[:, :2]}")

    # Evaluate
    our_accuracy = model.score(X_test, y_test)
    sklearn_accuracy = sklearn_model.score(X_test, y_test)
    print("--- Evaluation ---")
    print(f"Our implementation accuracy: {our_accuracy:.4f}")
    print(f"sklearn accuracy: {sklearn_accuracy:.4f}")

    # Verify predictions match
    our_preds = model.predict(X_test)
    sklearn_preds = sklearn_model.predict(X_test)
    print(f"Predictions match sklearn: {np.all(our_preds == sklearn_preds)}")

    # Example probability predictions
    print("--- Sample Predictions with Probabilities ---")
    for i in [0, 15, 30]:
        probs = model.predict_proba(X_test[i:i+1])[0]
        pred = model.predict(X_test[i:i+1])[0]
        true = y_test[i]
        print(f"Sample {i}: True={true}, Pred={pred}, Probs={probs.round(4)}")

full_demonstration()
```

We have developed a complete understanding of parameter estimation in Gaussian Naive Bayes.
What's next:
With parameters estimated from training data, we can now classify new observations. The next page explores the decision boundary—the geometric surface that separates classes in feature space. We'll see how the Gaussian parameters determine this boundary and discover when it's linear versus quadratic.
You now understand how to estimate Gaussian Naive Bayes parameters from data. The elegant closed-form solutions—sample means and variances—make training extremely fast and simple. With variance smoothing and regularization, the model handles edge cases robustly. Next, we analyze the geometry of classification decisions.