Gaussian Naive Bayes is elegant in its assumptions: each feature follows a Gaussian distribution within each class. But where do the Gaussians come from? How do we determine the mean $\mu_{jk}$ and variance $\sigma^2_{jk}$ for feature $j$ in class $k$?
The answer lies in parameter estimation—the statistical process of inferring distribution parameters from observed data. Given a training set with labeled examples, we estimate the Gaussian parameters that best explain the observed feature values within each class.
This process is remarkably simple for Gaussians: the maximum likelihood estimates are the sample mean and sample variance. Yet beneath this simplicity lies deep statistical theory that illuminates when these estimates are reliable, when they might fail, and how to handle edge cases.
This page develops the complete theory of parameter estimation for Gaussian Naive Bayes, from first principles to practical implementation.
By the end of this page, you will understand: (1) the principle of maximum likelihood estimation (MLE), (2) derivation of Gaussian MLE for mean and variance, (3) bias correction in variance estimation, (4) handling edge cases like zero variance, (5) variance smoothing and regularization, and (6) complete implementation of Gaussian NB training.
Maximum likelihood estimation (MLE) is the foundational approach to parameter estimation in statistics. The principle is elegantly simple: choose parameters that maximize the probability (or density) of the observed data.
Given: a parametric probability density $f(x | \theta)$ with unknown parameter $\theta$, and observed data $\mathcal{D} = \{x_1, x_2, \ldots, x_n\}$ drawn from it.
The likelihood function is the probability of observing the data as a function of the parameter: $$\mathcal{L}(\theta | \mathcal{D}) = \prod_{i=1}^{n} f(x_i | \theta)$$
Assuming observations are independent and identically distributed (i.i.d.).
The MLE is the value of $\theta$ that maximizes the likelihood: $$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \mathcal{L}(\theta | \mathcal{D})$$
Since $\log$ is monotonically increasing, maximizing likelihood is equivalent to maximizing log-likelihood: $$\ell(\theta | \mathcal{D}) = \log \mathcal{L}(\theta | \mathcal{D}) = \sum_{i=1}^{n} \log f(x_i | \theta)$$
This is computationally preferable (sums instead of products) and analytically convenient (derivatives are simpler).
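To make the computational point concrete, here is a small sketch (with illustrative numbers) showing that the raw likelihood of even a moderately sized sample underflows double-precision floats, while the log-likelihood remains perfectly representable:

```python
import numpy as np
from scipy import stats

# 1000 standard-normal observations: each density is below 0.4, so the
# product of 1000 of them is around 10^-500 and underflows float64 to 0.0.
# The log-likelihood, a sum of log-densities, stays finite.
rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=1000)

densities = stats.norm(0.0, 1.0).pdf(data)
likelihood = np.prod(densities)                        # underflows to 0.0
log_likelihood = np.sum(stats.norm(0.0, 1.0).logpdf(data))  # finite

print(f"Likelihood:     {likelihood}")
print(f"Log-likelihood: {log_likelihood:.2f}")
```

This is why every implementation in this page works in log space.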
MLE asks: "Which parameter value(s) would have made the observed data most likely to occur?"
If we observe many values clustered around 50, the MLE for the mean will be near 50. If values are spread out, the MLE for variance will be large. MLE formalizes this intuition precisely.
MLEs have desirable asymptotic properties: they are consistent (converge to the true value with more data), asymptotically efficient (attain the Cramér-Rao lower bound in the large-sample limit), and asymptotically normal (follow a Gaussian distribution for large samples). These properties justify the widespread use of MLE in machine learning.
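Consistency is easy to check empirically. A minimal sketch (sample sizes and seed are illustrative): the sample mean, which is the Gaussian MLE for $\mu$, drifts toward the true mean as $n$ grows.

```python
import numpy as np

# Empirical check of MLE consistency: as the sample size grows,
# the sample mean (the MLE for μ) converges to the true mean.
rng = np.random.default_rng(42)
true_mu = 5.0

errors = {}
for n in (10, 1_000, 100_000):
    sample = rng.normal(true_mu, 2.0, size=n)
    errors[n] = abs(sample.mean() - true_mu)
    print(f"n = {n:>6}: |mu_hat - mu| = {errors[n]:.4f}")
```

The standard error of the mean is $\sigma / \sqrt{n}$, so each 100-fold increase in $n$ cuts the typical error by a factor of 10.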
```python
import numpy as np
from scipy import stats

def demonstrate_mle_intuition():
    """Show how MLE finds the parameters that maximize data likelihood."""
    print("=" * 60)
    print("MAXIMUM LIKELIHOOD ESTIMATION INTUITION")
    print("=" * 60)

    # Generate data from a known distribution
    np.random.seed(42)
    true_mu = 5.0
    true_sigma = 2.0
    n = 50
    data = np.random.normal(true_mu, true_sigma, n)

    print(f"Generated {n} samples from N({true_mu}, {true_sigma}²)")
    print(f"Sample mean: {data.mean():.4f}")
    print(f"Sample std:  {data.std():.4f}")

    # Compute likelihood for different mu values (fixing sigma at true value)
    print("--- Likelihood as a function of μ (σ fixed at true value) ---")
    mu_range = np.linspace(3, 7, 41)
    log_likelihoods = []
    for mu in mu_range:
        # Log-likelihood = sum of log(PDF) for each observation
        log_lik = np.sum(stats.norm(mu, true_sigma).logpdf(data))
        log_likelihoods.append(log_lik)
    log_likelihoods = np.array(log_likelihoods)

    best_mu = mu_range[np.argmax(log_likelihoods)]
    print("Log-likelihood evaluated at different μ values:")
    for i in range(0, len(mu_range), 8):
        print(f"  μ = {mu_range[i]:.1f}: log L = {log_likelihoods[i]:.2f}")
    print(f"MLE for μ: {best_mu:.2f}")
    print(f"Sample mean: {data.mean():.4f}")
    print(f"True μ: {true_mu}")
    print("→ MLE equals the sample mean (for known σ)")

    # Now compute likelihood over both μ and σ
    print("--- Likelihood as a function of both μ and σ ---")
    mu_grid = np.linspace(4, 6, 21)
    sigma_grid = np.linspace(1, 3, 21)
    log_lik_grid = np.zeros((len(sigma_grid), len(mu_grid)))
    for i, sigma in enumerate(sigma_grid):
        for j, mu in enumerate(mu_grid):
            log_lik_grid[i, j] = np.sum(stats.norm(mu, sigma).logpdf(data))

    # Find the maximum on the grid
    max_idx = np.unravel_index(np.argmax(log_lik_grid), log_lik_grid.shape)
    mle_sigma = sigma_grid[max_idx[0]]
    mle_mu = mu_grid[max_idx[1]]
    print(f"MLE for (μ, σ): ({mle_mu:.2f}, {mle_sigma:.2f})")
    print(f"Sample statistics: ({data.mean():.2f}, {data.std():.2f})")
    print(f"True parameters: ({true_mu}, {true_sigma})")

    # Verify analytically
    print("--- Analytical MLE formulas ---")
    mle_mu_analytical = data.mean()
    mle_var_analytical = data.var()  # MLE variance (biased, n in denominator)
    mle_sigma_analytical = np.sqrt(mle_var_analytical)
    print(f"MLE μ  = sample mean = {mle_mu_analytical:.4f}")
    print(f"MLE σ² = (1/n)Σ(x_i - x̄)² = {mle_var_analytical:.4f}")
    print(f"MLE σ  = {mle_sigma_analytical:.4f}")

    # Compare to the unbiased variance estimator
    unbiased_var = data.var(ddof=1)
    print(f"Unbiased variance (1/(n-1))Σ(x_i - x̄)² = {unbiased_var:.4f}")
    print(f"True σ² = {true_sigma**2}")

demonstrate_mle_intuition()
```

Let us derive the maximum likelihood estimators for the Gaussian distribution parameters rigorously.
Given i.i.d. observations $x_1, x_2, \ldots, x_n$ from $\mathcal{N}(\mu, \sigma^2)$:
$$f(x_i | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$
The log-likelihood is: $$\ell(\mu, \sigma^2) = \sum_{i=1}^{n} \log f(x_i | \mu, \sigma^2)$$
$$= \sum_{i=1}^{n} \left[ -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x_i - \mu)^2}{2\sigma^2} \right]$$
$$= -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$$
Taking the partial derivative and setting to zero: $$\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0$$
$$\sum_{i=1}^{n}x_i - n\mu = 0$$
$$\hat{\mu}_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^{n}x_i = \bar{x}$$
The MLE for the mean is the sample mean.
Taking the partial derivative: $$\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^{n}(x_i - \mu)^2 = 0$$
$$\frac{n}{2\sigma^2} = \frac{1}{2(\sigma^2)^2}\sum_{i=1}^{n}(x_i - \mu)^2$$
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$$
Substituting $\hat{\mu}_{\text{MLE}} = \bar{x}$: $$\hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
The MLE for variance is the sample variance with $n$ in the denominator.
The MLE for variance $\hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum(x_i - \bar{x})^2$ is biased: $E[\hat{\sigma}^2_{\text{MLE}}] = \frac{n-1}{n}\sigma^2$. It systematically underestimates the true variance. The unbiased estimator uses $n-1$: $s^2 = \frac{1}{n-1}\sum(x_i - \bar{x})^2$. For large $n$, the difference is negligible, but for small samples it matters.
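The bias factor $\frac{n-1}{n}$ can be verified by simulation. A short sketch (sample size, variance, and trial count are illustrative): averaging the MLE variance over many repeated small samples recovers $\frac{n-1}{n}\sigma^2$, not $\sigma^2$.

```python
import numpy as np

# Empirical check of the variance bias: with n = 5 samples from N(0, σ² = 4),
# the MLE variance (n in the denominator) averages to (n-1)/n * σ² = 3.2,
# while the ddof=1 estimator averages to the true σ² = 4.
rng = np.random.default_rng(0)
n, sigma2, trials = 5, 4.0, 200_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
mle_var = samples.var(axis=1).mean()                 # biased: ≈ 3.2
unbiased_var = samples.var(axis=1, ddof=1).mean()    # unbiased: ≈ 4.0

print(f"E[MLE variance]      ≈ {mle_var:.3f}  (theory: {(n - 1) / n * sigma2:.1f})")
print(f"E[unbiased variance] ≈ {unbiased_var:.3f}  (theory: {sigma2:.1f})")
```

With $n = 5$ the underestimate is a full 20%; at $n = 1000$ it would be 0.1%, which is why the distinction rarely matters for large classes.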
To confirm these are maxima, we check the second derivatives (Hessian matrix):
$$\frac{\partial^2 \ell}{\partial \mu^2} = -\frac{n}{\sigma^2} < 0$$
$$\frac{\partial^2 \ell}{\partial (\sigma^2)^2} = \frac{n}{2(\sigma^2)^2} - \frac{1}{(\sigma^2)^3}\sum_{i=1}^{n}(x_i - \mu)^2$$
At the MLE, $\sum_{i=1}^{n}(x_i - \hat{\mu})^2 = n\hat{\sigma}^2$, so this second derivative equals $-\frac{n}{2\hat{\sigma}^4} < 0$, confirming a maximum (provided $\hat{\sigma}^2 > 0$, i.e., the observations are not all identical).
| Parameter | MLE Formula | Interpretation |
|---|---|---|
| Mean $\mu$ | $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n}x_i = \bar{x}$ | Sample average |
| Variance $\sigma^2$ | $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$ | Sample variance (biased) |
| Property | Mean ($\hat{\mu}$) | Variance ($\hat{\sigma}^2$) |
|---|---|---|
| Unbiased? | Yes: $E[\hat{\mu}] = \mu$ | No: $E[\hat{\sigma}^2] = \frac{n-1}{n}\sigma^2$ |
| Consistent? | Yes | Yes |
| Variance | $\text{Var}[\hat{\mu}] = \frac{\sigma^2}{n}$ | $\text{Var}[\hat{\sigma}^2] = \frac{2\sigma^4}{n}$ (approx) |
| Efficient? | Yes (achieves Cramér-Rao bound) | Asymptotically |
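The entry $\text{Var}[\hat{\mu}] = \frac{\sigma^2}{n}$ from the table above can also be checked by simulation. A minimal sketch (sizes and seed are illustrative): the variance of the sample mean across many repetitions matches $\sigma^2 / n$.

```python
import numpy as np

# Empirical check of Var[μ̂] = σ²/n: the sample mean of n = 25 draws from
# N(0, σ² = 9) should itself have variance 9/25 = 0.36 across repetitions.
rng = np.random.default_rng(1)
n, sigma2, trials = 25, 9.0, 200_000

means = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n)).mean(axis=1)
print(f"Empirical Var[mu_hat] = {means.var():.4f}  (theory: {sigma2 / n:.2f})")
```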
In Gaussian Naive Bayes, we estimate separate Gaussian parameters for each feature within each class. This section details the complete estimation procedure.
For a classification problem with:

- $K$ classes
- $d$ continuous features
- $n$ labeled training samples
We estimate:

- Class priors $\pi_k$ ($K - 1$ free parameters, since they sum to 1)
- A mean $\mu_{jk}$ for every feature-class pair ($K \cdot d$ parameters)
- A variance $\sigma^2_{jk}$ for every feature-class pair ($K \cdot d$ parameters)
Total: $(K-1) + K \cdot d + K \cdot d = 2Kd + K - 1$ parameters
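The parameter count is easy to encode as a helper (a hypothetical function for illustration, not part of any library):

```python
def gaussian_nb_param_count(K: int, d: int) -> int:
    """Free parameters in Gaussian NB: (K-1) priors + K*d means + K*d variances."""
    return (K - 1) + K * d + K * d  # = 2*K*d + K - 1

# Example: 3 classes, 4 features (like Iris) -> 2*3*4 + 3 - 1 = 26
print(gaussian_nb_param_count(3, 4))  # 26
```

Compare this to a full multivariate Gaussian per class, which would need $K \cdot d(d+3)/2$ likelihood parameters; the naive independence assumption is what keeps the count linear in $d$.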
The MLE for class priors is the class frequency: $$\hat{\pi}_k = \frac{n_k}{n}$$
Where $n_k$ is the number of training samples in class $k$ and $n$ is the total sample count.
$$\hat{\mu}_{jk} = \frac{1}{n_k} \sum_{i: y_i = k} x_{ij}$$
The mean of feature $j$ over all training samples belonging to class $k$.
$$\hat{\sigma}^2_{jk} = \frac{1}{n_k} \sum_{i: y_i = k} (x_{ij} - \hat{\mu}_{jk})^2$$
The variance of feature $j$ over all training samples belonging to class $k$.
```python
import numpy as np

def estimate_gaussian_nb_parameters(X, y):
    """
    Estimate all Gaussian Naive Bayes parameters from training data.

    Parameters
    ----------
    X : np.ndarray of shape (n_samples, n_features)
        Training feature matrix
    y : np.ndarray of shape (n_samples,)
        Training class labels

    Returns
    -------
    params : dict
        Dictionary containing:
        - priors: array of shape (n_classes,)
        - means: array of shape (n_classes, n_features)
        - variances: array of shape (n_classes, n_features)
        - classes: array of unique class labels
    """
    n_samples, n_features = X.shape
    classes = np.unique(y)
    n_classes = len(classes)

    # Initialize parameter arrays
    priors = np.zeros(n_classes)
    means = np.zeros((n_classes, n_features))
    variances = np.zeros((n_classes, n_features))

    print("=" * 60)
    print("GAUSSIAN NAIVE BAYES PARAMETER ESTIMATION")
    print("=" * 60)
    print(f"Dataset: {n_samples} samples, {n_features} features, {n_classes} classes")

    for k, cls in enumerate(classes):
        # Get samples for this class
        class_mask = (y == cls)
        X_class = X[class_mask]
        n_class = len(X_class)

        print(f"--- Class {cls} ---")
        print(f"Number of samples: {n_class}")

        # Prior probability
        priors[k] = n_class / n_samples
        print(f"Prior π_{cls} = {n_class}/{n_samples} = {priors[k]:.4f}")

        # Mean and variance for each feature
        means[k, :] = X_class.mean(axis=0)
        variances[k, :] = X_class.var(axis=0)  # MLE (n in denominator)

        print("Feature statistics:")
        for j in range(min(n_features, 5)):  # Show first 5 features
            print(f"  Feature {j}: μ = {means[k, j]:.4f}, σ² = {variances[k, j]:.4f}")
        if n_features > 5:
            print(f"  ... ({n_features - 5} more features)")

    return {
        'priors': priors,
        'means': means,
        'variances': variances,
        'classes': classes
    }

def verify_estimation_correctness():
    """Verify our estimation matches sklearn's GaussianNB."""
    from sklearn.naive_bayes import GaussianNB

    # Generate synthetic data
    np.random.seed(42)
    n_per_class = 100
    n_features = 4

    X_0 = np.random.randn(n_per_class, n_features) * 2 + 0    # Class 0: centered at origin
    X_1 = np.random.randn(n_per_class, n_features) * 1.5 + 3  # Class 1: shifted
    X_2 = np.random.randn(n_per_class, n_features) * 0.5 - 1  # Class 2: different variance

    X = np.vstack([X_0, X_1, X_2])
    y = np.array([0]*n_per_class + [1]*n_per_class + [2]*n_per_class)

    # Shuffle
    shuffle_idx = np.random.permutation(len(y))
    X, y = X[shuffle_idx], y[shuffle_idx]

    # Our estimation
    print("\n" + "=" * 60)
    print("VERIFICATION: COMPARING TO SKLEARN")
    print("=" * 60)
    params = estimate_gaussian_nb_parameters(X, y)

    # sklearn estimation
    gnb = GaussianNB()
    gnb.fit(X, y)

    print("--- Comparison with sklearn ---")
    print("Priors:")
    print(f"  Ours:    {params['priors']}")
    print(f"  sklearn: {gnb.class_prior_}")
    print(f"  Match: {np.allclose(params['priors'], gnb.class_prior_)}")

    print("Means (first 2 features shown):")
    print(f"  Ours:    {params['means'][:, :2]}")
    print(f"  sklearn: {gnb.theta_[:, :2]}")
    print(f"  Match: {np.allclose(params['means'], gnb.theta_)}")

    print("Variances (first 2 features shown):")
    print(f"  Ours:    {params['variances'][:, :2]}")
    print(f"  sklearn: {gnb.var_[:, :2]}")
    print(f"  Match: {np.allclose(params['variances'], gnb.var_)}")

verify_estimation_correctness()
```

For large datasets, use numpy's vectorized operations: `X[y == k].mean(axis=0)` and `X[y == k].var(axis=0)` compute all feature statistics for a class in one operation. This is much faster than looping over features.
Real-world data presents challenges that require careful handling. Several edge cases can cause problems in Gaussian Naive Bayes parameter estimation.
If a feature has the same value for all samples in a class, its variance estimate is zero: $$\hat{\sigma}^2_{jk} = 0$$
Problem: Division by zero when computing log-likelihood: $$\log f(x_j | k) = -\frac{(x_j - \mu_{jk})^2}{2\sigma^2_{jk}} - \frac{1}{2}\log(2\pi\sigma^2_{jk})$$
Both terms are undefined when $\sigma^2_{jk} = 0$: the first divides by zero and the second takes $\log(0)$.
Solutions:

- Variance smoothing: add a small constant $\epsilon$ to every variance (the approach used in the implementation below and in sklearn).
- Enforce a minimum variance floor, e.g. `np.maximum(variances, eps)`.
- Drop features that are constant within the training data during preprocessing.
With few samples per class, variance estimates are unreliable (high variance of the variance estimate!).
Solutions:

- Pool variance estimates across classes (assume equal variance per feature).
- Shrink class-specific variances toward a pooled estimate.
- Place a prior on the variance and use a Bayesian estimate.
If a class has no training examples, we cannot estimate its parameters.
Solutions:

- Check class coverage before training and fail fast with a clear error.
- Use stratified sampling when splitting data so every class appears in the training set.
- If the class must be kept, its parameters have to come from domain knowledge; they cannot be estimated from zero samples.
```python
import numpy as np

class RobustGaussianNB:
    """Gaussian Naive Bayes with robust handling of edge cases."""

    def __init__(self, var_smoothing=1e-9, min_samples_per_class=1):
        """
        Parameters
        ----------
        var_smoothing : float
            Minimum variance to prevent division by zero. Added to all variances.
        min_samples_per_class : int
            Minimum samples required per class (for warning).
        """
        self.var_smoothing = var_smoothing
        self.min_samples_per_class = min_samples_per_class
        self.classes_ = None
        self.priors_ = None
        self.means_ = None
        self.variances_ = None

    def fit(self, X, y):
        """Fit Gaussian NB with edge case handling."""
        n_samples, n_features = X.shape
        self.classes_ = np.unique(y)
        n_classes = len(self.classes_)

        self.priors_ = np.zeros(n_classes)
        self.means_ = np.zeros((n_classes, n_features))
        self.variances_ = np.zeros((n_classes, n_features))

        print("=" * 60)
        print("ROBUST GAUSSIAN NB: EDGE CASE HANDLING")
        print("=" * 60)

        for k, cls in enumerate(self.classes_):
            class_mask = (y == cls)
            X_class = X[class_mask]
            n_class = len(X_class)

            # Edge case 1: Too few samples
            if n_class < self.min_samples_per_class:
                print(f"⚠ WARNING: Class {cls} has only {n_class} samples!")

            # Prior
            self.priors_[k] = n_class / n_samples

            # Mean (works even with 1 sample)
            self.means_[k, :] = X_class.mean(axis=0)

            # Variance
            raw_variance = X_class.var(axis=0)

            # Edge case 2: Zero variance detection
            zero_var_features = np.sum(raw_variance == 0)
            if zero_var_features > 0:
                print(f"⚠ Class {cls}: {zero_var_features} features have zero variance")
                print(f"  These will be smoothed with ε = {self.var_smoothing}")

            # Apply variance smoothing
            self.variances_[k, :] = raw_variance + self.var_smoothing

        print("✓ Training complete with variance smoothing applied")
        return self

    def _log_likelihood(self, X, class_idx):
        """Compute log-likelihood for samples under a class."""
        means = self.means_[class_idx]
        variances = self.variances_[class_idx]
        # Log PDF of Gaussian: -0.5 * [(x-μ)²/σ² + log(2πσ²)]
        log_pdf = -0.5 * (((X - means) ** 2) / variances + np.log(2 * np.pi * variances))
        return log_pdf.sum(axis=1)

    def predict_log_proba(self, X):
        """Compute log posterior probabilities."""
        n_samples = X.shape[0]
        n_classes = len(self.classes_)
        log_proba = np.zeros((n_samples, n_classes))
        for k in range(n_classes):
            log_proba[:, k] = np.log(self.priors_[k]) + self._log_likelihood(X, k)
        # Normalize with the log-sum-exp trick
        max_log = log_proba.max(axis=1, keepdims=True)
        log_sum = np.log(np.sum(np.exp(log_proba - max_log), axis=1, keepdims=True))
        return log_proba - max_log - log_sum

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_log_proba(X), axis=1)]

def demonstrate_edge_cases():
    """Show edge case handling in action."""
    np.random.seed(42)

    # Create a problematic dataset
    print("--- Creating dataset with edge cases ---")
    X_0 = np.random.randn(50, 5)        # Class 0: normal data
    X_1 = np.random.randn(50, 5)
    X_1[:, 2] = 3.0                     # Class 1: feature 2 is constant (zero variance)
    X_2 = np.random.randn(3, 5) + 5     # Class 2: very few samples

    X = np.vstack([X_0, X_1, X_2])
    y = np.array([0]*50 + [1]*50 + [2]*3)

    print("Class 0: 50 samples, all features variable")
    print("Class 1: 50 samples, feature 2 is constant (zero variance)")
    print("Class 2: 3 samples (small sample size)")

    # Train the robust model
    model = RobustGaussianNB(var_smoothing=1e-9)
    model.fit(X, y)

    print("--- Estimated variances (Class 1, zero-variance feature) ---")
    print("Feature 2 raw variance: 0 (constant)")
    print(f"Feature 2 smoothed variance: {model.variances_[1, 2]:.2e}")

    # Test prediction
    X_test = np.array([
        [0, 0, 3, 0, 0],   # Matches class 1's constant feature
        [5, 5, 5, 5, 5],   # Near class 2
    ])
    predictions = model.predict(X_test)
    log_proba = model.predict_log_proba(X_test)

    print("--- Test predictions ---")
    for i, (pred, lp) in enumerate(zip(predictions, log_proba)):
        print(f"Sample {i}: Predicted class {pred}")
        print(f"  Log-probabilities: {lp}")

demonstrate_edge_cases()
```

sklearn's GaussianNB uses var_smoothing=1e-9 by default: this fraction of the largest feature variance is added to all variances, preventing numerical issues while having negligible effect on normal variances. The smoothing effectively places a small lower bound on feature variability.
Beyond simple smoothing, several variance regularization strategies can improve Gaussian Naive Bayes performance, especially with limited data.
Instead of separate variances per class, share a single variance per feature across all classes:
$$\hat{\sigma}^2_j = \frac{1}{n} \sum_{k=1}^{K} \sum_{i: y_i = k} (x_{ij} - \hat{\mu}_{jk})^2$$
This is equivalent to assuming homoscedasticity (equal variance across classes).
Advantages:
Disadvantages:
Blend class-specific and pooled variances:
$$\tilde{\sigma}^2_{jk} = (1-\lambda) \hat{\sigma}^2_{jk} + \lambda \hat{\sigma}^2_j$$
Where $\lambda \in [0, 1]$ controls the shrinkage strength.
Use an inverse-gamma prior on variance (conjugate prior for Gaussian variance):
$$\sigma^2 \sim \text{Inverse-Gamma}(\alpha_0, \beta_0)$$
The posterior mean is: $$\hat{\sigma}^2_{\text{Bayes}} = \frac{\beta_0 + \frac{1}{2}\sum_i (x_i - \bar{x})^2}{\alpha_0 + \frac{n}{2} - 1}$$
This naturally shrinks small-sample variance estimates toward the prior.
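A minimal sketch of this estimator (the prior hyperparameters $\alpha_0 = 3$, $\beta_0 = 2$ are illustrative assumptions, not recommended values). The posterior mean is a convex combination of the prior mean $\beta_0 / (\alpha_0 - 1)$ and the sample variance, so with few samples it sits between the two, and with many samples it tracks the MLE:

```python
import numpy as np

def bayes_variance(x: np.ndarray, alpha0: float = 3.0, beta0: float = 2.0) -> float:
    """Posterior mean of σ² under an Inverse-Gamma(α₀, β₀) prior (sketch)."""
    n = len(x)
    ss = np.sum((x - x.mean()) ** 2)
    return (beta0 + 0.5 * ss) / (alpha0 + n / 2 - 1)

rng = np.random.default_rng(0)
small = rng.normal(0, 1, size=3)        # tiny sample: strong shrinkage
large = rng.normal(0, 1, size=10_000)   # big sample: prior washes out

print(f"Prior mean of σ²: {2.0 / (3.0 - 1):.3f}")
print(f"n=3:     MLE = {small.var():.3f}, Bayes = {bayes_variance(small):.3f}")
print(f"n=10000: MLE = {large.var():.3f}, Bayes = {bayes_variance(large):.3f}")
```

Algebraically, $\hat{\sigma}^2_{\text{Bayes}} = \frac{(\alpha_0 - 1)\,\hat{\sigma}^2_{\text{prior}} + \frac{n}{2}\,\hat{\sigma}^2_{\text{MLE}}}{(\alpha_0 - 1) + \frac{n}{2}}$, which makes the shrinkage weights explicit.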
| Strategy | Parameters | Best For | Decision Boundary |
|---|---|---|---|
| Class-specific variance | $K \times d$ | Large data, different class spreads | Quadratic (curved) |
| Pooled variance | $d$ | Small data, similar class spreads | Linear |
| Shrinkage | $K \times d + 1$ (λ) | Moderate data, uncertain structure | Between linear and quadratic |
| Bayesian | $K \times d + 2$ (α, β) | Prior knowledge available | Regularized quadratic |
```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class RegularizedGaussianNB(BaseEstimator, ClassifierMixin):
    """Gaussian Naive Bayes with configurable variance regularization."""

    def __init__(self, variance_estimator='class_specific', shrinkage=0.5):
        """
        Parameters
        ----------
        variance_estimator : str
            One of: 'class_specific', 'pooled', 'shrinkage'
        shrinkage : float
            Shrinkage parameter λ for the 'shrinkage' estimator.
            0 = class-specific, 1 = fully pooled.
        """
        self.variance_estimator = variance_estimator
        self.shrinkage = shrinkage

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.classes_ = np.unique(y)
        n_classes = len(self.classes_)

        self.priors_ = np.zeros(n_classes)
        self.means_ = np.zeros((n_classes, n_features))
        class_variances = np.zeros((n_classes, n_features))

        # First pass: compute class-specific means and variances
        for k, cls in enumerate(self.classes_):
            X_class = X[y == cls]
            self.priors_[k] = len(X_class) / n_samples
            self.means_[k] = X_class.mean(axis=0)
            class_variances[k] = X_class.var(axis=0)

        # Pooled variance: squared deviations from each class mean, over all samples
        pooled_variance = np.zeros(n_features)
        for k, cls in enumerate(self.classes_):
            X_class = X[y == cls]
            pooled_variance += np.sum((X_class - self.means_[k])**2, axis=0)
        pooled_variance /= n_samples

        # Apply the chosen regularization strategy
        if self.variance_estimator == 'class_specific':
            self.variances_ = class_variances
        elif self.variance_estimator == 'pooled':
            self.variances_ = np.tile(pooled_variance, (n_classes, 1))
        elif self.variance_estimator == 'shrinkage':
            lam = self.shrinkage
            self.variances_ = (1 - lam) * class_variances + lam * pooled_variance
        else:
            raise ValueError(f"Unknown variance estimator: {self.variance_estimator}")

        # Add small smoothing (floor the variances)
        self.variances_ = np.maximum(self.variances_, 1e-9)
        return self

    def _log_likelihood(self, X):
        """Compute log-likelihood for all classes."""
        n_samples = X.shape[0]
        n_classes = len(self.classes_)
        log_lik = np.zeros((n_samples, n_classes))
        for k in range(n_classes):
            diff = X - self.means_[k]
            log_lik[:, k] = -0.5 * np.sum(
                diff**2 / self.variances_[k] + np.log(2 * np.pi * self.variances_[k]),
                axis=1
            )
        return log_lik

    def predict_proba(self, X):
        log_posterior = self._log_likelihood(X) + np.log(self.priors_)
        # Log-sum-exp for numerical stability
        max_log = log_posterior.max(axis=1, keepdims=True)
        log_sum = max_log + np.log(np.sum(np.exp(log_posterior - max_log),
                                          axis=1, keepdims=True))
        return np.exp(log_posterior - log_sum)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]

def compare_variance_strategies():
    """Compare different variance estimation strategies."""
    from sklearn.model_selection import cross_val_score
    from sklearn.datasets import make_classification

    print("=" * 60)
    print("VARIANCE REGULARIZATION COMPARISON")
    print("=" * 60)

    # Generate dataset
    np.random.seed(42)
    X, y = make_classification(
        n_samples=200, n_features=10, n_informative=5, n_redundant=2,
        n_classes=3, n_clusters_per_class=1, random_state=42
    )
    print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features, 3 classes")

    # Compare strategies
    strategies = [
        ('Class-specific', RegularizedGaussianNB(variance_estimator='class_specific')),
        ('Pooled', RegularizedGaussianNB(variance_estimator='pooled')),
        ('Shrinkage (λ=0.3)', RegularizedGaussianNB(variance_estimator='shrinkage', shrinkage=0.3)),
        ('Shrinkage (λ=0.5)', RegularizedGaussianNB(variance_estimator='shrinkage', shrinkage=0.5)),
        ('Shrinkage (λ=0.7)', RegularizedGaussianNB(variance_estimator='shrinkage', shrinkage=0.7)),
    ]

    print("5-Fold Cross-Validation Accuracy:")
    print("-" * 40)
    for name, model in strategies:
        scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
        print(f"{name:25s}: {scores.mean():.4f} ± {scores.std():.4f}")

compare_variance_strategies()
```

Let us consolidate everything into a complete, production-ready training algorithm for Gaussian Naive Bayes.
Input:
Output:
Procedure:
1. Identify unique classes: classes = unique(y)
2. For each class k in classes:
a. Select samples: X_k = X[y == k]
b. Compute prior: π_k = |X_k| / n
c. Compute mean: μ_k = mean(X_k, axis=0)
d. Compute variance: σ²_k = var(X_k, axis=0)
3. Apply smoothing: σ²_k = σ²_k + ε for all k
4. Return (classes, priors, means, variances)
Time complexity: $O(n \cdot d)$
Space complexity: $O(K \cdot d)$
This is remarkably efficient—linear in both samples and features, making Gaussian NB suitable for large-scale applications.
```python
import numpy as np
from scipy.special import logsumexp

class GaussianNaiveBayes:
    """
    Complete, production-quality Gaussian Naive Bayes implementation.

    This implementation includes:
    - Robust variance smoothing
    - Numerically stable log-space computation
    - Proper handling of edge cases
    - Efficient vectorized operations
    """

    def __init__(self, var_smoothing=1e-9):
        """
        Parameters
        ----------
        var_smoothing : float
            Portion of the largest variance to add to all variances
            for numerical stability.
        """
        self.var_smoothing = var_smoothing
        self.classes_ = None
        self.class_count_ = None
        self.class_prior_ = None
        self.theta_ = None    # Means
        self.var_ = None      # Variances
        self.epsilon_ = None  # Actual smoothing value

    def fit(self, X, y):
        """
        Fit Gaussian Naive Bayes according to X, y.

        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Training vectors
        y : array-like of shape (n_samples,)
            Target values

        Returns
        -------
        self : object
            Fitted estimator
        """
        X = np.asarray(X, dtype=np.float64)
        y = np.asarray(y)
        n_samples, n_features = X.shape

        # Identify classes
        self.classes_ = np.unique(y)
        n_classes = len(self.classes_)

        # Initialize arrays
        self.class_count_ = np.zeros(n_classes, dtype=np.float64)
        self.theta_ = np.zeros((n_classes, n_features), dtype=np.float64)
        self.var_ = np.zeros((n_classes, n_features), dtype=np.float64)

        # Compute statistics for each class
        for k, cls in enumerate(self.classes_):
            X_class = X[y == cls]
            self.class_count_[k] = X_class.shape[0]
            self.theta_[k, :] = X_class.mean(axis=0)
            self.var_[k, :] = X_class.var(axis=0)

        # Compute priors
        self.class_prior_ = self.class_count_ / self.class_count_.sum()

        # Compute variance smoothing (relative to largest variance)
        self.epsilon_ = self.var_smoothing * self.var_.max()
        self.var_ += self.epsilon_
        return self

    def _joint_log_likelihood(self, X):
        """
        Compute joint log-likelihood: log P(y, x) = log P(x|y) + log P(y)

        Returns
        -------
        jll : array of shape (n_samples, n_classes)
        """
        X = np.asarray(X, dtype=np.float64)
        n_samples = X.shape[0]
        n_classes = len(self.classes_)
        jll = np.zeros((n_samples, n_classes))

        for k in range(n_classes):
            # Log-likelihood for class k: sum over features of
            # log N(x_j | μ_jk, σ²_jk) = -0.5 * [(x_j - μ_jk)² / σ²_jk + log(2π σ²_jk)]
            diff = X - self.theta_[k]
            log_lik = -0.5 * np.sum(
                (diff ** 2) / self.var_[k] + np.log(2 * np.pi * self.var_[k]),
                axis=1
            )
            # Add log prior
            jll[:, k] = log_lik + np.log(self.class_prior_[k])
        return jll

    def predict_log_proba(self, X):
        """Return log-probability estimates for test vector X."""
        jll = self._joint_log_likelihood(X)
        # Normalize using log-sum-exp
        log_prob_x = logsumexp(jll, axis=1, keepdims=True)
        return jll - log_prob_x

    def predict_proba(self, X):
        """Return probability estimates for test vector X."""
        return np.exp(self.predict_log_proba(X))

    def predict(self, X):
        """Perform classification on test vectors X."""
        jll = self._joint_log_likelihood(X)
        return self.classes_[np.argmax(jll, axis=1)]

    def score(self, X, y):
        """Return mean accuracy on the given test data and labels."""
        return np.mean(self.predict(X) == y)

def full_demonstration():
    """Complete demonstration of Gaussian NB training and prediction."""
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB as SklearnGNB

    print("=" * 60)
    print("COMPLETE GAUSSIAN NAIVE BAYES DEMONSTRATION")
    print("=" * 60)

    # Load the Iris dataset
    iris = load_iris()
    X, y = iris.data, iris.target
    print("Dataset: Iris")
    print(f"Samples: {X.shape[0]}, Features: {X.shape[1]}, Classes: {len(np.unique(y))}")

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )

    # Train our model
    model = GaussianNaiveBayes()
    model.fit(X_train, y_train)

    # Train sklearn model for comparison
    sklearn_model = SklearnGNB()
    sklearn_model.fit(X_train, y_train)

    print("--- Learned Parameters ---")
    print(f"Classes: {model.classes_}")
    print(f"Class counts: {model.class_count_}")
    print(f"Class priors: {model.class_prior_}")
    print(f"Means (theta): {model.theta_}")
    print(f"Variances (first 2 features): {model.var_[:, :2]}")

    # Evaluate
    our_accuracy = model.score(X_test, y_test)
    sklearn_accuracy = sklearn_model.score(X_test, y_test)
    print("--- Evaluation ---")
    print(f"Our implementation accuracy: {our_accuracy:.4f}")
    print(f"sklearn accuracy: {sklearn_accuracy:.4f}")

    # Verify predictions match
    our_preds = model.predict(X_test)
    sklearn_preds = sklearn_model.predict(X_test)
    print(f"Predictions match sklearn: {np.all(our_preds == sklearn_preds)}")

    # Example probability predictions
    print("--- Sample Predictions with Probabilities ---")
    for i in [0, 15, 30]:
        probs = model.predict_proba(X_test[i:i+1])[0]
        pred = model.predict(X_test[i:i+1])[0]
        true = y_test[i]
        print(f"Sample {i}: True={true}, Pred={pred}, Probs={probs.round(4)}")

full_demonstration()
```

We have developed a complete understanding of parameter estimation in Gaussian Naive Bayes.
What's next:
With parameters estimated from training data, we can now classify new observations. The next page explores the decision boundary—the geometric surface that separates classes in feature space. We'll see how the Gaussian parameters determine this boundary and discover when it's linear versus quadratic.
You now understand how to estimate Gaussian Naive Bayes parameters from data. The elegant closed-form solutions—sample means and variances—make training extremely fast and simple. With variance smoothing and regularization, the model handles edge cases robustly. Next, we analyze the geometry of classification decisions.