In our exploration of Naive Bayes, we have encountered powerful techniques for discrete features: Multinomial Naive Bayes excels at text classification by modeling word frequencies, while Bernoulli Naive Bayes handles binary feature vectors elegantly. But what happens when our features are continuous—real-valued measurements like height, temperature, stock prices, or pixel intensities?
Consider a medical diagnosis system that must classify patients based on blood pressure (120.5 mmHg), body temperature (37.2°C), and cholesterol level (195 mg/dL). These are not word counts or binary flags—they are measurements from a continuous spectrum. How do we compute $P(\text{blood pressure} = 120.5 | \text{disease})$? The probability of observing any exact real number is technically zero under continuous distributions.
This fundamental challenge motivates Gaussian Naive Bayes, where we model each continuous feature as following a Gaussian (normal) distribution within each class. This elegant assumption transforms an impossible probability computation into a tractable density estimation problem.
By the end of this page, you will understand: (1) why discrete Naive Bayes methods fail for continuous features, (2) the distinction between probability mass functions and probability density functions, (3) how to use probability densities in Bayes' theorem, (4) the concept of likelihood versus probability, and (5) why Gaussian distributions are a natural choice for continuous features.
Let us first understand precisely why the techniques we developed for Multinomial and Bernoulli Naive Bayes cannot be directly applied to continuous features.
For discrete features (e.g., word counts in text), we estimate class-conditional probabilities by counting:
$$P(x_j = v | y = k) = \frac{\text{count of samples in class } k \text{ where feature } j = v}{\text{total samples in class } k}$$
This works because each feature takes values from a small, finite set, every possible value has nonzero probability, and the relative frequencies in the training data converge to the true class-conditional probabilities as the sample grows.
For continuous features, these assumptions break down:
Problem 1: Infinite possible values
A feature like height can take any value in a continuous range (e.g., 150.00000... to 200.00000... cm). There are uncountably infinite possible values, not a finite set.
Problem 2: Zero probability of exact values
In a continuous distribution, the probability of observing any exact value is zero: $$P(X = 175.3247281...) = 0$$
This is because probability mass must be distributed over infinitely many possible values.
Problem 3: Counting doesn't work
If we try to estimate $P(\text{height} = 175.0 | \text{male})$ by counting, we might find that out of 1000 male samples, exactly zero have height precisely equal to 175.0 (maybe 174.98 or 175.02, but not exactly 175.0). Our probability estimate would be zero, which is both mathematically correct and practically useless.
A common but flawed approach is to discretize continuous features into bins (e.g., height < 160, 160-170, 170-180, > 180). While this can work in some cases, it introduces arbitrary boundaries, loses information, and can dramatically affect classifier performance. The choice of bin edges becomes a hyperparameter that is difficult to optimize and often dataset-specific.
```python
import numpy as np
import matplotlib.pyplot as plt

# Generate continuous data for two classes
np.random.seed(42)
n_samples = 500

# Class 0: lower values
class_0 = np.random.normal(loc=45, scale=8, size=n_samples)
# Class 1: higher values
class_1 = np.random.normal(loc=55, scale=8, size=n_samples)

# Demonstrate the discretization problem
def discretize_and_classify(data, labels, n_bins):
    """
    Show how discretization loses information and creates arbitrary boundaries.
    """
    # Combine data
    all_data = np.concatenate([data[labels == 0], data[labels == 1]])
    all_labels = labels

    # Create bin edges
    bin_edges = np.linspace(all_data.min(), all_data.max(), n_bins + 1)

    # Assign to bins
    binned_data = np.digitize(all_data, bin_edges[:-1]) - 1

    # Count per class per bin
    class_0_counts = np.zeros(n_bins)
    class_1_counts = np.zeros(n_bins)
    for bin_idx in range(n_bins):
        class_0_counts[bin_idx] = np.sum((binned_data == bin_idx) & (all_labels == 0))
        class_1_counts[bin_idx] = np.sum((binned_data == bin_idx) & (all_labels == 1))

    return bin_edges, class_0_counts, class_1_counts

# Compare different numbers of bins
all_data = np.concatenate([class_0, class_1])
labels = np.concatenate([np.zeros(n_samples), np.ones(n_samples)])

print("=" * 60)
print("DISCRETIZATION SENSITIVITY ANALYSIS")
print("=" * 60)

for n_bins in [5, 10, 20, 50]:
    edges, c0, c1 = discretize_and_classify(all_data, labels, n_bins)

    # Find the bins where the decision changes (class 0 > class 1 vs class 1 > class 0)
    decision_changes = []
    for i in range(n_bins - 1):
        ratio_current = c0[i] / max(c1[i], 1)
        ratio_next = c0[i + 1] / max(c1[i + 1], 1)
        if (ratio_current > 1) != (ratio_next > 1):
            decision_changes.append(edges[i + 1])

    print(f"\nBins: {n_bins}")
    print(f"  Decision boundary locations: {decision_changes}")
    print(f"  Number of boundary shifts: {len(decision_changes)}")

    # Check for empty bins (problematic for probability estimation)
    empty_c0 = np.sum(c0 == 0)
    empty_c1 = np.sum(c1 == 0)
    print(f"  Empty bins (class 0): {empty_c0}")
    print(f"  Empty bins (class 1): {empty_c1}")

print("\n" + "=" * 60)
print("KEY INSIGHT: The decision boundary MOVES based on arbitrary")
print("bin count choice. This is not a property of the data!")
print("=" * 60)
```

The mathematical tool for handling continuous random variables is the probability density function (PDF), which fundamentally differs from the probability mass functions (PMFs) used for discrete variables.
Probability Mass Function (PMF) — for discrete variables: $p(x) = P(X = x)$ assigns a probability directly to each possible value, and these probabilities sum to 1 over the support.

Probability Density Function (PDF) — for continuous variables: the density $f(x)$ does not give probabilities directly. Instead, probabilities are obtained by integrating the density over an interval:
$$P(a \leq X \leq b) = \int_a^b f(x) dx$$
The density $f(x)$ indicates the relative likelihood of observing values near $x$. If $f(x_1) = 2 \cdot f(x_2)$, then observing a value near $x_1$ is twice as likely as observing a value near $x_2$.
Crucial insight: While $P(X = x) = 0$ for any specific $x$, the density $f(x)$ can be large or small, telling us where values are more or less likely to concentrate.
$$f(x) = \lim_{\epsilon \to 0} \frac{P(x - \epsilon < X < x + \epsilon)}{2\epsilon}$$
The density is the limit of probability per unit length as the interval shrinks to zero.
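To make this limit concrete, here is a quick numerical check (a minimal sketch; the standard normal distribution and the evaluation point $x = 1$ are chosen only for illustration): as the interval shrinks, probability divided by interval width converges to the density.

```python
from scipy import stats

# Standard normal; approximate the density at x = 1.0 via the limit definition
dist = stats.norm(loc=0, scale=1)
x = 1.0

for eps in [0.5, 0.1, 0.01, 0.001]:
    interval_prob = dist.cdf(x + eps) - dist.cdf(x - eps)  # P(x-eps < X < x+eps)
    print(f"eps={eps:<6} P/(2*eps) = {interval_prob / (2 * eps):.6f}")

print(f"Exact density f({x}) = {dist.pdf(x):.6f}")  # the limit target, about 0.241971
```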
| Property | PMF (Discrete) | PDF (Continuous) |
|---|---|---|
| Output interpretation | $p(x) = P(X = x)$ | $f(x) = $ density at $x$ |
| Range of output | $0 \leq p(x) \leq 1$ | $f(x) \geq 0$, can exceed 1 |
| Normalization | $\sum_x p(x) = 1$ | $\int f(x) dx = 1$ |
| Point probability | $P(X = x) = p(x) > 0$ possible | $P(X = x) = 0$ always |
| Interval probability | Sum over integers in interval | Integral over interval |
| Typical examples | Binomial, Poisson, Categorical | Gaussian, Exponential, Uniform |
Unlike probabilities, density values can be greater than 1. For example, the uniform distribution on [0, 0.5] has density $f(x) = 2$ for $x \in [0, 0.5]$. The total area under the curve (probability) is still 1, but the height (density) is 2. This is a common source of confusion when first encountering continuous distributions.
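A quick check of the uniform example above (a sketch; note that `scipy.stats.uniform` parameterizes the distribution by `loc` and `scale`, i.e. the interval `[loc, loc + scale]`):

```python
from scipy import stats

# Uniform on [0, 0.5]: the density is 1 / 0.5 = 2 everywhere on the interval
u = stats.uniform(loc=0, scale=0.5)

print(f"Density at x=0.25: {u.pdf(0.25):.1f}")              # 2.0, a valid density greater than 1
print(f"P(0 <= X <= 0.5):  {u.cdf(0.5) - u.cdf(0.0):.1f}")  # the total probability is still 1
```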
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Demonstrate PMF vs PDF concepts
print("=" * 60)
print("PMF vs PDF DEMONSTRATION")
print("=" * 60)

# PMF Example: Poisson distribution (discrete)
print("\n--- PMF: Poisson Distribution (λ=3) ---")
poisson = stats.poisson(mu=3)
for x in range(10):
    pmf_val = poisson.pmf(x)
    print(f"P(X = {x}) = {pmf_val:.4f}")
print(f"Sum of PMF values: {sum(poisson.pmf(x) for x in range(100)):.6f}")

# PDF Example: Normal distribution (continuous)
print("\n--- PDF: Normal Distribution (μ=0, σ=0.3) ---")
normal = stats.norm(loc=0, scale=0.3)
x_values = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
for x in x_values:
    pdf_val = normal.pdf(x)
    print(f"f(X = {x}) = {pdf_val:.4f}")  # Note: can exceed 1!

print(f"\nPDF at x=0: {normal.pdf(0):.4f}")  # > 1 because σ is small
print(f"This exceeds 1, but the INTEGRAL equals: {normal.cdf(np.inf) - normal.cdf(-np.inf):.6f}")

# Show that point probabilities are zero
print("\n--- Point vs Interval Probability ---")
print(f"P(X = 0.0 exactly) = {0.0:.10f} (always zero for continuous)")
print(f"P(-0.1 < X < 0.1) = {normal.cdf(0.1) - normal.cdf(-0.1):.6f}")
print(f"P(-0.01 < X < 0.01) = {normal.cdf(0.01) - normal.cdf(-0.01):.6f}")
print(f"P(-0.001 < X < 0.001) = {normal.cdf(0.001) - normal.cdf(-0.001):.6f}")
print("\nAs interval shrinks → probability → 0")

# Demonstrate relative likelihood interpretation
print("\n--- Relative Likelihood Interpretation ---")
# Compare densities at two points
x1, x2 = 0, 1
f_x1, f_x2 = normal.pdf(x1), normal.pdf(x2)
print(f"Density at x={x1}: {f_x1:.4f}")
print(f"Density at x={x2}: {f_x2:.4f}")
print(f"Ratio: {f_x1 / f_x2:.2f}x more likely near x={x1} than near x={x2}")
```

The elegant insight that enables Gaussian Naive Bayes is that we can substitute probability densities for probability mass in Bayes' theorem, and the classification still works correctly.
For discrete features: $$P(y = k | \mathbf{x}) = \frac{P(\mathbf{x} | y = k) P(y = k)}{P(\mathbf{x})}$$
We classify by finding: $$\hat{y} = \arg\max_k P(y = k | \mathbf{x}) = \arg\max_k P(\mathbf{x} | y = k) P(y = k)$$
For continuous features, we replace probabilities with probability densities: $$f(y = k | \mathbf{x}) \propto f(\mathbf{x} | y = k) P(y = k)$$
Where $f(\mathbf{x} | y = k)$ is the class-conditional density of the features, and $P(y = k)$ is the class prior, which remains an ordinary (discrete) probability.
We classify by finding: $$\hat{y} = \arg\max_k f(\mathbf{x} | y = k) P(y = k)$$
The key insight is that classification only ever compares classes against one another. For two classes, the decision reduces to the ratio: $$\frac{f(\mathbf{x} | y = 1) P(y = 1)}{f(\mathbf{x} | y = 0) P(y = 0)}$$
Any constants that don't depend on class $k$ cancel out. The densities serve as relative likelihoods—telling us which class makes the observed data more likely.
When we use $f(\mathbf{x} | y = k)$ for classification, we're computing the likelihood of the data under each class model. Higher density means the observed features are more consistent with that class's distribution. We don't need actual probabilities—we just need to compare which class better 'explains' the observed features.
The formal justification involves considering infinitesimally small regions around the observed point. For a small region $\mathcal{B}$ around $\mathbf{x}$:
$$P(y = k | \mathbf{x} \in \mathcal{B}) = \frac{P(\mathbf{x} \in \mathcal{B} | y = k) P(y = k)}{P(\mathbf{x} \in \mathcal{B})}$$
As $\mathcal{B}$ shrinks to a point: $$P(\mathbf{x} \in \mathcal{B} | y = k) \approx f(\mathbf{x} | y = k) \cdot |\mathcal{B}|$$
The volume $|\mathcal{B}|$ appears in both numerator and denominator, canceling out: $$P(y = k | \mathbf{x}) = \lim_{|\mathcal{B}| \to 0} \frac{f(\mathbf{x} | y = k) \cdot |\mathcal{B}| \cdot P(y = k)}{\sum_{j} f(\mathbf{x} | y = j) \cdot |\mathcal{B}| \cdot P(y = j)}$$
$$= \frac{f(\mathbf{x} | y = k) P(y = k)}{\sum_{j} f(\mathbf{x} | y = j) P(y = j)}$$
This is a proper posterior probability, summing to 1 across all classes.
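This shrinking-region argument can be checked numerically: compute the posterior from actual interval probabilities $P(x - \epsilon < X < x + \epsilon | y = k)$ and watch it converge to the density-based posterior as $\epsilon \to 0$. The sketch below uses two illustrative Gaussian classes and equal priors (all parameter values are arbitrary choices for demonstration).

```python
import numpy as np
from scipy import stats

# Two illustrative class-conditional distributions and equal priors
class_0 = stats.norm(loc=40, scale=5)
class_1 = stats.norm(loc=60, scale=5)
priors = np.array([0.5, 0.5])
x = 47.0

# Density-based posterior (the limit)
num = np.array([class_0.pdf(x), class_1.pdf(x)]) * priors
density_posterior = num / num.sum()
print(f"Density-based P(y=0|x) = {density_posterior[0]:.6f}")

# Interval-based posterior for shrinking intervals around x
for eps in [5.0, 1.0, 0.1, 0.01]:
    p0 = class_0.cdf(x + eps) - class_0.cdf(x - eps)
    p1 = class_1.cdf(x + eps) - class_1.cdf(x - eps)
    num = np.array([p0, p1]) * priors
    post = num / num.sum()
    print(f"eps={eps:<5} interval-based P(y=0|x) = {post[0]:.6f}")
```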
```python
import numpy as np
from scipy import stats

def density_based_classification(x, class_distributions, priors):
    """
    Classify a point using density-based Bayes' theorem.

    Parameters:
    -----------
    x : float or array
        The observation(s) to classify
    class_distributions : list of scipy.stats distributions
        Class-conditional density functions
    priors : array
        Prior probabilities for each class

    Returns:
    --------
    predicted_class : int
        The predicted class label
    posteriors : array
        Posterior probability for each class
    """
    n_classes = len(class_distributions)

    # Compute numerator for each class: f(x|k) * P(k)
    numerators = np.array([
        class_distributions[k].pdf(x) * priors[k]
        for k in range(n_classes)
    ])

    # Normalize to get proper posterior probabilities
    denominator = np.sum(numerators)
    posteriors = numerators / denominator

    predicted_class = np.argmax(posteriors)
    return predicted_class, posteriors

# Example: Two Gaussian classes
print("=" * 60)
print("DENSITY-BASED BAYESIAN CLASSIFICATION")
print("=" * 60)

# Define class-conditional distributions
class_0 = stats.norm(loc=40, scale=5)  # Class 0: mean=40, std=5
class_1 = stats.norm(loc=60, scale=5)  # Class 1: mean=60, std=5

distributions = [class_0, class_1]
priors = np.array([0.5, 0.5])  # Equal priors

# Classify several test points
test_points = [30, 40, 50, 60, 70]

print("\nClass 0: N(μ=40, σ=5)")
print("Class 1: N(μ=60, σ=5)")
print("Priors: P(class 0) = P(class 1) = 0.5")
print()

print(f"{'x':>6} | {'f(x|0)':>10} | {'f(x|1)':>10} | {'P(0|x)':>8} | {'P(1|x)':>8} | {'Pred':>6}")
print("-" * 65)

for x in test_points:
    pred, posteriors = density_based_classification(x, distributions, priors)
    f_0 = class_0.pdf(x)
    f_1 = class_1.pdf(x)
    print(f"{x:>6} | {f_0:>10.6f} | {f_1:>10.6f} | {posteriors[0]:>8.4f} | {posteriors[1]:>8.4f} | {pred:>6}")

print()
print("Note: Decision boundary is at x=50 where posteriors are equal")

# Show effect of unequal priors
print("\n" + "=" * 60)
print("EFFECT OF UNEQUAL PRIORS")
print("=" * 60)

priors_unequal = np.array([0.8, 0.2])  # Class 0 is more common
print(f"\nNew priors: P(class 0) = {priors_unequal[0]}, P(class 1) = {priors_unequal[1]}")
print()

x = 50  # The old decision boundary
pred_equal, post_equal = density_based_classification(x, distributions, np.array([0.5, 0.5]))
pred_unequal, post_unequal = density_based_classification(x, distributions, priors_unequal)

print(f"At x = {x}:")
print(f"  Equal priors:   P(0|x) = {post_equal[0]:.4f}, P(1|x) = {post_equal[1]:.4f} → Predict {pred_equal}")
print(f"  Unequal priors: P(0|x) = {post_unequal[0]:.4f}, P(1|x) = {post_unequal[1]:.4f} → Predict {pred_unequal}")
print("\nHigher prior for class 0 shifts decision boundary toward class 1!")
```

While any continuous density function could theoretically be used in the Bayesian framework, the Gaussian (normal) distribution is the canonical choice for continuous features. This preference is not arbitrary—it rests on deep theoretical and practical foundations.
Perhaps the most compelling argument comes from the Central Limit Theorem (CLT):
The sum (or average) of many independent random variables tends toward a normal distribution, regardless of the original distribution.
Many real-world measurements arise as the aggregate of many small, independent factors: adult height, for example, reflects the combined effect of many genetic and environmental influences, and measurement noise is typically the sum of many small, independent error sources.
Consequence: When we don't know the true distribution, Gaussian is often a reasonable assumption for measurements that result from aggregated effects.
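A small simulation illustrates the CLT claim (a sketch; the exponential distribution, sample sizes, and seed are arbitrary choices): averages of many skewed draws look increasingly Gaussian, as measured here by skewness and excess kurtosis shrinking toward zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Averages of n draws from a strongly skewed distribution (exponential)
for n in [1, 5, 30, 200]:
    averages = rng.exponential(scale=1.0, size=(20000, n)).mean(axis=1)
    print(f"n={n:>4}: skewness={stats.skew(averages):+.3f}, "
          f"excess kurtosis={stats.kurtosis(averages):+.3f}")

# A Gaussian has skewness 0 and excess kurtosis 0; both shrink toward 0 as n grows.
```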
Among all continuous distributions with a given mean and variance, the Gaussian distribution has maximum entropy. In information-theoretic terms:
$$H[X] = -\int f(x) \log f(x) dx$$
Maximizing entropy subject to constraints on mean and variance yields the Gaussian. This means the Gaussian commits to nothing beyond the first two moments: it is the least-assuming distribution consistent with a known mean and variance.
Maximum entropy is a principled way to choose distributions: make the fewest assumptions beyond what the data tells you. If you know only the mean and variance of a continuous random variable, the Gaussian is the unique distribution that maximizes uncertainty (entropy) while respecting these constraints. Using any other distribution implicitly assumes you know more than you actually do.
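The maximum-entropy claim can be checked numerically: among distributions matched to the same variance (here unit variance), the Gaussian has the largest differential entropy. This sketch uses `scipy.stats` distributions parameterized so that each has mean 0 and variance 1; the particular comparison distributions are our own choice.

```python
import numpy as np
from scipy import stats

# All three distributions below have mean 0 and variance 1
candidates = {
    "Gaussian": stats.norm(loc=0, scale=1),
    "Uniform":  stats.uniform(loc=-np.sqrt(3), scale=2 * np.sqrt(3)),  # width sqrt(12)
    "Laplace":  stats.laplace(loc=0, scale=1 / np.sqrt(2)),            # variance = 2 * b^2
}

for name, dist in candidates.items():
    print(f"{name:>8}: variance = {dist.var():.3f}, "
          f"differential entropy = {dist.entropy():.4f}")

# The Gaussian's entropy (about 1.4189 nats) exceeds the others, as the theorem predicts.
```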
Gaussian distributions have remarkable mathematical properties:
Closed-form density: Easy to evaluate $$f(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
Simple parameter estimation: MLE gives $\hat{\mu} = \bar{x}$ and $\hat{\sigma}^2 = \frac{1}{n}\sum_i(x_i - \bar{x})^2$ (checked numerically in the sketch after this list)
Log-density is quadratic: Enables linear algebra in log-space $$\log f(x | \mu, \sigma^2) = -\frac{(x-\mu)^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2)$$
Closure under linear operations: Sums of Gaussians are Gaussian
Conjugate prior structure: Enables Bayesian inference
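As a quick check of the MLE property listed above (a sketch; the simulated data are arbitrary, and `scipy.stats.norm.fit` is used only because it returns the maximum-likelihood `loc` and `scale` for a Gaussian):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(loc=10.0, scale=3.0, size=5000)

# Closed-form MLE: sample mean and (biased) sample variance
mu_hat = x.mean()
sigma2_hat = np.mean((x - mu_hat) ** 2)

# scipy's numerical MLE for comparison
loc_fit, scale_fit = stats.norm.fit(x)

print(f"Closed-form MLE: mu = {mu_hat:.4f}, sigma^2 = {sigma2_hat:.4f}")
print(f"scipy norm.fit:  mu = {loc_fit:.4f}, sigma^2 = {scale_fit**2:.4f}")
# The two agree: the Gaussian MLE has a simple closed form, no iterative optimization needed.
```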
Empirical evidence consistently shows that Gaussian Naive Bayes trains quickly, scales easily to many features, and often remains competitive even when the class-conditional feature distributions are only approximately Gaussian.
Having established that we can use probability densities in Bayes' theorem and that Gaussians are a principled choice, we now apply the naive Bayes assumption to factorize the joint class-conditional density.
For a feature vector $\mathbf{x} = (x_1, x_2, \ldots, x_d)$, the naive Bayes assumption states: $$f(\mathbf{x} | y = k) = \prod_{j=1}^{d} f(x_j | y = k)$$
Features are assumed to be conditionally independent given the class label. This allows us to model each feature's distribution separately within each class.
Under Gaussian Naive Bayes, each univariate conditional density is Gaussian: $$f(x_j | y = k) = \mathcal{N}(x_j | \mu_{jk}, \sigma_{jk}^2) = \frac{1}{\sqrt{2\pi\sigma_{jk}^2}} \exp\left(-\frac{(x_j - \mu_{jk})^2}{2\sigma_{jk}^2}\right)$$
Where $\mu_{jk}$ and $\sigma_{jk}^2$ are the mean and variance of feature $j$ among the training samples of class $k$.
The full Gaussian Naive Bayes classifier is: $$f(\mathbf{x} | y = k) = \prod_{j=1}^{d} \frac{1}{\sqrt{2\pi\sigma_{jk}^2}} \exp\left(-\frac{(x_j - \mu_{jk})^2}{2\sigma_{jk}^2}\right)$$
Taking the logarithm (for numerical stability and convenience): $$\log f(\mathbf{x} | y = k) = -\frac{1}{2} \sum_{j=1}^{d} \left[ \frac{(x_j - \mu_{jk})^2}{\sigma_{jk}^2} + \log(2\pi\sigma_{jk}^2) \right]$$
To classify a new point $\mathbf{x}$, compute: $$\hat{y} = \arg\max_k \left[ \log P(y = k) + \sum_{j=1}^{d} \log f(x_j | y = k) \right]$$
$$= \arg\max_k \left[ \log \pi_k - \frac{1}{2} \sum_{j=1}^{d} \left( \frac{(x_j - \mu_{jk})^2}{\sigma_{jk}^2} + \log \sigma_{jk}^2 \right) \right]$$
Where $\pi_k = P(y = k)$ is the class prior.
Always work in log-space when implementing Naive Bayes. Multiplying many small probabilities or densities causes numerical underflow (the product rounds to zero). Adding log-densities is numerically stable and computationally efficient. Most implementations use log-likelihoods throughout.
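A tiny demonstration of why this matters (a sketch with made-up density values): multiplying many small densities underflows to zero in floating point, while summing their logarithms stays well-behaved.

```python
import numpy as np

# Pretend each of 500 features contributes a small class-conditional density
densities = np.full(500, 1e-3)

product = np.prod(densities)          # underflows: 1e-1500 is far below the float64 range
log_sum = np.sum(np.log(densities))   # perfectly representable

print(f"Direct product:    {product}")       # 0.0
print(f"Sum of log values: {log_sum:.1f}")   # about -3453.9 (= 500 * ln(1e-3))
```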
```python
import numpy as np
from scipy import stats

class GaussianNaiveBayesFromScratch:
    """
    Gaussian Naive Bayes classifier built from first principles.

    This implementation clearly shows the mathematical structure:
    - Each feature j in each class k has its own Gaussian distribution
    - Classification uses log-densities for numerical stability
    """

    def __init__(self):
        self.classes_ = None
        self.means_ = None      # Shape: (n_classes, n_features)
        self.variances_ = None  # Shape: (n_classes, n_features)
        self.priors_ = None     # Shape: (n_classes,)

    def fit(self, X, y):
        """
        Estimate Gaussian parameters for each feature in each class.
        """
        n_samples, n_features = X.shape
        self.classes_ = np.unique(y)
        n_classes = len(self.classes_)

        # Initialize parameter arrays
        self.means_ = np.zeros((n_classes, n_features))
        self.variances_ = np.zeros((n_classes, n_features))
        self.priors_ = np.zeros(n_classes)

        for k, cls in enumerate(self.classes_):
            X_cls = X[y == cls]  # Samples belonging to class k
            n_cls = len(X_cls)

            # Prior: P(y = k) = fraction of samples in class k
            self.priors_[k] = n_cls / n_samples

            # Mean and variance for each feature in this class
            self.means_[k, :] = X_cls.mean(axis=0)
            self.variances_[k, :] = X_cls.var(axis=0)

        # Add small epsilon for numerical stability (avoid division by zero)
        self.variances_ = np.maximum(self.variances_, 1e-9)

        return self

    def _log_likelihood(self, X, class_idx):
        """
        Compute log-likelihood of samples under class distribution.

        log f(x | y=k) = sum_j log f(x_j | y=k)
                       = -0.5 * sum_j [(x_j - μ_jk)² / σ²_jk + log(2π σ²_jk)]
        """
        means = self.means_[class_idx]
        variances = self.variances_[class_idx]

        # Compute log density for each feature, then sum
        log_densities = -0.5 * (
            ((X - means) ** 2) / variances
            + np.log(2 * np.pi * variances)
        )

        # Sum across features (independence assumption)
        return log_densities.sum(axis=1)

    def predict_log_proba(self, X):
        """
        Compute log-posterior probabilities for each class.
        """
        n_samples = X.shape[0]
        n_classes = len(self.classes_)

        # log P(x|k) + log P(k) for each class
        log_posteriors = np.zeros((n_samples, n_classes))
        for k in range(n_classes):
            log_posteriors[:, k] = (
                self._log_likelihood(X, k) + np.log(self.priors_[k])
            )

        # Normalize using log-sum-exp trick
        log_sum = np.log(np.exp(log_posteriors - log_posteriors.max(axis=1, keepdims=True)).sum(axis=1, keepdims=True))
        log_posteriors = log_posteriors - log_posteriors.max(axis=1, keepdims=True) - log_sum

        return log_posteriors

    def predict_proba(self, X):
        """Convert log probabilities to probabilities."""
        return np.exp(self.predict_log_proba(X))

    def predict(self, X):
        """Predict class labels."""
        return self.classes_[np.argmax(self.predict_log_proba(X), axis=1)]

# Demonstration
np.random.seed(42)

# Generate synthetic 2-class data with 3 features
n_per_class = 200

# Class 0: lower values on all features
X_0 = np.column_stack([
    np.random.normal(50, 10, n_per_class),
    np.random.normal(30, 5, n_per_class),
    np.random.normal(100, 20, n_per_class)
])

# Class 1: higher values on all features
X_1 = np.column_stack([
    np.random.normal(70, 10, n_per_class),
    np.random.normal(50, 5, n_per_class),
    np.random.normal(140, 20, n_per_class)
])

X = np.vstack([X_0, X_1])
y = np.array([0] * n_per_class + [1] * n_per_class)

# Shuffle
shuffle_idx = np.random.permutation(len(y))
X, y = X[shuffle_idx], y[shuffle_idx]

# Split into train/test
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

# Train and evaluate
gnb = GaussianNaiveBayesFromScratch()
gnb.fit(X_train, y_train)

print("=" * 60)
print("GAUSSIAN NAIVE BAYES: LEARNED PARAMETERS")
print("=" * 60)

for k in range(2):
    print(f"\nClass {k}:")
    print(f"  Prior P(y={k}) = {gnb.priors_[k]:.4f}")
    for j in range(3):
        print(f"  Feature {j}: μ = {gnb.means_[k,j]:.2f}, σ² = {gnb.variances_[k,j]:.2f}")

# Predictions
y_pred = gnb.predict(X_test)
accuracy = np.mean(y_pred == y_test)

print(f"\nTest Accuracy: {accuracy:.4f}")

# Example predictions with probabilities
print("\n" + "=" * 60)
print("EXAMPLE PREDICTIONS")
print("=" * 60)
for i in range(5):
    probs = gnb.predict_proba(X_test[i:i+1])[0]
    print(f"Sample {i}: True={y_test[i]}, Pred={y_pred[i]}, P(y=0)={probs[0]:.4f}, P(y=1)={probs[1]:.4f}")
```

Understanding the number of parameters in Gaussian Naive Bayes reveals why it's such an efficient model and why the naive independence assumption is so powerful.
For a problem with $K$ classes, $d$ features:
Class priors: $K - 1$ free parameters (the $K$-th is determined by constraint $\sum_k \pi_k = 1$)
Gaussian parameters: one mean $\mu_{jk}$ and one variance $\sigma_{jk}^2$ per feature per class, i.e. $Kd$ means plus $Kd$ variances, for $2Kd$ in total
Total parameters: $$\text{Parameters} = (K - 1) + 2Kd = 2Kd + K - 1$$
For typical cases this stays small: $K = 2$, $d = 10$ requires only $41$ parameters, and even $K = 10$, $d = 100$ requires just $2{,}009$ (see the table below).
Without the naive Bayes assumption, we would model the full class-conditional distribution as a multivariate Gaussian: $$f(\mathbf{x} | y = k) = \mathcal{N}(\mathbf{x} | \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
A full $d \times d$ covariance matrix has $\frac{d(d+1)}{2}$ parameters per class: $$\text{Full model parameters} = Kd + K \cdot \frac{d(d+1)}{2} = Kd + \frac{Kd(d+1)}{2}$$
For 10 classes, 100 features: $10 \times 100 + 10 \times \frac{100 \times 101}{2} = 1000 + 50,500 = 51,500$ parameters!
| Model | Formula | K=2, d=10 | K=10, d=100 |
|---|---|---|---|
| Gaussian Naive Bayes | $2Kd + K - 1$ | 41 | 2,009 |
| Full Covariance GDA | $Kd + \frac{Kd(d+1)}{2}$ | 130 | 51,500 |
| Ratio (Full/NB) | $\approx (d+3)/4$ | 3.2× | 25.6× |
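The table values follow directly from the two formulas; here is a short sketch to reproduce them:

```python
def gnb_params(K, d):
    """Gaussian Naive Bayes: (K-1) priors + K*d means + K*d variances."""
    return 2 * K * d + K - 1

def full_cov_params(K, d):
    """Full-covariance Gaussian model: K*d means + K*d(d+1)/2 covariance entries."""
    return K * d + K * d * (d + 1) // 2

for K, d in [(2, 10), (10, 100)]:
    nb, full = gnb_params(K, d), full_cov_params(K, d)
    print(f"K={K:>2}, d={d:>3}: GNB={nb:>6}, Full={full:>6}, ratio={full / nb:.1f}x")

# K= 2, d= 10: GNB=    41, Full=   130, ratio=3.2x
# K=10, d=100: GNB=  2009, Full= 51500, ratio=25.6x
```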
Gaussian Naive Bayes has fewer parameters (lower variance, less overfitting risk) but makes a stronger assumption (higher bias). The full covariance model is more flexible but requires more data to estimate reliably. In high dimensions with limited data, the naive model often outperforms the 'correct' model because it can be estimated more reliably.
The naive Bayes assumption is equivalent to assuming a diagonal covariance matrix:
$$\boldsymbol{\Sigma}_k = \begin{pmatrix} \sigma_{1k}^2 & 0 & \cdots & 0 \\ 0 & \sigma_{2k}^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_{dk}^2 \end{pmatrix}$$
Zero off-diagonal entries mean zero correlation between features (conditional on class). This is exactly the conditional independence assumption expressed in linear algebra.
Key insight: Gaussian Naive Bayes assumes that knowing feature $x_1$ tells us nothing about feature $x_2$, once we know the class. The features may still be marginally correlated (in the overall data), but within each class, they are assumed independent.
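This distinction between marginal and class-conditional correlation is easy to simulate (a sketch; the particular means, spread, and sample size are arbitrary): two features generated independently within each class can still be strongly correlated when the classes are pooled.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Within each class, the two features are independent (diagonal covariance),
# but both means shift together with the class label.
class_0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(n, 2))
class_1 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(n, 2))
pooled = np.vstack([class_0, class_1])

print(f"Within class 0 corr:    {np.corrcoef(class_0.T)[0, 1]:+.3f}")  # approximately 0
print(f"Within class 1 corr:    {np.corrcoef(class_1.T)[0, 1]:+.3f}")  # approximately 0
print(f"Marginal (pooled) corr: {np.corrcoef(pooled.T)[0, 1]:+.3f}")   # strongly positive
```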
With fewer parameters, Gaussian Naive Bayes can be trained reliably with less data:
Rule of thumb: aim for at least 5-10× as many samples as parameters for reliable estimation. For the 10-class, 100-feature example above, Gaussian Naive Bayes has about 2,000 parameters and so needs on the order of 10,000-20,000 samples, while the full-covariance model's 51,500 parameters would call for several hundred thousand.
This explains why Naive Bayes often outperforms more sophisticated models on small-to-medium datasets.
We have established the theoretical and practical foundations for applying Naive Bayes to continuous features. Let us consolidate the key concepts:

- Counting-based probability estimates fail for continuous features: every exact value has probability zero, and discretization introduces arbitrary, unstable bin boundaries.
- Probability density functions replace probability mass functions. Densities are not probabilities (they can exceed 1), but they integrate to 1 and indicate where values concentrate.
- Densities can be substituted for probabilities in Bayes' theorem: the infinitesimal volume cancels between numerator and denominator, so the posterior remains a proper probability.
- Densities act as relative likelihoods, letting us compare which class best explains the observed features rather than demanding absolute probabilities.
- The Gaussian is a principled default (Central Limit Theorem, maximum entropy, mathematical convenience), and the naive independence assumption corresponds to a diagonal covariance matrix with only $2Kd + K - 1$ parameters.
What's next:
Having understood why and how we model continuous features with Gaussians, the next page dives deep into the Gaussian distribution itself. We will explore its mathematical properties, the role of mean and variance parameters, and develop intuition for how Gaussian shapes encode information about data distributions.
You now understand the fundamental challenge of continuous features and why Gaussian Naive Bayes provides an elegant solution. The key insight is that probability densities serve as relative likelihoods—even though point probabilities are zero, we can still compare which class better explains the observed data. Next, we explore the Gaussian distribution in depth.