Gaussian Naive Bayes is not an isolated technique—it belongs to a rich family of classifiers that model class-conditional distributions as Gaussians. Understanding this family reveals fundamental insights about classifier design, the bias-variance tradeoff, and when different assumptions are appropriate.
The family includes:
- **Gaussian Naive Bayes (GNB)**: a diagonal covariance matrix per class
- **Linear Discriminant Analysis (LDA)**: one full covariance matrix shared by all classes
- **Quadratic Discriminant Analysis (QDA)**: a full covariance matrix per class
These methods differ in what they assume about the covariance structure of the class-conditional distributions, and their relationships illuminate deep principles of classifier design.
This page explores these connections mathematically, revealing the unified theory underlying Gaussian generative classifiers.
By the end of this page, you will understand: (1) the covariance assumptions of GNB, LDA, and QDA, (2) how these assumptions affect decision boundaries, (3) the mathematical relationship between these methods, (4) parameter count comparison and sample complexity, (5) when to use each method, and (6) regularization as a continuum between methods.
All three methods share the same fundamental approach: model class-conditional distributions as multivariate Gaussians, then apply Bayes' theorem for classification.
The $d$-dimensional multivariate Gaussian distribution is:
$$f(\mathbf{x} | \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$
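As a sanity check on this formula, the sketch below evaluates each term of the density by hand (for illustrative values of $\boldsymbol{\mu}$, $\boldsymbol{\Sigma}$, and $\mathbf{x}$ not taken from the text) and compares against `scipy.stats.multivariate_normal`:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative 2-D Gaussian (values chosen for the example, not from the text)
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
x = np.array([1.0, 0.5])

d = len(mu)
diff = x - mu

# The density formula, term by term
quad = diff @ np.linalg.inv(Sigma) @ diff            # squared Mahalanobis distance
norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
pdf_manual = np.exp(-0.5 * quad) / norm_const

# Cross-check against scipy's implementation
pdf_scipy = multivariate_normal(mu, Sigma).pdf(x)
assert np.isclose(pdf_manual, pdf_scipy)
```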
Where:
- $\boldsymbol{\mu} \in \mathbb{R}^d$ is the mean vector
- $\boldsymbol{\Sigma}$ is the $d \times d$ covariance matrix (symmetric, positive definite)
- $|\boldsymbol{\Sigma}|$ is the determinant of $\boldsymbol{\Sigma}$, and $\boldsymbol{\Sigma}^{-1}$ is its inverse
For class $k$, we model: $$\mathbf{x} | y = k \sim \mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
With class prior $\pi_k = P(y = k)$.
Classification uses Bayes' rule: $$P(y = k | \mathbf{x}) = \frac{f(\mathbf{x} | y = k) \pi_k}{\sum_j f(\mathbf{x} | y = j) \pi_j}$$
We classify by comparing discriminant functions: $$\delta_k(\mathbf{x}) = \log f(\mathbf{x} | y = k) + \log \pi_k$$
$$= -\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k) - \frac{1}{2}\log|\boldsymbol{\Sigma}_k| - \frac{d}{2}\log(2\pi) + \log \pi_k$$
The classification rule is: $$\hat{y} = \arg\max_k \delta_k(\mathbf{x})$$
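This shared pipeline can be sketched directly from the discriminant formula. The sketch below uses illustrative means, covariances, and priors (not values from the text); all three methods differ only in what they plug in for each $\boldsymbol{\Sigma}_k$:

```python
import numpy as np

def discriminant(x, mu, Sigma, prior):
    """delta_k(x) = log f(x | y=k) + log pi_k, including the constant term."""
    d = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(Sigma) @ diff
            - 0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * d * np.log(2 * np.pi)
            + np.log(prior))

# Toy two-class problem (illustrative parameters)
mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigmas = [np.eye(2), np.eye(2)]
priors = [0.5, 0.5]

x = np.array([1.6, 1.8])
scores = [discriminant(x, m, S, p) for m, S, p in zip(mus, Sigmas, priors)]
y_hat = int(np.argmax(scores))   # arg max over delta_k; this point is near (2, 2)
```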
The key question: What structure do we impose on the covariance matrices $\boldsymbol{\Sigma}_k$?
The term $(\mathbf{x} - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)$ is the squared Mahalanobis distance from $\mathbf{x}$ to class $k$'s mean. It measures distance in units of the class's covariance structure—a point one standard deviation away in the direction of highest variance is 'closer' than one standard deviation in a low-variance direction.
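The variance-scaling effect described above can be made concrete. With an illustrative diagonal covariance (variance 4 along one axis, 0.25 along the other), two points at equal Euclidean distance from the mean get very different Mahalanobis distances:

```python
import numpy as np

# Illustrative class covariance: high variance along x1, low along x2
Sigma = np.diag([4.0, 0.25])
Sigma_inv = np.linalg.inv(Sigma)
mu = np.zeros(2)

def mahalanobis_sq(x):
    diff = x - mu
    return diff @ Sigma_inv @ diff

# Both points are Euclidean distance 1 from the mean...
d_high_var = mahalanobis_sq(np.array([1.0, 0.0]))  # along the high-variance axis
d_low_var  = mahalanobis_sq(np.array([0.0, 1.0]))  # along the low-variance axis

# ...but the high-variance direction counts as "closer"
assert d_high_var == 0.25   # 1^2 / 4
assert d_low_var == 4.0     # 1^2 / 0.25
```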
The three methods differ in their covariance structure assumptions:
Assumption: Each class has its own unrestricted covariance matrix. $$\boldsymbol{\Sigma}_k \text{ is a general } d \times d \text{ positive definite matrix, different for each } k$$
Covariance structure: $$\boldsymbol{\Sigma}_k = \begin{pmatrix} \sigma^2_{1k} & \rho_{12,k}\sigma_{1k}\sigma_{2k} & \cdots \\ \rho_{12,k}\sigma_{1k}\sigma_{2k} & \sigma^2_{2k} & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix}$$
Parameters per class: $\frac{d(d+1)}{2}$ (symmetric matrix)
Assumption: All classes share the same covariance matrix. $$\boldsymbol{\Sigma}_k = \boldsymbol{\Sigma} \text{ for all } k$$
Covariance structure: Same as QDA, but shared: $$\boldsymbol{\Sigma} = \begin{pmatrix} \sigma^2_{1} & \rho_{12}\sigma_{1}\sigma_{2} & \cdots \\ \rho_{12}\sigma_{1}\sigma_{2} & \sigma^2_{2} & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix}$$
Parameters (total): $\frac{d(d+1)}{2}$ (shared across classes)
Assumption: Features are conditionally independent given class. Each class has a diagonal covariance matrix. $$\boldsymbol{\Sigma}_k = \text{diag}(\sigma^2_{1k}, \sigma^2_{2k}, \ldots, \sigma^2_{dk})$$
Covariance structure: $$\boldsymbol{\Sigma}_k = \begin{pmatrix} \sigma^2_{1k} & 0 & \cdots & 0 \\ 0 & \sigma^2_{2k} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2_{dk} \end{pmatrix}$$
Parameters per class: $d$ (variances only, no covariances)
| Method | Covariance Structure | Params per Class | Total Params | Boundary |
|---|---|---|---|---|
| QDA | Full, class-specific | $\frac{d(d+1)}{2}$ | $K \cdot \frac{d(d+1)}{2} + Kd$ | Quadratic |
| LDA | Full, shared | $0$ (shared) | $\frac{d(d+1)}{2} + Kd$ | Linear |
| GNB | Diagonal, class-specific | $d$ | $2Kd$ | Quadratic (axis-aligned) |
| GNB (equal var) | Diagonal, shared variances | $d$ (shared) | $d + Kd$ | Linear |
In a multivariate Gaussian, zero covariance between features implies independence. A diagonal covariance matrix (all off-diagonal entries zero) means all feature pairs are independent. This is exactly the naive Bayes assumption expressed in linear algebra terms.
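This equivalence is easy to verify numerically: with a diagonal covariance, the joint Gaussian density factorizes into a product of univariate normal densities, which is exactly how naive Bayes computes likelihoods. The values below are illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Illustrative mean and diagonal covariance
mu = np.array([0.0, 2.0, -1.0])
variances = np.array([1.0, 0.5, 2.0])
x = np.array([0.3, 1.8, -0.5])

# Joint density with a diagonal covariance matrix...
joint = multivariate_normal(mu, np.diag(variances)).pdf(x)

# ...equals the product of per-feature 1-D normal densities (the naive Bayes form)
product = np.prod([norm(m, np.sqrt(v)).pdf(xi)
                   for xi, m, v in zip(x, mu, variances)])
assert np.isclose(joint, product)
```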
The covariance assumptions directly determine decision boundary shapes.
With class-specific covariance matrices, the discriminant function contains: $$-\frac{1}{2}\mathbf{x}^T \boldsymbol{\Sigma}_k^{-1} \mathbf{x} + \boldsymbol{\mu}_k^T \boldsymbol{\Sigma}_k^{-1} \mathbf{x} + \ldots$$
The $\mathbf{x}^T \boldsymbol{\Sigma}_k^{-1} \mathbf{x}$ term is quadratic in $\mathbf{x}$ and differs across classes.
$\Rightarrow$ Quadratic decision boundaries (ellipsoids, hyperboloids, paraboloids)
With shared covariance $\boldsymbol{\Sigma}$: $$\delta_k(\mathbf{x}) = -\frac{1}{2}\mathbf{x}^T \boldsymbol{\Sigma}^{-1} \mathbf{x} + \boldsymbol{\mu}_k^T \boldsymbol{\Sigma}^{-1} \mathbf{x} - \frac{1}{2}\boldsymbol{\mu}_k^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k + \log \pi_k$$
The quadratic term $-\frac{1}{2}\mathbf{x}^T \boldsymbol{\Sigma}^{-1} \mathbf{x}$ is the same for all classes.
When comparing $\delta_j(\mathbf{x})$ vs $\delta_k(\mathbf{x})$, the quadratic terms cancel: $$\delta_j(\mathbf{x}) - \delta_k(\mathbf{x}) = (\boldsymbol{\mu}_j - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}^{-1} \mathbf{x} + \text{constant}$$
This is linear in $\mathbf{x}$.
$\Rightarrow$ Linear decision boundaries (hyperplanes)
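The cancellation above can be checked numerically: with a shared covariance, $\delta_j(\mathbf{x}) - \delta_k(\mathbf{x})$ should equal $\mathbf{w}^T\mathbf{x} + b$ everywhere, with $\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_j - \boldsymbol{\mu}_k)$. The means, covariance, and priors below are illustrative:

```python
import numpy as np

# Illustrative shared covariance with correlation
Sigma = np.array([[1.0, 0.6],
                  [0.6, 2.0]])
Sigma_inv = np.linalg.inv(Sigma)
mu_j, mu_k = np.array([0.0, 0.0]), np.array([2.0, 1.0])
pi_j = pi_k = 0.5

def delta(x, mu, prior):
    """Discriminant with shared Sigma (class-independent terms dropped)."""
    diff = x - mu
    return -0.5 * diff @ Sigma_inv @ diff + np.log(prior)

# Predicted linear form: w @ x + b
w = (mu_j - mu_k) @ Sigma_inv
b = -0.5 * (mu_j @ Sigma_inv @ mu_j - mu_k @ Sigma_inv @ mu_k) + np.log(pi_j / pi_k)

# The quadratic terms cancel: the difference is linear at every test point
rng = np.random.default_rng(0)
for x in rng.normal(size=(5, 2)):
    assert np.isclose(delta(x, mu_j, pi_j) - delta(x, mu_k, pi_k), w @ x + b)
```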
Unequal variances across classes: the per-feature quadratic terms $-\frac{x_i^2}{2\sigma^2_{ik}}$ differ by class and do not cancel, so the boundaries are quadratic (axis-aligned, since there are no cross terms).

Equal variances across classes ($\sigma^2_{ik} = \sigma^2_i$ for all $k$): the quadratic terms are identical across classes and cancel, so the boundaries are linear.
```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis

def compare_classifier_boundaries():
    """Compare decision boundaries of GNB, LDA, and QDA."""
    print("=" * 60)
    print("GAUSSIAN CLASSIFIER FAMILY: BOUNDARY COMPARISON")
    print("=" * 60)

    # Generate dataset with correlated features
    np.random.seed(42)
    n = 200

    # Class 0: features are positively correlated
    cov_0 = np.array([[1.0, 0.8], [0.8, 1.0]])
    X_0 = np.random.multivariate_normal([0, 0], cov_0, n)

    # Class 1: different correlation structure
    cov_1 = np.array([[1.0, -0.5], [-0.5, 1.0]])
    X_1 = np.random.multivariate_normal([2, 2], cov_1, n)

    X = np.vstack([X_0, X_1])
    y = np.array([0] * n + [1] * n)

    # Shuffle
    idx = np.random.permutation(len(y))
    X, y = X[idx], y[idx]

    # Fit all three models
    gnb = GaussianNB().fit(X, y)
    lda = LinearDiscriminantAnalysis().fit(X, y)
    qda = QuadraticDiscriminantAnalysis().fit(X, y)

    print("--- Dataset Characteristics ---")
    print(f"True Class 0 covariance:\n{cov_0}")
    print(f"True Class 1 covariance:\n{cov_1}")
    print("Note: features are CORRELATED (off-diagonal covariance != 0)")

    print("--- Model Parameters ---")
    print("Gaussian Naive Bayes (assumes diagonal covariance):")
    print(f"  Class 0 variances: {gnb.var_[0].round(4)}")
    print(f"  Class 1 variances: {gnb.var_[1].round(4)}")
    print("  Ignores correlation! (diagonal assumption)")

    print("LDA (estimates shared full covariance):")
    print("  Uses pooled covariance across both classes; accounts for correlation")

    print("QDA (estimates separate full covariance per class):")
    print("  Can capture different correlation structures per class")

    # Compare predictions on a few probe points
    print("--- Prediction Comparison ---")
    test_points = np.array([
        [0, 0],  # near class 0 center
        [2, 2],  # near class 1 center
        [1, 1],  # between classes
        [1, 0],  # along x-axis
        [0, 1],  # along y-axis
    ])
    print(f"{'Point':>15} | {'GNB':>5} | {'LDA':>5} | {'QDA':>5}")
    print("-" * 45)
    for point in test_points:
        gnb_pred = gnb.predict([point])[0]
        lda_pred = lda.predict([point])[0]
        qda_pred = qda.predict([point])[0]
        print(f"({point[0]:4.1f}, {point[1]:4.1f}) | {gnb_pred:>5} | {lda_pred:>5} | {qda_pred:>5}")

    # Accuracy comparison
    print("--- Accuracy on Training Data ---")
    print(f"GNB: {gnb.score(X, y):.4f}")
    print(f"LDA: {lda.score(X, y):.4f}")
    print(f"QDA: {qda.score(X, y):.4f}")

    print("--- Interpretation ---")
    print("QDA best fits this data because classes have different")
    print("correlation structures. LDA's shared covariance assumption")
    print("slightly hurts. GNB's independence assumption ignores")
    print("correlations entirely, but may still perform well overall.")

compare_classifier_boundaries()
```

The methods differ dramatically in parameter count, which has profound implications for estimation quality and generalization.
For $K$ classes and $d$ features (counting means, covariance entries, and $K - 1$ free class priors):

QDA: $Kd$ means, $K \cdot \frac{d(d+1)}{2}$ covariance entries, $K - 1$ priors — quadratic growth in $d$, multiplied by $K$.

LDA: $Kd$ means, $\frac{d(d+1)}{2}$ shared covariance entries, $K - 1$ priors — quadratic growth in $d$, but shared across classes.

GNB: $Kd$ means, $Kd$ variances, $K - 1$ priors — linear growth in $d$.
Reliable parameter estimation requires more samples than parameters (rule of thumb: 5-10× as many).
For $d=100$, $K=2$: QDA requires 10,301 parameters, LDA 5,251, and GNB only 401. By the rule of thumb above, QDA would want on the order of 50,000–100,000 samples for reliable estimates.

For $d=1000$, $K=2$: QDA requires 1,003,001 parameters, LDA 501,501, and GNB only 4,001. Full-covariance estimation becomes impractical for most datasets at this scale.
GNB's linear parameter growth makes it feasible for high-dimensional problems where QDA and even LDA are impractical.
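These totals follow directly from the component counts above. A minimal sketch (the function name is my own) that reproduces the numbers for $K = 2$:

```python
def gaussian_family_params(d, K):
    """Total estimated parameters: means + covariance entries + (K-1) free priors."""
    full_cov = d * (d + 1) // 2          # entries in a symmetric d x d matrix
    return {
        "QDA": K * d + K * full_cov + (K - 1),  # per-class mean + per-class full cov
        "LDA": K * d + full_cov + (K - 1),      # per-class mean + one shared cov
        "GNB": K * d + K * d + (K - 1),         # per-class mean + per-class variances
    }

counts = gaussian_family_params(d=100, K=2)
assert counts["QDA"] == 10_301
assert counts["LDA"] == 5_251
assert counts["GNB"] == 401
```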
| Method | Formula ($K=2$, incl. 1 free prior) | d=10 | d=100 | d=1000 |
|---|---|---|---|---|
| QDA | $2\left(d + \frac{d(d+1)}{2}\right) + 1$ | 131 | 10,301 | 1,003,001 |
| LDA | $2d + \frac{d(d+1)}{2} + 1$ | 76 | 5,251 | 501,501 |
| GNB | $4d + 1$ | 41 | 401 | 4,001 |
In high dimensions, full covariance matrices become prohibitively expensive to estimate. A $1000 \times 1000$ covariance matrix has over 500,000 entries! Even with substantial data, the estimates are unreliable. This is why Naive Bayes, despite its strong independence assumption, often outperforms 'correct' models in high dimensions—it can be estimated reliably.
Choosing between GNB, LDA, and QDA depends on data characteristics, sample size, and computational constraints.
```text
Start
  |
[Is d > 100?]
   /        \
 Yes         No
  |           |
 GNB    [Is n > 10 × Kd²?]
             /        \
           Yes         No
            |           |
  [Different Σ_k?]   LDA or GNB
      /      \
    Yes       No
     |         |
    QDA       LDA
```
Rather than choosing between LDA, QDA, and GNB, we can interpolate between them using regularization. This provides a continuum of models with tunable bias-variance tradeoff.
Introduced by Friedman (1989), RDA defines a regularized covariance: $$\hat{\boldsymbol{\Sigma}}_k(\alpha, \gamma) = (1 - \gamma)\, \hat{\boldsymbol{\Sigma}}_k(\alpha) + \gamma\, \frac{\text{tr}\big(\hat{\boldsymbol{\Sigma}}_k(\alpha)\big)}{d} \mathbf{I}, \qquad \text{where } \hat{\boldsymbol{\Sigma}}_k(\alpha) = (1 - \alpha)\, \hat{\boldsymbol{\Sigma}}_k + \alpha\, \hat{\boldsymbol{\Sigma}}$$
Where:
- $\hat{\boldsymbol{\Sigma}}_k$ is the sample covariance of class $k$, and $\hat{\boldsymbol{\Sigma}}$ is the pooled (shared) covariance
- $\alpha \in [0, 1]$ shrinks the class covariances toward the pooled covariance (QDA $\to$ LDA)
- $\gamma \in [0, 1]$ shrinks the result toward a scaled identity (spherical) matrix
| $\alpha$ | $\gamma$ | Result |
|---|---|---|
| 0 | 0 | QDA |
| 1 | 0 | LDA |
| 0 | 1 | Spherical QDA (isotropic classes) |
| 1 | 1 | Nearest mean classifier |
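A minimal sketch of this formula (function name and covariance values are my own, chosen for illustration), checking the QDA and LDA corners from the table:

```python
import numpy as np

def rda_covariance(Sigma_k, Sigma_pooled, alpha, gamma):
    """Friedman-style RDA: blend the class covariance toward the pooled
    covariance, then shrink the result toward a scaled identity."""
    d = Sigma_k.shape[0]
    blended = (1 - alpha) * Sigma_k + alpha * Sigma_pooled
    return (1 - gamma) * blended + gamma * (np.trace(blended) / d) * np.eye(d)

# Illustrative class and pooled covariances
Sigma_k = np.array([[2.0, 0.5], [0.5, 1.0]])
Sigma_pooled = np.array([[1.5, 0.2], [0.2, 1.2]])

# Corner cases from the table
assert np.allclose(rda_covariance(Sigma_k, Sigma_pooled, 0, 0), Sigma_k)        # QDA
assert np.allclose(rda_covariance(Sigma_k, Sigma_pooled, 1, 0), Sigma_pooled)   # LDA
assert np.allclose(rda_covariance(Sigma_k, Sigma_pooled, 0, 1),
                   1.5 * np.eye(2))   # spherical: tr(Sigma_k)/d = 3/2 per class
```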
While RDA doesn't directly interpolate to GNB (which is diagonal, not spherical), a similar regularization can shrink toward diagonal: $$\hat{\boldsymbol{\Sigma}}_k(\lambda) = (1 - \lambda) \hat{\boldsymbol{\Sigma}}_k + \lambda \cdot \text{diag}(\hat{\boldsymbol{\Sigma}}_k)$$
At $\lambda = 1$, we recover the diagonal covariance of GNB.
```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def demonstrate_rda_continuum():
    """Show how shrinkage interpolates between LDA and more regularized models."""
    print("=" * 60)
    print("REGULARIZED DISCRIMINANT ANALYSIS CONTINUUM")
    print("=" * 60)

    # Generate data where regularization helps
    np.random.seed(42)
    n_per_class = 50   # small sample size
    d = 20             # moderate dimensionality

    # Correlated features: AR(1)-style covariance
    true_cov = np.eye(d)
    for i in range(d):
        for j in range(d):
            if i != j:
                true_cov[i, j] = 0.5 ** abs(i - j)

    X_0 = np.random.multivariate_normal(np.zeros(d), true_cov, n_per_class)
    X_1 = np.random.multivariate_normal(np.ones(d), true_cov, n_per_class)
    X = np.vstack([X_0, X_1])
    y = np.array([0] * n_per_class + [1] * n_per_class)

    print(f"Dataset: {X.shape[0]} samples, {d} features")
    print("This is a challenging setting: n is close to d")

    print("--- Effect of Shrinkage Regularization ---")
    print(f"{'Shrinkage':>12} | {'Accuracy':>16} | Interpretation")
    print("-" * 55)

    for shrinkage in [None, 0.0, 0.25, 0.5, 0.75, 1.0]:
        if shrinkage is None:
            lda = LinearDiscriminantAnalysis(solver='svd')
            label = "none (SVD)"
            interp = "Unregularized SVD solver"
        else:
            lda = LinearDiscriminantAnalysis(solver='lsqr', shrinkage=shrinkage)
            label = f"{shrinkage:.2f}"
            if shrinkage == 0:
                interp = "Full covariance (LDA)"
            elif shrinkage == 1:
                interp = "Scaled identity (spherical)"
            else:
                interp = "Intermediate"
        try:
            scores = cross_val_score(lda, X, y, cv=5, scoring='accuracy')
            acc = f"{scores.mean():.4f} ± {scores.std():.3f}"
        except Exception:
            acc = "Failed (singular)"
        print(f"{label:>12} | {acc:>16} | {interp}")

    print("--- Key Insight ---")
    print("With limited data (n close to d), shrinkage > 0 often improves")
    print("performance by stabilizing the covariance estimate. At shrinkage = 1,")
    print("the estimate collapses to a scaled identity (spherical) matrix.")

demonstrate_rda_continuum()
```

When uncertain whether to use GNB, LDA, or QDA, try LDA with shrinkage and cross-validate over the shrinkage parameter. This lets the data decide the appropriate level of regularization. sklearn's LDA with `shrinkage='auto'` uses the Ledoit-Wolf estimator to pick the shrinkage intensity.
Let us formally summarize how GNB, LDA, and QDA relate mathematically.
Most General → Most Restrictive:
$$\text{QDA} \supset \text{LDA} \supset \text{GNB (equal var)} \quad \text{and} \quad \text{QDA} \supset \text{GNB}$$
Where $A \supset B$ means that $B$'s assumptions are a special case of $A$'s: every distribution $B$ can represent, $A$ can represent as well.
GNB is QDA with diagonal covariance: The naive assumption is a special case of quadratic discriminant analysis where off-diagonal entries are zero.
GNB with equal variances is equivalent to LDA with diagonal covariance: both produce linear boundaries, but LDA's full covariance lets the boundary orient along feature correlations, while GNB's diagonal covariance only rescales each axis independently.
LDA is QDA with shared covariance: The constraint that $\boldsymbol{\Sigma}_k = \boldsymbol{\Sigma}$ leads to linear boundaries.
If the true class-conditional distributions satisfy:
- a diagonal covariance matrix (features conditionally independent given class), and
- the same covariance matrix in every class,

then GNB (equal variance), LDA, and QDA all reduce to the same discriminant function (though their finite-sample parameter estimates may still differ slightly).
| Method | Bias | Variance | When Best |
|---|---|---|---|
| QDA | Low | High | Large n, genuinely different Σₖ |
| LDA | Medium | Medium | Moderate n, similar Σₖ |
| GNB | High | Low | Small n, high d, approximate independence |
Congratulations! You have completed the Gaussian Naive Bayes module. You now understand: (1) how to model continuous features with Gaussian distributions, (2) parameter estimation via maximum likelihood, (3) the geometry of decision boundaries, and (4) the deep connections to LDA and QDA. This knowledge positions you to choose appropriately among generative classifiers and understand their theoretical foundations.