Gaussian Naive Bayes is not an isolated technique—it belongs to a rich family of classifiers that model class-conditional distributions as Gaussians. Understanding this family reveals fundamental insights about classifier design, the bias-variance tradeoff, and when different assumptions are appropriate.
The family includes:
- **Gaussian Naive Bayes (GNB)**: a diagonal covariance matrix per class
- **Linear Discriminant Analysis (LDA)**: one full covariance matrix shared by all classes
- **Quadratic Discriminant Analysis (QDA)**: a full covariance matrix per class
These methods differ in what they assume about the covariance structure of the class-conditional distributions, and their relationships illuminate deep principles of classifier design.
This page explores these connections mathematically, revealing the unified theory underlying Gaussian generative classifiers.
By the end of this page, you will understand: (1) the covariance assumptions of GNB, LDA, and QDA, (2) how these assumptions affect decision boundaries, (3) the mathematical relationship between these methods, (4) parameter count comparison and sample complexity, (5) when to use each method, and (6) regularization as a continuum between methods.
All three methods share the same fundamental approach: model class-conditional distributions as multivariate Gaussians, then apply Bayes' theorem for classification.
The $d$-dimensional multivariate Gaussian distribution is:
$$f(\mathbf{x} | \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$
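As a sanity check on this formula, the sketch below evaluates each term of the density by hand (for illustrative values of $\boldsymbol{\mu}$, $\boldsymbol{\Sigma}$, and $\mathbf{x}$ not taken from the text) and compares against `scipy.stats.multivariate_normal`:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative 2-D Gaussian (values chosen for the example, not from the text)
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
x = np.array([1.0, 0.5])

d = len(mu)
diff = x - mu

# The density formula, term by term
quad = diff @ np.linalg.inv(Sigma) @ diff            # squared Mahalanobis distance
norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
pdf_manual = np.exp(-0.5 * quad) / norm_const

# Cross-check against scipy's implementation
pdf_scipy = multivariate_normal(mu, Sigma).pdf(x)
assert np.isclose(pdf_manual, pdf_scipy)
```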
Where:
- $\boldsymbol{\mu} \in \mathbb{R}^d$ is the mean vector
- $\boldsymbol{\Sigma}$ is the $d \times d$ covariance matrix (symmetric, positive definite)
- $|\boldsymbol{\Sigma}|$ is the determinant of $\boldsymbol{\Sigma}$, and $\boldsymbol{\Sigma}^{-1}$ is its inverse
For class $k$, we model: $$\mathbf{x} | y = k \sim \mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
With class prior $\pi_k = P(y = k)$.
Classification uses Bayes' rule: $$P(y = k | \mathbf{x}) = \frac{f(\mathbf{x} | y = k) \pi_k}{\sum_j f(\mathbf{x} | y = j) \pi_j}$$
We classify by comparing discriminant functions: $$\delta_k(\mathbf{x}) = \log f(\mathbf{x} | y = k) + \log \pi_k$$
$$= -\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k) - \frac{1}{2}\log|\boldsymbol{\Sigma}_k| - \frac{d}{2}\log(2\pi) + \log \pi_k$$
The classification rule is: $$\hat{y} = \arg\max_k \delta_k(\mathbf{x})$$
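This shared pipeline can be sketched directly from the discriminant formula. The sketch below uses illustrative means, covariances, and priors (not values from the text); all three methods differ only in what they plug in for each $\boldsymbol{\Sigma}_k$:

```python
import numpy as np

def discriminant(x, mu, Sigma, prior):
    """delta_k(x) = log f(x | y=k) + log pi_k, including the constant term."""
    d = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(Sigma) @ diff
            - 0.5 * np.log(np.linalg.det(Sigma))
            - 0.5 * d * np.log(2 * np.pi)
            + np.log(prior))

# Toy two-class problem (illustrative parameters)
mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigmas = [np.eye(2), np.eye(2)]
priors = [0.5, 0.5]

x = np.array([1.6, 1.8])
scores = [discriminant(x, m, S, p) for m, S, p in zip(mus, Sigmas, priors)]
y_hat = int(np.argmax(scores))   # arg max over delta_k; this point is near (2, 2)
```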
The key question: What structure do we impose on the covariance matrices $\boldsymbol{\Sigma}_k$?
The term $(\mathbf{x} - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)$ is the squared Mahalanobis distance from $\mathbf{x}$ to class $k$'s mean. It measures distance in units of the class's covariance structure—a point one standard deviation away in the direction of highest variance is 'closer' than one standard deviation in a low-variance direction.
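The variance-scaling effect described above can be made concrete. With an illustrative diagonal covariance (variance 4 along one axis, 0.25 along the other), two points at equal Euclidean distance from the mean get very different Mahalanobis distances:

```python
import numpy as np

# Illustrative class covariance: high variance along x1, low along x2
Sigma = np.diag([4.0, 0.25])
Sigma_inv = np.linalg.inv(Sigma)
mu = np.zeros(2)

def mahalanobis_sq(x):
    diff = x - mu
    return diff @ Sigma_inv @ diff

# Both points are Euclidean distance 1 from the mean...
d_high_var = mahalanobis_sq(np.array([1.0, 0.0]))  # along the high-variance axis
d_low_var  = mahalanobis_sq(np.array([0.0, 1.0]))  # along the low-variance axis

# ...but the high-variance direction counts as "closer"
assert d_high_var == 0.25   # 1^2 / 4
assert d_low_var == 4.0     # 1^2 / 0.25
```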
The three methods differ in their covariance structure assumptions:
Assumption: Each class has its own unrestricted covariance matrix. $$\boldsymbol{\Sigma}_k \text{ is a general } d \times d \text{ positive definite matrix, different for each } k$$
Covariance structure: $$\boldsymbol{\Sigma}_k = \begin{pmatrix} \sigma^2_{1k} & \rho_{12,k}\sigma_{1k}\sigma_{2k} & \cdots \\ \rho_{12,k}\sigma_{1k}\sigma_{2k} & \sigma^2_{2k} & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix}$$
Parameters per class: $\frac{d(d+1)}{2}$ (symmetric matrix)
Assumption: All classes share the same covariance matrix. $$\boldsymbol{\Sigma}_k = \boldsymbol{\Sigma} \text{ for all } k$$
Covariance structure: Same as QDA, but shared: $$\boldsymbol{\Sigma} = \begin{pmatrix} \sigma^2_{1} & \rho_{12}\sigma_{1}\sigma_{2} & \cdots \\ \rho_{12}\sigma_{1}\sigma_{2} & \sigma^2_{2} & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix}$$
Parameters (total): $\frac{d(d+1)}{2}$ (shared across classes)
Assumption: Features are conditionally independent given class. Each class has a diagonal covariance matrix. $$\boldsymbol{\Sigma}_k = \text{diag}(\sigma^2_{1k}, \sigma^2_{2k}, \ldots, \sigma^2_{dk})$$
Covariance structure: $$\boldsymbol{\Sigma}_k = \begin{pmatrix} \sigma^2_{1k} & 0 & \cdots & 0 \\ 0 & \sigma^2_{2k} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2_{dk} \end{pmatrix}$$
Parameters per class: $d$ (variances only, no covariances)
| Method | Covariance Structure | Params per Class | Total Params | Boundary |
|---|---|---|---|---|
| QDA | Full, class-specific | $\frac{d(d+1)}{2}$ | $K \cdot \frac{d(d+1)}{2} + Kd$ | Quadratic |
| LDA | Full, shared | $0$ (shared) | $\frac{d(d+1)}{2} + Kd$ | Linear |
| GNB | Diagonal, class-specific | $d$ | $2Kd$ | Quadratic (axis-aligned) |
| GNB (equal var) | Diagonal, shared variances | $d$ (shared) | $d + Kd$ | Linear |
In a multivariate Gaussian, zero covariance between features implies independence. A diagonal covariance matrix (all off-diagonal entries zero) means all feature pairs are independent. This is exactly the naive Bayes assumption expressed in linear algebra terms.
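This equivalence is easy to verify numerically: with a diagonal covariance, the joint Gaussian density factorizes into a product of univariate normal densities, which is exactly how naive Bayes computes likelihoods. The values below are illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Illustrative mean and diagonal covariance
mu = np.array([0.0, 2.0, -1.0])
variances = np.array([1.0, 0.5, 2.0])
x = np.array([0.3, 1.8, -0.5])

# Joint density with a diagonal covariance matrix...
joint = multivariate_normal(mu, np.diag(variances)).pdf(x)

# ...equals the product of per-feature 1-D normal densities (the naive Bayes form)
product = np.prod([norm(m, np.sqrt(v)).pdf(xi)
                   for xi, m, v in zip(x, mu, variances)])
assert np.isclose(joint, product)
```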
The covariance assumptions directly determine decision boundary shapes.
With class-specific covariance matrices, the discriminant function contains: $$-\frac{1}{2}\mathbf{x}^T \boldsymbol{\Sigma}_k^{-1} \mathbf{x} + \boldsymbol{\mu}_k^T \boldsymbol{\Sigma}_k^{-1} \mathbf{x} + \ldots$$
The $\mathbf{x}^T \boldsymbol{\Sigma}_k^{-1} \mathbf{x}$ term is quadratic in $\mathbf{x}$ and differs across classes.
$\Rightarrow$ Quadratic decision boundaries (ellipsoids, hyperboloids, paraboloids)
With shared covariance $\boldsymbol{\Sigma}$: $$\delta_k(\mathbf{x}) = -\frac{1}{2}\mathbf{x}^T \boldsymbol{\Sigma}^{-1} \mathbf{x} + \boldsymbol{\mu}_k^T \boldsymbol{\Sigma}^{-1} \mathbf{x} - \frac{1}{2}\boldsymbol{\mu}_k^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k + \log \pi_k$$
The quadratic term $-\frac{1}{2}\mathbf{x}^T \boldsymbol{\Sigma}^{-1} \mathbf{x}$ is the same for all classes.
When comparing $\delta_j(\mathbf{x})$ vs $\delta_k(\mathbf{x})$, the quadratic terms cancel: $$\delta_j(\mathbf{x}) - \delta_k(\mathbf{x}) = (\boldsymbol{\mu}_j - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}^{-1} \mathbf{x} + \text{constant}$$
This is linear in $\mathbf{x}$.
$\Rightarrow$ Linear decision boundaries (hyperplanes)
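The cancellation above can be checked numerically: with a shared covariance, $\delta_j(\mathbf{x}) - \delta_k(\mathbf{x})$ should equal $\mathbf{w}^T\mathbf{x} + b$ everywhere, with $\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_j - \boldsymbol{\mu}_k)$. The means, covariance, and priors below are illustrative:

```python
import numpy as np

# Illustrative shared covariance with correlation
Sigma = np.array([[1.0, 0.6],
                  [0.6, 2.0]])
Sigma_inv = np.linalg.inv(Sigma)
mu_j, mu_k = np.array([0.0, 0.0]), np.array([2.0, 1.0])
pi_j = pi_k = 0.5

def delta(x, mu, prior):
    """Discriminant with shared Sigma (class-independent terms dropped)."""
    diff = x - mu
    return -0.5 * diff @ Sigma_inv @ diff + np.log(prior)

# Predicted linear form: w @ x + b
w = (mu_j - mu_k) @ Sigma_inv
b = -0.5 * (mu_j @ Sigma_inv @ mu_j - mu_k @ Sigma_inv @ mu_k) + np.log(pi_j / pi_k)

# The quadratic terms cancel: the difference is linear at every test point
rng = np.random.default_rng(0)
for x in rng.normal(size=(5, 2)):
    assert np.isclose(delta(x, mu_j, pi_j) - delta(x, mu_k, pi_k), w @ x + b)
```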
Unequal variances across classes: the per-feature quadratic terms $-\frac{x_i^2}{2\sigma^2_{ik}}$ differ by class and do not cancel, so the boundaries are quadratic (axis-aligned, since there are no cross terms).

Equal variances across classes ($\sigma^2_{ik} = \sigma^2_i$ for all $k$): the quadratic terms are identical across classes and cancel, so the boundaries are linear.
```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis

def compare_classifier_boundaries():
    """Compare decision boundaries of GNB, LDA, and QDA."""
    print("=" * 60)
    print("GAUSSIAN CLASSIFIER FAMILY: BOUNDARY COMPARISON")
    print("=" * 60)

    # Generate dataset with correlated features
    np.random.seed(42)
    n = 200

    # Class 0: features are positively correlated
    cov_0 = np.array([[1.0, 0.8], [0.8, 1.0]])
    X_0 = np.random.multivariate_normal([0, 0], cov_0, n)

    # Class 1: different correlation structure
    cov_1 = np.array([[1.0, -0.5], [-0.5, 1.0]])
    X_1 = np.random.multivariate_normal([2, 2], cov_1, n)

    X = np.vstack([X_0, X_1])
    y = np.array([0] * n + [1] * n)

    # Shuffle
    idx = np.random.permutation(len(y))
    X, y = X[idx], y[idx]

    # Fit all three models
    gnb = GaussianNB().fit(X, y)
    lda = LinearDiscriminantAnalysis().fit(X, y)
    qda = QuadraticDiscriminantAnalysis().fit(X, y)

    print("--- Dataset Characteristics ---")
    print(f"True Class 0 covariance:\n{cov_0}")
    print(f"True Class 1 covariance:\n{cov_1}")
    print("Note: features are CORRELATED (off-diagonal covariance != 0)")

    print("--- Model Parameters ---")
    print("Gaussian Naive Bayes (assumes diagonal covariance):")
    print(f"  Class 0 variances: {gnb.var_[0].round(4)}")
    print(f"  Class 1 variances: {gnb.var_[1].round(4)}")
    print("  Ignores correlation! (diagonal assumption)")

    print("LDA (estimates shared full covariance):")
    print("  Uses pooled covariance across both classes; accounts for correlation")

    print("QDA (estimates separate full covariance per class):")
    print("  Can capture different correlation structures per class")

    # Compare predictions on a few probe points
    print("--- Prediction Comparison ---")
    test_points = np.array([
        [0, 0],  # near class 0 center
        [2, 2],  # near class 1 center
        [1, 1],  # between classes
        [1, 0],  # along x-axis
        [0, 1],  # along y-axis
    ])
    print(f"{'Point':>15} | {'GNB':>5} | {'LDA':>5} | {'QDA':>5}")
    print("-" * 45)
    for point in test_points:
        gnb_pred = gnb.predict([point])[0]
        lda_pred = lda.predict([point])[0]
        qda_pred = qda.predict([point])[0]
        print(f"({point[0]:4.1f}, {point[1]:4.1f}) | {gnb_pred:>5} | {lda_pred:>5} | {qda_pred:>5}")

    # Accuracy comparison
    print("--- Accuracy on Training Data ---")
    print(f"GNB: {gnb.score(X, y):.4f}")
    print(f"LDA: {lda.score(X, y):.4f}")
    print(f"QDA: {qda.score(X, y):.4f}")

    print("--- Interpretation ---")
    print("QDA best fits this data because classes have different")
    print("correlation structures. LDA's shared covariance assumption")
    print("slightly hurts. GNB's independence assumption ignores")
    print("correlations entirely, but may still perform well overall.")

compare_classifier_boundaries()
```

The methods differ dramatically in parameter count, which has profound implications for estimation quality and generalization.
For $K$ classes and $d$ features (counting means, covariance entries, and $K - 1$ free class priors):

QDA: $Kd$ means, $K \cdot \frac{d(d+1)}{2}$ covariance entries, $K - 1$ priors — quadratic growth in $d$, multiplied by $K$.

LDA: $Kd$ means, $\frac{d(d+1)}{2}$ shared covariance entries, $K - 1$ priors — quadratic growth in $d$, but shared across classes.

GNB: $Kd$ means, $Kd$ variances, $K - 1$ priors — linear growth in $d$.
Reliable parameter estimation requires more samples than parameters (rule of thumb: 5-10× as many).
For $d=100$, $K=2$: QDA requires 10,301 parameters, LDA 5,251, and GNB only 401. By the rule of thumb above, QDA would want on the order of 50,000–100,000 samples for reliable estimates.

For $d=1000$, $K=2$: QDA requires 1,003,001 parameters, LDA 501,501, and GNB only 4,001. Full-covariance estimation becomes impractical for most datasets at this scale.
GNB's linear parameter growth makes it feasible for high-dimensional problems where QDA and even LDA are impractical.
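These totals follow directly from the component counts above. A minimal sketch (the function name is my own) that reproduces the numbers for $K = 2$:

```python
def gaussian_family_params(d, K):
    """Total estimated parameters: means + covariance entries + (K-1) free priors."""
    full_cov = d * (d + 1) // 2          # entries in a symmetric d x d matrix
    return {
        "QDA": K * d + K * full_cov + (K - 1),  # per-class mean + per-class full cov
        "LDA": K * d + full_cov + (K - 1),      # per-class mean + one shared cov
        "GNB": K * d + K * d + (K - 1),         # per-class mean + per-class variances
    }

counts = gaussian_family_params(d=100, K=2)
assert counts["QDA"] == 10_301
assert counts["LDA"] == 5_251
assert counts["GNB"] == 401
```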
| Method | Formula ($K=2$, incl. 1 free prior) | d=10 | d=100 | d=1000 |
|---|---|---|---|---|
| QDA | $2\left(d + \frac{d(d+1)}{2}\right) + 1$ | 131 | 10,301 | 1,003,001 |
| LDA | $2d + \frac{d(d+1)}{2} + 1$ | 76 | 5,251 | 501,501 |
| GNB | $4d + 1$ | 41 | 401 | 4,001 |
In high dimensions, full covariance matrices become prohibitively expensive to estimate. A $1000 \times 1000$ covariance matrix has over 500,000 entries! Even with substantial data, the estimates are unreliable. This is why Naive Bayes, despite its strong independence assumption, often outperforms 'correct' models in high dimensions—it can be estimated reliably.
Choosing between GNB, LDA, and QDA depends on data characteristics, sample size, and computational constraints.
```text
Start
  |
[Is d > 100?]
   /        \
 Yes         No
  |           |
 GNB    [Is n > 10 × Kd²?]
             /        \
           Yes         No
            |           |
  [Different Σ_k?]   LDA or GNB
      /      \
    Yes       No
     |         |
    QDA       LDA
```
Rather than choosing between LDA, QDA, and GNB, we can interpolate between them using regularization. This provides a continuum of models with tunable bias-variance tradeoff.
Introduced by Friedman (1989), RDA defines a regularized covariance: $$\hat{\boldsymbol{\Sigma}}_k(\alpha, \gamma) = (1 - \gamma)\, \hat{\boldsymbol{\Sigma}}_k(\alpha) + \gamma\, \frac{\text{tr}\big(\hat{\boldsymbol{\Sigma}}_k(\alpha)\big)}{d} \mathbf{I}, \qquad \text{where } \hat{\boldsymbol{\Sigma}}_k(\alpha) = (1 - \alpha)\, \hat{\boldsymbol{\Sigma}}_k + \alpha\, \hat{\boldsymbol{\Sigma}}$$
Where:
- $\hat{\boldsymbol{\Sigma}}_k$ is the sample covariance of class $k$, and $\hat{\boldsymbol{\Sigma}}$ is the pooled (shared) covariance
- $\alpha \in [0, 1]$ shrinks the class covariances toward the pooled covariance (QDA $\to$ LDA)
- $\gamma \in [0, 1]$ shrinks the result toward a scaled identity (spherical) matrix
| $\alpha$ | $\gamma$ | Result |
|---|---|---|
| 0 | 0 | QDA |
| 1 | 0 | LDA |
| 0 | 1 | Spherical QDA (isotropic classes) |
| 1 | 1 | Nearest mean classifier |
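A minimal sketch of this formula (function name and covariance values are my own, chosen for illustration), checking the QDA and LDA corners from the table:

```python
import numpy as np

def rda_covariance(Sigma_k, Sigma_pooled, alpha, gamma):
    """Friedman-style RDA: blend the class covariance toward the pooled
    covariance, then shrink the result toward a scaled identity."""
    d = Sigma_k.shape[0]
    blended = (1 - alpha) * Sigma_k + alpha * Sigma_pooled
    return (1 - gamma) * blended + gamma * (np.trace(blended) / d) * np.eye(d)

# Illustrative class and pooled covariances
Sigma_k = np.array([[2.0, 0.5], [0.5, 1.0]])
Sigma_pooled = np.array([[1.5, 0.2], [0.2, 1.2]])

# Corner cases from the table
assert np.allclose(rda_covariance(Sigma_k, Sigma_pooled, 0, 0), Sigma_k)        # QDA
assert np.allclose(rda_covariance(Sigma_k, Sigma_pooled, 1, 0), Sigma_pooled)   # LDA
assert np.allclose(rda_covariance(Sigma_k, Sigma_pooled, 0, 1),
                   1.5 * np.eye(2))   # spherical: tr(Sigma_k)/d = 3/2 per class
```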
While RDA doesn't directly interpolate to GNB (which is diagonal, not spherical), a similar regularization can shrink toward diagonal: $$\hat{\boldsymbol{\Sigma}}_k(\lambda) = (1 - \lambda) \hat{\boldsymbol{\Sigma}}_k + \lambda \cdot \text{diag}(\hat{\boldsymbol{\Sigma}}_k)$$
At $\lambda = 1$, we recover the diagonal covariance of GNB.
```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def demonstrate_rda_continuum():
    """Show how shrinkage interpolates between LDA and more regularized models."""
    print("=" * 60)
    print("REGULARIZED DISCRIMINANT ANALYSIS CONTINUUM")
    print("=" * 60)

    # Generate data where regularization helps
    np.random.seed(42)
    n_per_class = 50   # small sample size
    d = 20             # moderate dimensionality

    # Correlated features: AR(1)-style covariance
    true_cov = np.eye(d)
    for i in range(d):
        for j in range(d):
            if i != j:
                true_cov[i, j] = 0.5 ** abs(i - j)

    X_0 = np.random.multivariate_normal(np.zeros(d), true_cov, n_per_class)
    X_1 = np.random.multivariate_normal(np.ones(d), true_cov, n_per_class)
    X = np.vstack([X_0, X_1])
    y = np.array([0] * n_per_class + [1] * n_per_class)

    print(f"Dataset: {X.shape[0]} samples, {d} features")
    print("This is a challenging setting: n is close to d")

    print("--- Effect of Shrinkage Regularization ---")
    print(f"{'Shrinkage':>12} | {'Accuracy':>16} | Interpretation")
    print("-" * 55)

    for shrinkage in [None, 0.0, 0.25, 0.5, 0.75, 1.0]:
        if shrinkage is None:
            lda = LinearDiscriminantAnalysis(solver='svd')
            label = "none (SVD)"
            interp = "Unregularized SVD solver"
        else:
            lda = LinearDiscriminantAnalysis(solver='lsqr', shrinkage=shrinkage)
            label = f"{shrinkage:.2f}"
            if shrinkage == 0:
                interp = "Full covariance (LDA)"
            elif shrinkage == 1:
                interp = "Scaled identity (spherical)"
            else:
                interp = "Intermediate"
        try:
            scores = cross_val_score(lda, X, y, cv=5, scoring='accuracy')
            acc = f"{scores.mean():.4f} ± {scores.std():.3f}"
        except Exception:
            acc = "Failed (singular)"
        print(f"{label:>12} | {acc:>16} | {interp}")

    print("--- Key Insight ---")
    print("With limited data (n close to d), shrinkage > 0 often improves")
    print("performance by stabilizing the covariance estimate. At shrinkage = 1,")
    print("the estimate collapses to a scaled identity (spherical) matrix.")

demonstrate_rda_continuum()
```

When uncertain whether to use GNB, LDA, or QDA, try LDA with shrinkage and cross-validate over the shrinkage parameter. This lets the data decide the appropriate level of regularization. sklearn's LDA with `shrinkage='auto'` uses the Ledoit-Wolf estimator to pick the shrinkage intensity.
Let us formally summarize how GNB, LDA, and QDA relate mathematically.
Most General → Most Restrictive:
$$\text{QDA} \supset \text{LDA} \supset \text{GNB (equal var)} \quad \text{and} \quad \text{QDA} \supset \text{GNB}$$
Where $A \supset B$ means that $B$'s assumptions are a special case of $A$'s: every distribution $B$ can represent, $A$ can represent as well.
GNB is QDA with diagonal covariance: The naive assumption is a special case of quadratic discriminant analysis where off-diagonal entries are zero.
GNB with equal variances is equivalent to LDA with diagonal covariance: both produce linear boundaries, but LDA's full covariance lets the boundary orient along feature correlations, while GNB's diagonal covariance only rescales each axis independently.
LDA is QDA with shared covariance: The constraint that $\boldsymbol{\Sigma}_k = \boldsymbol{\Sigma}$ leads to linear boundaries.
If the true class-conditional distributions satisfy:
- a diagonal covariance matrix (features conditionally independent given class), and
- the same covariance matrix in every class,

then GNB (equal variance), LDA, and QDA all reduce to the same discriminant function (though their finite-sample parameter estimates may still differ slightly).
| Method | Bias | Variance | When Best |
|---|---|---|---|
| QDA | Low | High | Large n, genuinely different Σₖ |
| LDA | Medium | Medium | Moderate n, similar Σₖ |
| GNB | High | Low | Small n, high d, approximate independence |
Congratulations! You have completed the Gaussian Naive Bayes module. You now understand: (1) how to model continuous features with Gaussian distributions, (2) parameter estimation via maximum likelihood, (3) the geometry of decision boundaries, and (4) the deep connections to LDA and QDA. This knowledge positions you to choose appropriately among generative classifiers and understand their theoretical foundations.