In our exploration of Naive Bayes, we have encountered powerful techniques for discrete features: Multinomial Naive Bayes excels at text classification by modeling word frequencies, while Bernoulli Naive Bayes handles binary feature vectors elegantly. But what happens when our features are continuous—real-valued measurements like height, temperature, stock prices, or pixel intensities?
Consider a medical diagnosis system that must classify patients based on blood pressure (120.5 mmHg), body temperature (37.2°C), and cholesterol level (195 mg/dL). These are not word counts or binary flags—they are measurements from a continuous spectrum. How do we compute $P(\text{blood pressure} = 120.5 | \text{disease})$? The probability of observing any exact real number is technically zero under continuous distributions.
This fundamental challenge motivates Gaussian Naive Bayes, where we model each continuous feature as following a Gaussian (normal) distribution within each class. This elegant assumption transforms an impossible probability computation into a tractable density estimation problem.
By the end of this page, you will understand: (1) why discrete Naive Bayes methods fail for continuous features, (2) the distinction between probability mass functions and probability density functions, (3) how to use probability densities in Bayes' theorem, (4) the concept of likelihood versus probability, and (5) why Gaussian distributions are a natural choice for continuous features.
Let us first understand precisely why the techniques we developed for Multinomial and Bernoulli Naive Bayes cannot be directly applied to continuous features.
For discrete features (e.g., word counts in text), we estimate class-conditional probabilities by counting:
$$P(x_j = v | y = k) = \frac{\text{count of samples in class } k \text{ where feature } j = v}{\text{total samples in class } k}$$
This works because each feature takes values from a small, finite set, every possible value has nonzero probability, and the relative frequencies in the training data converge to the true class-conditional probabilities as the sample grows.
For continuous features, these assumptions break down:
Problem 1: Infinite possible values
A feature like height can take any value in a continuous range (e.g., 150.00000... to 200.00000... cm). There are uncountably infinite possible values, not a finite set.
Problem 2: Zero probability of exact values
In a continuous distribution, the probability of observing any exact value is zero: $$P(X = 175.3247281...) = 0$$
This is because probability mass must be distributed over infinitely many possible values.
Problem 3: Counting doesn't work
If we try to estimate $P(\text{height} = 175.0 | \text{male})$ by counting, we might find that out of 1000 male samples, exactly zero have height precisely equal to 175.0 (maybe 174.98 or 175.02, but not exactly 175.0). Our probability estimate would be zero, which is both mathematically correct and practically useless.
A common but flawed approach is to discretize continuous features into bins (e.g., height < 160, 160-170, 170-180, > 180). While this can work in some cases, it introduces arbitrary boundaries, loses information, and can dramatically affect classifier performance. The choice of bin edges becomes a hyperparameter that is difficult to optimize and often dataset-specific.
```python
import numpy as np
import matplotlib.pyplot as plt

# Generate continuous data for two classes
np.random.seed(42)
n_samples = 500

# Class 0: lower values
class_0 = np.random.normal(loc=45, scale=8, size=n_samples)
# Class 1: higher values
class_1 = np.random.normal(loc=55, scale=8, size=n_samples)

# Demonstrate the discretization problem
def discretize_and_classify(data, labels, n_bins):
    """
    Show how discretization loses information and creates arbitrary boundaries.
    """
    # Combine data
    all_data = np.concatenate([data[labels == 0], data[labels == 1]])
    all_labels = labels

    # Create bin edges
    bin_edges = np.linspace(all_data.min(), all_data.max(), n_bins + 1)

    # Assign to bins
    binned_data = np.digitize(all_data, bin_edges[:-1]) - 1

    # Count per class per bin
    class_0_counts = np.zeros(n_bins)
    class_1_counts = np.zeros(n_bins)
    for bin_idx in range(n_bins):
        class_0_counts[bin_idx] = np.sum((binned_data == bin_idx) & (all_labels == 0))
        class_1_counts[bin_idx] = np.sum((binned_data == bin_idx) & (all_labels == 1))

    return bin_edges, class_0_counts, class_1_counts

# Compare different numbers of bins
all_data = np.concatenate([class_0, class_1])
labels = np.concatenate([np.zeros(n_samples), np.ones(n_samples)])

print("=" * 60)
print("DISCRETIZATION SENSITIVITY ANALYSIS")
print("=" * 60)

for n_bins in [5, 10, 20, 50]:
    edges, c0, c1 = discretize_and_classify(all_data, labels, n_bins)

    # Find the bins where the decision changes (class 0 > class 1 vs class 1 > class 0)
    decision_changes = []
    for i in range(n_bins - 1):
        ratio_current = c0[i] / max(c1[i], 1)
        ratio_next = c0[i + 1] / max(c1[i + 1], 1)
        if (ratio_current > 1) != (ratio_next > 1):
            decision_changes.append(edges[i + 1])

    print(f"\nBins: {n_bins}")
    print(f"  Decision boundary locations: {decision_changes}")
    print(f"  Number of boundary shifts: {len(decision_changes)}")

    # Check for empty bins (problematic for probability estimation)
    empty_c0 = np.sum(c0 == 0)
    empty_c1 = np.sum(c1 == 0)
    print(f"  Empty bins (class 0): {empty_c0}")
    print(f"  Empty bins (class 1): {empty_c1}")

print("\n" + "=" * 60)
print("KEY INSIGHT: The decision boundary MOVES based on arbitrary")
print("bin count choice. This is not a property of the data!")
print("=" * 60)
```

The mathematical tool for handling continuous random variables is the probability density function (PDF), which fundamentally differs from the probability mass functions (PMFs) used for discrete variables.
Probability Mass Function (PMF) — for discrete variables: $p(x) = P(X = x)$ assigns a probability directly to each possible value, and these probabilities sum to 1 over the support.

Probability Density Function (PDF) — for continuous variables: the density $f(x)$ does not give probabilities directly. Instead, probabilities are obtained by integrating the density over an interval:
$$P(a \leq X \leq b) = \int_a^b f(x) dx$$
The density $f(x)$ indicates the relative likelihood of observing values near $x$. If $f(x_1) = 2 \cdot f(x_2)$, then observing a value near $x_1$ is twice as likely as observing a value near $x_2$.
Crucial insight: While $P(X = x) = 0$ for any specific $x$, the density $f(x)$ can be large or small, telling us where values are more or less likely to concentrate.
$$f(x) = \lim_{\epsilon \to 0} \frac{P(x - \epsilon < X < x + \epsilon)}{2\epsilon}$$
The density is the limit of probability per unit length as the interval shrinks to zero.
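To make this limit concrete, here is a quick numerical check (a minimal sketch; the standard normal distribution and the evaluation point $x = 1$ are chosen only for illustration): as the interval shrinks, probability divided by interval width converges to the density.

```python
from scipy import stats

# Standard normal; approximate the density at x = 1.0 via the limit definition
dist = stats.norm(loc=0, scale=1)
x = 1.0

for eps in [0.5, 0.1, 0.01, 0.001]:
    interval_prob = dist.cdf(x + eps) - dist.cdf(x - eps)  # P(x-eps < X < x+eps)
    print(f"eps={eps:<6} P/(2*eps) = {interval_prob / (2 * eps):.6f}")

print(f"Exact density f({x}) = {dist.pdf(x):.6f}")  # the limit target, about 0.241971
```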
| Property | PMF (Discrete) | PDF (Continuous) |
|---|---|---|
| Output interpretation | $p(x) = P(X = x)$ | $f(x) = $ density at $x$ |
| Range of output | $0 \leq p(x) \leq 1$ | $f(x) \geq 0$, can exceed 1 |
| Normalization | $\sum_x p(x) = 1$ | $\int f(x) dx = 1$ |
| Point probability | $P(X = x) = p(x) > 0$ possible | $P(X = x) = 0$ always |
| Interval probability | Sum over integers in interval | Integral over interval |
| Typical examples | Binomial, Poisson, Categorical | Gaussian, Exponential, Uniform |
Unlike probabilities, density values can be greater than 1. For example, the uniform distribution on [0, 0.5] has density $f(x) = 2$ for $x \in [0, 0.5]$. The total area under the curve (probability) is still 1, but the height (density) is 2. This is a common source of confusion when first encountering continuous distributions.
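A quick check of the uniform example above (a sketch; note that `scipy.stats.uniform` parameterizes the distribution by `loc` and `scale`, i.e. the interval `[loc, loc + scale]`):

```python
from scipy import stats

# Uniform on [0, 0.5]: the density is 1 / 0.5 = 2 everywhere on the interval
u = stats.uniform(loc=0, scale=0.5)

print(f"Density at x=0.25: {u.pdf(0.25):.1f}")              # 2.0, a valid density greater than 1
print(f"P(0 <= X <= 0.5):  {u.cdf(0.5) - u.cdf(0.0):.1f}")  # the total probability is still 1
```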
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Demonstrate PMF vs PDF concepts
print("=" * 60)
print("PMF vs PDF DEMONSTRATION")
print("=" * 60)

# PMF Example: Poisson distribution (discrete)
print("\n--- PMF: Poisson Distribution (λ=3) ---")
poisson = stats.poisson(mu=3)
for x in range(10):
    pmf_val = poisson.pmf(x)
    print(f"P(X = {x}) = {pmf_val:.4f}")
print(f"Sum of PMF values: {sum(poisson.pmf(x) for x in range(100)):.6f}")

# PDF Example: Normal distribution (continuous)
print("\n--- PDF: Normal Distribution (μ=0, σ=0.3) ---")
normal = stats.norm(loc=0, scale=0.3)
x_values = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
for x in x_values:
    pdf_val = normal.pdf(x)
    print(f"f(X = {x}) = {pdf_val:.4f}")  # Note: can exceed 1!

print(f"\nPDF at x=0: {normal.pdf(0):.4f}")  # > 1 because σ is small
print(f"This exceeds 1, but the INTEGRAL equals: {normal.cdf(np.inf) - normal.cdf(-np.inf):.6f}")

# Show that point probabilities are zero
print("\n--- Point vs Interval Probability ---")
print(f"P(X = 0.0 exactly) = {0.0:.10f} (always zero for continuous)")
print(f"P(-0.1 < X < 0.1) = {normal.cdf(0.1) - normal.cdf(-0.1):.6f}")
print(f"P(-0.01 < X < 0.01) = {normal.cdf(0.01) - normal.cdf(-0.01):.6f}")
print(f"P(-0.001 < X < 0.001) = {normal.cdf(0.001) - normal.cdf(-0.001):.6f}")
print("\nAs interval shrinks → probability → 0")

# Demonstrate relative likelihood interpretation
print("\n--- Relative Likelihood Interpretation ---")
# Compare densities at two points
x1, x2 = 0, 1
f_x1, f_x2 = normal.pdf(x1), normal.pdf(x2)
print(f"Density at x={x1}: {f_x1:.4f}")
print(f"Density at x={x2}: {f_x2:.4f}")
print(f"Ratio: {f_x1 / f_x2:.2f}x more likely near x={x1} than near x={x2}")
```

The elegant insight that enables Gaussian Naive Bayes is that we can substitute probability densities for probability mass in Bayes' theorem, and the classification still works correctly.
For discrete features: $$P(y = k | \mathbf{x}) = \frac{P(\mathbf{x} | y = k) P(y = k)}{P(\mathbf{x})}$$
We classify by finding: $$\hat{y} = \arg\max_k P(y = k | \mathbf{x}) = \arg\max_k P(\mathbf{x} | y = k) P(y = k)$$
For continuous features, we replace probabilities with probability densities: $$f(y = k | \mathbf{x}) \propto f(\mathbf{x} | y = k) P(y = k)$$
Where $f(\mathbf{x} | y = k)$ is the class-conditional density of the features, and $P(y = k)$ is the class prior, which remains an ordinary (discrete) probability.
We classify by finding: $$\hat{y} = \arg\max_k f(\mathbf{x} | y = k) P(y = k)$$
The key insight is that classification only ever compares classes against one another. For two classes, the decision reduces to the ratio: $$\frac{f(\mathbf{x} | y = 1) P(y = 1)}{f(\mathbf{x} | y = 0) P(y = 0)}$$
Any constants that don't depend on class $k$ cancel out. The densities serve as relative likelihoods—telling us which class makes the observed data more likely.
When we use $f(\mathbf{x} | y = k)$ for classification, we're computing the likelihood of the data under each class model. Higher density means the observed features are more consistent with that class's distribution. We don't need actual probabilities—we just need to compare which class better 'explains' the observed features.
The formal justification involves considering infinitesimally small regions around the observed point. For a small region $\mathcal{B}$ around $\mathbf{x}$:
$$P(y = k | \mathbf{x} \in \mathcal{B}) = \frac{P(\mathbf{x} \in \mathcal{B} | y = k) P(y = k)}{P(\mathbf{x} \in \mathcal{B})}$$
As $\mathcal{B}$ shrinks to a point: $$P(\mathbf{x} \in \mathcal{B} | y = k) \approx f(\mathbf{x} | y = k) \cdot |\mathcal{B}|$$
The volume $|\mathcal{B}|$ appears in both numerator and denominator, canceling out: $$P(y = k | \mathbf{x}) = \lim_{|\mathcal{B}| \to 0} \frac{f(\mathbf{x} | y = k) \cdot |\mathcal{B}| \cdot P(y = k)}{\sum_{j} f(\mathbf{x} | y = j) \cdot |\mathcal{B}| \cdot P(y = j)}$$
$$= \frac{f(\mathbf{x} | y = k) P(y = k)}{\sum_{j} f(\mathbf{x} | y = j) P(y = j)}$$
This is a proper posterior probability, summing to 1 across all classes.
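This shrinking-region argument can be checked numerically: compute the posterior from actual interval probabilities $P(x - \epsilon < X < x + \epsilon | y = k)$ and watch it converge to the density-based posterior as $\epsilon \to 0$. The sketch below uses two illustrative Gaussian classes and equal priors (all parameter values are arbitrary choices for demonstration).

```python
import numpy as np
from scipy import stats

# Two illustrative class-conditional distributions and equal priors
class_0 = stats.norm(loc=40, scale=5)
class_1 = stats.norm(loc=60, scale=5)
priors = np.array([0.5, 0.5])
x = 47.0

# Density-based posterior (the limit)
num = np.array([class_0.pdf(x), class_1.pdf(x)]) * priors
density_posterior = num / num.sum()
print(f"Density-based P(y=0|x) = {density_posterior[0]:.6f}")

# Interval-based posterior for shrinking intervals around x
for eps in [5.0, 1.0, 0.1, 0.01]:
    p0 = class_0.cdf(x + eps) - class_0.cdf(x - eps)
    p1 = class_1.cdf(x + eps) - class_1.cdf(x - eps)
    num = np.array([p0, p1]) * priors
    post = num / num.sum()
    print(f"eps={eps:<5} interval-based P(y=0|x) = {post[0]:.6f}")
```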
```python
import numpy as np
from scipy import stats

def density_based_classification(x, class_distributions, priors):
    """
    Classify a point using density-based Bayes' theorem.

    Parameters:
    -----------
    x : float or array
        The observation(s) to classify
    class_distributions : list of scipy.stats distributions
        Class-conditional density functions
    priors : array
        Prior probabilities for each class

    Returns:
    --------
    predicted_class : int
        The predicted class label
    posteriors : array
        Posterior probability for each class
    """
    n_classes = len(class_distributions)

    # Compute numerator for each class: f(x|k) * P(k)
    numerators = np.array([
        class_distributions[k].pdf(x) * priors[k]
        for k in range(n_classes)
    ])

    # Normalize to get proper posterior probabilities
    denominator = np.sum(numerators)
    posteriors = numerators / denominator

    predicted_class = np.argmax(posteriors)
    return predicted_class, posteriors

# Example: Two Gaussian classes
print("=" * 60)
print("DENSITY-BASED BAYESIAN CLASSIFICATION")
print("=" * 60)

# Define class-conditional distributions
class_0 = stats.norm(loc=40, scale=5)  # Class 0: mean=40, std=5
class_1 = stats.norm(loc=60, scale=5)  # Class 1: mean=60, std=5

distributions = [class_0, class_1]
priors = np.array([0.5, 0.5])  # Equal priors

# Classify several test points
test_points = [30, 40, 50, 60, 70]

print("\nClass 0: N(μ=40, σ=5)")
print("Class 1: N(μ=60, σ=5)")
print("Priors: P(class 0) = P(class 1) = 0.5")
print()

print(f"{'x':>6} | {'f(x|0)':>10} | {'f(x|1)':>10} | {'P(0|x)':>8} | {'P(1|x)':>8} | {'Pred':>6}")
print("-" * 65)

for x in test_points:
    pred, posteriors = density_based_classification(x, distributions, priors)
    f_0 = class_0.pdf(x)
    f_1 = class_1.pdf(x)
    print(f"{x:>6} | {f_0:>10.6f} | {f_1:>10.6f} | {posteriors[0]:>8.4f} | {posteriors[1]:>8.4f} | {pred:>6}")

print()
print("Note: Decision boundary is at x=50 where posteriors are equal")

# Show effect of unequal priors
print("\n" + "=" * 60)
print("EFFECT OF UNEQUAL PRIORS")
print("=" * 60)

priors_unequal = np.array([0.8, 0.2])  # Class 0 is more common
print(f"\nNew priors: P(class 0) = {priors_unequal[0]}, P(class 1) = {priors_unequal[1]}")
print()

x = 50  # The old decision boundary
pred_equal, post_equal = density_based_classification(x, distributions, np.array([0.5, 0.5]))
pred_unequal, post_unequal = density_based_classification(x, distributions, priors_unequal)

print(f"At x = {x}:")
print(f"  Equal priors:   P(0|x) = {post_equal[0]:.4f}, P(1|x) = {post_equal[1]:.4f} → Predict {pred_equal}")
print(f"  Unequal priors: P(0|x) = {post_unequal[0]:.4f}, P(1|x) = {post_unequal[1]:.4f} → Predict {pred_unequal}")
print("\nHigher prior for class 0 shifts decision boundary toward class 1!")
```

While any continuous density function could theoretically be used in the Bayesian framework, the Gaussian (normal) distribution is the canonical choice for continuous features. This preference is not arbitrary—it rests on deep theoretical and practical foundations.
Perhaps the most compelling argument comes from the Central Limit Theorem (CLT):
The sum (or average) of many independent random variables tends toward a normal distribution, regardless of the original distribution.
Many real-world measurements arise as the aggregate of many small, independent factors: adult height, for example, reflects the combined effect of many genetic and environmental influences, and measurement noise is typically the sum of many small, independent error sources.
Consequence: When we don't know the true distribution, Gaussian is often a reasonable assumption for measurements that result from aggregated effects.
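A small simulation illustrates the CLT claim (a sketch; the exponential distribution, sample sizes, and seed are arbitrary choices): averages of many skewed draws look increasingly Gaussian, as measured here by skewness and excess kurtosis shrinking toward zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Averages of n draws from a strongly skewed distribution (exponential)
for n in [1, 5, 30, 200]:
    averages = rng.exponential(scale=1.0, size=(20000, n)).mean(axis=1)
    print(f"n={n:>4}: skewness={stats.skew(averages):+.3f}, "
          f"excess kurtosis={stats.kurtosis(averages):+.3f}")

# A Gaussian has skewness 0 and excess kurtosis 0; both shrink toward 0 as n grows.
```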
Among all continuous distributions with a given mean and variance, the Gaussian distribution has maximum entropy. In information-theoretic terms:
$$H[X] = -\int f(x) \log f(x) dx$$
Maximizing entropy subject to constraints on mean and variance yields the Gaussian. This means the Gaussian commits to nothing beyond the first two moments: it is the least-assuming distribution consistent with a known mean and variance.
Maximum entropy is a principled way to choose distributions: make the fewest assumptions beyond what the data tells you. If you know only the mean and variance of a continuous random variable, the Gaussian is the unique distribution that maximizes uncertainty (entropy) while respecting these constraints. Using any other distribution implicitly assumes you know more than you actually do.
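The maximum-entropy claim can be checked numerically: among distributions matched to the same variance (here unit variance), the Gaussian has the largest differential entropy. This sketch uses `scipy.stats` distributions parameterized so that each has mean 0 and variance 1; the particular comparison distributions are our own choice.

```python
import numpy as np
from scipy import stats

# All three distributions below have mean 0 and variance 1
candidates = {
    "Gaussian": stats.norm(loc=0, scale=1),
    "Uniform":  stats.uniform(loc=-np.sqrt(3), scale=2 * np.sqrt(3)),  # width sqrt(12)
    "Laplace":  stats.laplace(loc=0, scale=1 / np.sqrt(2)),            # variance = 2 * b^2
}

for name, dist in candidates.items():
    print(f"{name:>8}: variance = {dist.var():.3f}, "
          f"differential entropy = {dist.entropy():.4f}")

# The Gaussian's entropy (about 1.4189 nats) exceeds the others, as the theorem predicts.
```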
Gaussian distributions have remarkable mathematical properties:
Closed-form density: Easy to evaluate $$f(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
Simple parameter estimation: MLE gives $\hat{\mu} = \bar{x}$ and $\hat{\sigma}^2 = \frac{1}{n}\sum_i(x_i - \bar{x})^2$ (checked numerically in the sketch after this list)
Log-density is quadratic: Enables linear algebra in log-space $$\log f(x | \mu, \sigma^2) = -\frac{(x-\mu)^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2)$$
Closure under linear operations: Sums of Gaussians are Gaussian
Conjugate prior structure: Enables Bayesian inference
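As a quick check of the MLE property listed above (a sketch; the simulated data are arbitrary, and `scipy.stats.norm.fit` is used only because it returns the maximum-likelihood `loc` and `scale` for a Gaussian):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(loc=10.0, scale=3.0, size=5000)

# Closed-form MLE: sample mean and (biased) sample variance
mu_hat = x.mean()
sigma2_hat = np.mean((x - mu_hat) ** 2)

# scipy's numerical MLE for comparison
loc_fit, scale_fit = stats.norm.fit(x)

print(f"Closed-form MLE: mu = {mu_hat:.4f}, sigma^2 = {sigma2_hat:.4f}")
print(f"scipy norm.fit:  mu = {loc_fit:.4f}, sigma^2 = {scale_fit**2:.4f}")
# The two agree: the Gaussian MLE has a simple closed form, no iterative optimization needed.
```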
Empirical evidence consistently shows that Gaussian Naive Bayes trains quickly, scales easily to many features, and often remains competitive even when the class-conditional feature distributions are only approximately Gaussian.
Having established that we can use probability densities in Bayes' theorem and that Gaussians are a principled choice, we now apply the naive Bayes assumption to factorize the joint class-conditional density.
For a feature vector $\mathbf{x} = (x_1, x_2, \ldots, x_d)$, the naive Bayes assumption states: $$f(\mathbf{x} | y = k) = \prod_{j=1}^{d} f(x_j | y = k)$$
Features are assumed to be conditionally independent given the class label. This allows us to model each feature's distribution separately within each class.
Under Gaussian Naive Bayes, each univariate conditional density is Gaussian: $$f(x_j | y = k) = \mathcal{N}(x_j | \mu_{jk}, \sigma_{jk}^2) = \frac{1}{\sqrt{2\pi\sigma_{jk}^2}} \exp\left(-\frac{(x_j - \mu_{jk})^2}{2\sigma_{jk}^2}\right)$$
Where $\mu_{jk}$ and $\sigma_{jk}^2$ are the mean and variance of feature $j$ among the training samples of class $k$.
The full Gaussian Naive Bayes classifier is: $$f(\mathbf{x} | y = k) = \prod_{j=1}^{d} \frac{1}{\sqrt{2\pi\sigma_{jk}^2}} \exp\left(-\frac{(x_j - \mu_{jk})^2}{2\sigma_{jk}^2}\right)$$
Taking the logarithm (for numerical stability and convenience): $$\log f(\mathbf{x} | y = k) = -\frac{1}{2} \sum_{j=1}^{d} \left[ \frac{(x_j - \mu_{jk})^2}{\sigma_{jk}^2} + \log(2\pi\sigma_{jk}^2) \right]$$
To classify a new point $\mathbf{x}$, compute: $$\hat{y} = \arg\max_k \left[ \log P(y = k) + \sum_{j=1}^{d} \log f(x_j | y = k) \right]$$
$$= \arg\max_k \left[ \log \pi_k - \frac{1}{2} \sum_{j=1}^{d} \left( \frac{(x_j - \mu_{jk})^2}{\sigma_{jk}^2} + \log \sigma_{jk}^2 \right) \right]$$
Where $\pi_k = P(y = k)$ is the class prior.
Always work in log-space when implementing Naive Bayes. Multiplying many small probabilities or densities causes numerical underflow (the product rounds to zero). Adding log-densities is numerically stable and computationally efficient. Most implementations use log-likelihoods throughout.
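A tiny demonstration of why this matters (a sketch with made-up density values): multiplying many small densities underflows to zero in floating point, while summing their logarithms stays well-behaved.

```python
import numpy as np

# Pretend each of 500 features contributes a small class-conditional density
densities = np.full(500, 1e-3)

product = np.prod(densities)          # underflows: 1e-1500 is far below the float64 range
log_sum = np.sum(np.log(densities))   # perfectly representable

print(f"Direct product:    {product}")       # 0.0
print(f"Sum of log values: {log_sum:.1f}")   # about -3453.9 (= 500 * ln(1e-3))
```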
```python
import numpy as np
from scipy import stats

class GaussianNaiveBayesFromScratch:
    """
    Gaussian Naive Bayes classifier built from first principles.

    This implementation clearly shows the mathematical structure:
    - Each feature j in each class k has its own Gaussian distribution
    - Classification uses log-densities for numerical stability
    """

    def __init__(self):
        self.classes_ = None
        self.means_ = None      # Shape: (n_classes, n_features)
        self.variances_ = None  # Shape: (n_classes, n_features)
        self.priors_ = None     # Shape: (n_classes,)

    def fit(self, X, y):
        """
        Estimate Gaussian parameters for each feature in each class.
        """
        n_samples, n_features = X.shape
        self.classes_ = np.unique(y)
        n_classes = len(self.classes_)

        # Initialize parameter arrays
        self.means_ = np.zeros((n_classes, n_features))
        self.variances_ = np.zeros((n_classes, n_features))
        self.priors_ = np.zeros(n_classes)

        for k, cls in enumerate(self.classes_):
            X_cls = X[y == cls]  # Samples belonging to class k
            n_cls = len(X_cls)

            # Prior: P(y = k) = fraction of samples in class k
            self.priors_[k] = n_cls / n_samples

            # Mean and variance for each feature in this class
            self.means_[k, :] = X_cls.mean(axis=0)
            self.variances_[k, :] = X_cls.var(axis=0)

        # Add small epsilon for numerical stability (avoid division by zero)
        self.variances_ = np.maximum(self.variances_, 1e-9)

        return self

    def _log_likelihood(self, X, class_idx):
        """
        Compute log-likelihood of samples under class distribution.

        log f(x | y=k) = sum_j log f(x_j | y=k)
                       = -0.5 * sum_j [(x_j - μ_jk)² / σ²_jk + log(2π σ²_jk)]
        """
        means = self.means_[class_idx]
        variances = self.variances_[class_idx]

        # Compute log density for each feature, then sum
        log_densities = -0.5 * (
            ((X - means) ** 2) / variances
            + np.log(2 * np.pi * variances)
        )

        # Sum across features (independence assumption)
        return log_densities.sum(axis=1)

    def predict_log_proba(self, X):
        """
        Compute log-posterior probabilities for each class.
        """
        n_samples = X.shape[0]
        n_classes = len(self.classes_)

        # log P(x|k) + log P(k) for each class
        log_posteriors = np.zeros((n_samples, n_classes))
        for k in range(n_classes):
            log_posteriors[:, k] = (
                self._log_likelihood(X, k) + np.log(self.priors_[k])
            )

        # Normalize using log-sum-exp trick
        log_sum = np.log(np.exp(log_posteriors - log_posteriors.max(axis=1, keepdims=True)).sum(axis=1, keepdims=True))
        log_posteriors = log_posteriors - log_posteriors.max(axis=1, keepdims=True) - log_sum

        return log_posteriors

    def predict_proba(self, X):
        """Convert log probabilities to probabilities."""
        return np.exp(self.predict_log_proba(X))

    def predict(self, X):
        """Predict class labels."""
        return self.classes_[np.argmax(self.predict_log_proba(X), axis=1)]

# Demonstration
np.random.seed(42)

# Generate synthetic 2-class data with 3 features
n_per_class = 200

# Class 0: lower values on all features
X_0 = np.column_stack([
    np.random.normal(50, 10, n_per_class),
    np.random.normal(30, 5, n_per_class),
    np.random.normal(100, 20, n_per_class)
])

# Class 1: higher values on all features
X_1 = np.column_stack([
    np.random.normal(70, 10, n_per_class),
    np.random.normal(50, 5, n_per_class),
    np.random.normal(140, 20, n_per_class)
])

X = np.vstack([X_0, X_1])
y = np.array([0] * n_per_class + [1] * n_per_class)

# Shuffle
shuffle_idx = np.random.permutation(len(y))
X, y = X[shuffle_idx], y[shuffle_idx]

# Split into train/test
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

# Train and evaluate
gnb = GaussianNaiveBayesFromScratch()
gnb.fit(X_train, y_train)

print("=" * 60)
print("GAUSSIAN NAIVE BAYES: LEARNED PARAMETERS")
print("=" * 60)

for k in range(2):
    print(f"\nClass {k}:")
    print(f"  Prior P(y={k}) = {gnb.priors_[k]:.4f}")
    for j in range(3):
        print(f"  Feature {j}: μ = {gnb.means_[k,j]:.2f}, σ² = {gnb.variances_[k,j]:.2f}")

# Predictions
y_pred = gnb.predict(X_test)
accuracy = np.mean(y_pred == y_test)

print(f"\nTest Accuracy: {accuracy:.4f}")

# Example predictions with probabilities
print("\n" + "=" * 60)
print("EXAMPLE PREDICTIONS")
print("=" * 60)
for i in range(5):
    probs = gnb.predict_proba(X_test[i:i+1])[0]
    print(f"Sample {i}: True={y_test[i]}, Pred={y_pred[i]}, P(y=0)={probs[0]:.4f}, P(y=1)={probs[1]:.4f}")
```

Understanding the number of parameters in Gaussian Naive Bayes reveals why it's such an efficient model and why the naive independence assumption is so powerful.
For a problem with $K$ classes, $d$ features:
Class priors: $K - 1$ free parameters (the $K$-th is determined by constraint $\sum_k \pi_k = 1$)
Gaussian parameters: one mean $\mu_{jk}$ and one variance $\sigma_{jk}^2$ per feature per class, i.e. $Kd$ means plus $Kd$ variances, for $2Kd$ in total
Total parameters: $$\text{Parameters} = (K - 1) + 2Kd = 2Kd + K - 1$$
For typical cases this stays small: $K = 2$, $d = 10$ requires only $41$ parameters, and even $K = 10$, $d = 100$ requires just $2{,}009$ (see the table below).
Without the naive Bayes assumption, we would model the full class-conditional distribution as a multivariate Gaussian: $$f(\mathbf{x} | y = k) = \mathcal{N}(\mathbf{x} | \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
A full $d \times d$ covariance matrix has $\frac{d(d+1)}{2}$ parameters per class: $$\text{Full model parameters} = Kd + K \cdot \frac{d(d+1)}{2} = Kd + \frac{Kd(d+1)}{2}$$
For 10 classes, 100 features: $10 \times 100 + 10 \times \frac{100 \times 101}{2} = 1000 + 50,500 = 51,500$ parameters!
| Model | Formula | K=2, d=10 | K=10, d=100 |
|---|---|---|---|
| Gaussian Naive Bayes | $2Kd + K - 1$ | 41 | 2,009 |
| Full Covariance GDA | $Kd + \frac{Kd(d+1)}{2}$ | 130 | 51,500 |
| Ratio (Full/NB) | $\approx (d+3)/4$ | 3.2× | 25.6× |
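The table values follow directly from the two formulas; here is a short sketch to reproduce them:

```python
def gnb_params(K, d):
    """Gaussian Naive Bayes: (K-1) priors + K*d means + K*d variances."""
    return 2 * K * d + K - 1

def full_cov_params(K, d):
    """Full-covariance Gaussian model: K*d means + K*d(d+1)/2 covariance entries."""
    return K * d + K * d * (d + 1) // 2

for K, d in [(2, 10), (10, 100)]:
    nb, full = gnb_params(K, d), full_cov_params(K, d)
    print(f"K={K:>2}, d={d:>3}: GNB={nb:>6}, Full={full:>6}, ratio={full / nb:.1f}x")

# K= 2, d= 10: GNB=    41, Full=   130, ratio=3.2x
# K=10, d=100: GNB=  2009, Full= 51500, ratio=25.6x
```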
Gaussian Naive Bayes has fewer parameters (lower variance, less overfitting risk) but makes a stronger assumption (higher bias). The full covariance model is more flexible but requires more data to estimate reliably. In high dimensions with limited data, the naive model often outperforms the 'correct' model because it can be estimated more reliably.
The naive Bayes assumption is equivalent to assuming a diagonal covariance matrix:
$$\boldsymbol{\Sigma}_k = \begin{pmatrix} \sigma_{1k}^2 & 0 & \cdots & 0 \\ 0 & \sigma_{2k}^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_{dk}^2 \end{pmatrix}$$
Zero off-diagonal entries mean zero correlation between features (conditional on class). This is exactly the conditional independence assumption expressed in linear algebra.
Key insight: Gaussian Naive Bayes assumes that knowing feature $x_1$ tells us nothing about feature $x_2$, once we know the class. The features may still be marginally correlated (in the overall data), but within each class, they are assumed independent.
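This distinction between marginal and class-conditional correlation is easy to simulate (a sketch; the particular means, spread, and sample size are arbitrary): two features generated independently within each class can still be strongly correlated when the classes are pooled.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Within each class, the two features are independent (diagonal covariance),
# but both means shift together with the class label.
class_0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(n, 2))
class_1 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(n, 2))
pooled = np.vstack([class_0, class_1])

print(f"Within class 0 corr:    {np.corrcoef(class_0.T)[0, 1]:+.3f}")  # approximately 0
print(f"Within class 1 corr:    {np.corrcoef(class_1.T)[0, 1]:+.3f}")  # approximately 0
print(f"Marginal (pooled) corr: {np.corrcoef(pooled.T)[0, 1]:+.3f}")   # strongly positive
```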
With fewer parameters, Gaussian Naive Bayes can be trained reliably with less data:
Rule of thumb: aim for at least 5-10× as many samples as parameters for reliable estimation. For the 10-class, 100-feature example above, Gaussian Naive Bayes has about 2,000 parameters and so needs on the order of 10,000-20,000 samples, while the full-covariance model's 51,500 parameters would call for several hundred thousand.
This explains why Naive Bayes often outperforms more sophisticated models on small-to-medium datasets.
We have established the theoretical and practical foundations for applying Naive Bayes to continuous features. Let us consolidate the key concepts:

- Counting-based probability estimates fail for continuous features: every exact value has probability zero, and discretization introduces arbitrary, unstable bin boundaries.
- Probability density functions replace probability mass functions. Densities are not probabilities (they can exceed 1), but they integrate to 1 and indicate where values concentrate.
- Densities can be substituted for probabilities in Bayes' theorem: the infinitesimal volume cancels between numerator and denominator, so the posterior remains a proper probability.
- Densities act as relative likelihoods, letting us compare which class best explains the observed features rather than demanding absolute probabilities.
- The Gaussian is a principled default (Central Limit Theorem, maximum entropy, mathematical convenience), and the naive independence assumption corresponds to a diagonal covariance matrix with only $2Kd + K - 1$ parameters.
What's next:
Having understood why and how we model continuous features with Gaussians, the next page dives deep into the Gaussian distribution itself. We will explore its mathematical properties, the role of mean and variance parameters, and develop intuition for how Gaussian shapes encode information about data distributions.
You now understand the fundamental challenge of continuous features and why Gaussian Naive Bayes provides an elegant solution. The key insight is that probability densities serve as relative likelihoods—even though point probabilities are zero, we can still compare which class better explains the observed data. Next, we explore the Gaussian distribution in depth.