The Gaussian distribution—the iconic bell curve—appears with remarkable frequency in nature, science, and engineering. From the distribution of measurement errors to the heights of human populations, from thermal noise in electronics to test scores in educational assessment, the Gaussian emerges again and again as the natural description of continuous phenomena.
In Gaussian Naive Bayes, we make a specific assumption: within each class, each feature follows a Gaussian distribution. This assumption transforms the abstract problem of density estimation into a simple parameter estimation problem. But to use this assumption wisely, we must deeply understand what it means.
What does it mean for a feature to be Gaussian? How do the parameters $\mu$ (mean) and $\sigma^2$ (variance) shape the distribution? When is this assumption reasonable, and when might it lead us astray? This page answers these questions with mathematical rigor and practical insight.
By the end of this page, you will understand: (1) the mathematical form of the Gaussian PDF and its key properties, (2) how $\mu$ and $\sigma^2$ control location and spread, (3) the 68-95-99.7 rule and standard normal form, (4) when the Gaussian assumption is justified, and (5) diagnostic techniques for checking Gaussianity.
The univariate Gaussian distribution is defined by its probability density function:
$$f(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
Alternatively written as: $$f(x | \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
We denote this distribution as $\mathcal{N}(\mu, \sigma^2)$ or $X \sim \mathcal{N}(\mu, \sigma^2)$.
The exponential term: $\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
This is the heart of the Gaussian. It's a quadratic function of $(x - \mu)$ inside a negative exponential: the density falls off symmetrically, and increasingly fast, as $x$ moves away from $\mu$.
The normalization constant: $\frac{1}{\sqrt{2\pi\sigma^2}}$
This ensures the density integrates to 1: $$\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) dx = 1$$
The $\sqrt{2\pi}$ comes from the Gaussian integral $\int_{-\infty}^{\infty} e^{-x^2/2} dx = \sqrt{2\pi}$, one of the most beautiful results in mathematics.
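As a sanity check, this identity is easy to confirm numerically; a minimal sketch using `scipy.integrate.quad`:

```python
import numpy as np
from scipy.integrate import quad

# Numerically evaluate the Gaussian integral over the whole real line
value, abserr = quad(lambda t: np.exp(-t**2 / 2), -np.inf, np.inf)

print(value)               # ≈ 2.5066282746
print(np.sqrt(2 * np.pi))  # the closed-form answer √(2π)
```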
The negative quadratic exponent creates the characteristic bell shape: rapidly increasing toward the mean, then rapidly decreasing away from it. Any positive quadratic function in the exponent would explode to infinity; only negative quadratics give integrable, well-behaved densities. This is why quadratic loss functions and Gaussian distributions are intimately connected.
```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def gaussian_pdf(x, mu, sigma_sq):
    """
    Compute Gaussian PDF manually to understand each component.
    """
    normalization = 1.0 / np.sqrt(2 * np.pi * sigma_sq)
    exponent = -((x - mu) ** 2) / (2 * sigma_sq)
    return normalization * np.exp(exponent)

# Verify our implementation matches scipy
x = np.linspace(-5, 5, 1000)
mu, sigma = 0, 1
pdf_manual = gaussian_pdf(x, mu, sigma**2)
pdf_scipy = stats.norm(mu, sigma).pdf(x)

print("=" * 60)
print("GAUSSIAN PDF VERIFICATION")
print("=" * 60)
print(f"Max difference between manual and scipy: {np.max(np.abs(pdf_manual - pdf_scipy)):.2e}")

# Demonstrate PDF properties
print("\n" + "=" * 60)
print("KEY PDF PROPERTIES")
print("=" * 60)

# Property 1: Maximum at the mean
mu, sigma = 3.0, 2.0
x_vals = [mu - 2*sigma, mu - sigma, mu, mu + sigma, mu + 2*sigma]
for x_val in x_vals:
    density = gaussian_pdf(x_val, mu, sigma**2)
    print(f"f({x_val:.1f}) = {density:.6f}")
print(f"Maximum is at x = μ = {mu}")

# Property 2: Symmetry
print("\nSymmetry check:")
for delta in [0.5, 1.0, 2.0]:
    left = gaussian_pdf(mu - delta, mu, sigma**2)
    right = gaussian_pdf(mu + delta, mu, sigma**2)
    print(f"f(μ - {delta}) = {left:.6f}, f(μ + {delta}) = {right:.6f}, Equal: {np.isclose(left, right)}")

# Property 3: Integration to 1
integral, _ = quad(lambda x: gaussian_pdf(x, mu, sigma**2), -np.inf, np.inf)
print(f"\nIntegral of PDF: {integral:.10f} (should be 1.0)")

# Property 4: Density can exceed 1 for small variance
print("\nDensity values for different variances (at x = μ):")
for sigma_val in [2.0, 1.0, 0.5, 0.3, 0.1]:
    density_at_mean = gaussian_pdf(mu, mu, sigma_val**2)
    print(f"σ = {sigma_val}: f(μ) = {density_at_mean:.4f}")
print("Note: Density > 1 is perfectly valid! Only AREA must equal 1.")

# Property 5: Tail behavior (Gaussian tails are light)
print("\nTail probabilities (how much probability beyond k standard deviations):")
normal = stats.norm(0, 1)
for k in [1, 2, 3, 4, 5]:
    tail_prob = 2 * (1 - normal.cdf(k))  # Two-tailed
    print(f"P(|X| > {k}σ) = {tail_prob:.6f} = 1 in {1/tail_prob:.0f}")
```

The Gaussian distribution is fully specified by just two parameters: the mean $\mu$ and variance $\sigma^2$. Understanding their roles geometrically and statistically is essential for Gaussian Naive Bayes.
The mean determines where the distribution is centered:
$$E[X] = \int_{-\infty}^{\infty} x \cdot f(x|\mu, \sigma^2) dx = \mu$$
Geometric interpretation: $\mu$ is the location of the peak and the axis of symmetry; shifting $\mu$ slides the entire curve along the x-axis without changing its shape.

In Gaussian Naive Bayes: each feature-class pair gets its own mean $\mu_{jk}$, estimated from the training data, representing the typical value of feature $j$ within class $k$.
The variance determines how spread out the distribution is:
$$\text{Var}[X] = E[(X - \mu)^2] = \int_{-\infty}^{\infty} (x - \mu)^2 \cdot f(x|\mu, \sigma^2) dx = \sigma^2$$
Geometric interpretation: $\sigma^2$ controls the spread; larger variance flattens and widens the bell, smaller variance concentrates the density into a sharp, narrow peak around $\mu$.

In Gaussian Naive Bayes: each feature-class pair gets its own variance $\sigma^2_{jk}$, capturing how variable feature $j$ is within class $k$; low-variance features contribute sharply peaked likelihoods.
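The two moment integrals above are easy to confirm numerically; a small sketch with the illustrative parameters $\mu = 3$, $\sigma^2 = 4$:

```python
import numpy as np
from scipy.integrate import quad

mu, sigma_sq = 3.0, 4.0  # illustrative parameters
pdf = lambda x: np.exp(-(x - mu)**2 / (2 * sigma_sq)) / np.sqrt(2 * np.pi * sigma_sq)

# E[X] = ∫ x f(x) dx should equal μ
mean, _ = quad(lambda x: x * pdf(x), -np.inf, np.inf)
# Var[X] = ∫ (x - μ)² f(x) dx should equal σ²
var, _ = quad(lambda x: (x - mu)**2 * pdf(x), -np.inf, np.inf)

print(mean)  # close to 3.0
print(var)   # close to 4.0
```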
| Parameter Change | Effect on PDF | Effect on Classification |
|---|---|---|
| Increase $\mu$ | Shift curve rightward | Move class decision region rightward |
| Decrease $\mu$ | Shift curve leftward | Move class decision region leftward |
| Increase $\sigma^2$ | Flatten and widen curve | Broaden contribution to posterior |
| Decrease $\sigma^2$ | Sharpen and narrow curve | Concentrate probability near mean |
The standard deviation $\sigma = \sqrt{\sigma^2}$ is often more interpretable than variance because it has the same units as the data:
The standard deviation directly measures typical deviation from the mean: $$\sigma = \sqrt{E[(X - \mu)^2]}$$
The standard normal distribution has $\mu = 0$ and $\sigma^2 = 1$: $$\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$$
Any Gaussian can be standardized to the standard normal: $$\text{If } X \sim \mathcal{N}(\mu, \sigma^2), \text{ then } Z = \frac{X - \mu}{\sigma} \sim \mathcal{N}(0, 1)$$
This transformation centers the data (subtract mean) and scales it (divide by standard deviation).
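A quick numerical sketch of standardization (sample size and parameters are illustrative): standardized samples have mean near 0 and standard deviation near 1, and the densities relate by $f_X(x) = \phi\!\left(\frac{x-\mu}{\sigma}\right)/\sigma$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma = 10.0, 2.5                 # illustrative parameters
x = rng.normal(mu, sigma, 100_000)

# Standardize: subtract the mean, divide by the standard deviation
z = (x - mu) / sigma
print(z.mean(), z.std())              # close to 0 and 1

# Densities relate by a factor of 1/σ: f_X(x) = φ((x - μ)/σ) / σ
x0 = 12.0
lhs = stats.norm(mu, sigma).pdf(x0)
rhs = stats.norm(0, 1).pdf((x0 - mu) / sigma) / sigma
print(lhs, rhs)                       # agree to machine precision
```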
```python
import numpy as np
from scipy import stats

def demonstrate_parameter_effects():
    """
    Show how μ and σ² affect the Gaussian distribution.
    """
    print("=" * 60)
    print("EFFECT OF PARAMETERS ON GAUSSIAN PDF")
    print("=" * 60)

    x_test = 5.0  # Test point

    # Effect of mean (μ)
    print("\n--- Effect of Mean (μ) ---")
    print(f"Evaluating density at x = {x_test}")
    sigma = 2.0
    for mu in [0, 3, 5, 7, 10]:
        density = stats.norm(mu, sigma).pdf(x_test)
        distance = abs(x_test - mu)
        print(f"μ = {mu:2d}: f({x_test}) = {density:.6f}, distance from peak = {distance:.1f}")
    print("→ Density is highest when x is closest to μ")

    # Effect of variance (σ²)
    print("\n--- Effect of Variance (σ²) ---")
    print(f"Evaluating density at x = {x_test}, with μ = {x_test}")
    mu = 5.0
    for sigma in [0.5, 1.0, 2.0, 4.0, 8.0]:
        density = stats.norm(mu, sigma).pdf(x_test)
        print(f"σ = {sigma:.1f}: f({x_test}) = {density:.6f}")
    print("→ Smaller σ gives higher peak density (more concentrated)")

    # But what about points away from the mean?
    print("\n--- Density at x = 3.0 (below mean of 5.0) ---")
    x_off = 3.0
    for sigma in [0.5, 1.0, 2.0, 4.0, 8.0]:
        density = stats.norm(mu, sigma).pdf(x_off)
        print(f"σ = {sigma:.1f}: f({x_off}) = {density:.6f}")
    print("→ Larger σ assigns more density to points far from mean")

    # This has classification implications
    print("\n" + "=" * 60)
    print("CLASSIFICATION IMPLICATIONS")
    print("=" * 60)

    # Two classes with the same mean but different variances
    class_0 = stats.norm(0, 1)  # Class 0: μ=0, σ=1 (tight)
    class_1 = stats.norm(0, 3)  # Class 1: μ=0, σ=3 (spread out)
    priors = [0.5, 0.5]

    print("Class 0: N(0, 1²) - tight distribution")
    print("Class 1: N(0, 3²) - spread distribution")
    print("Equal priors")
    print()

    test_points = [0, 1, 2, 3, 4, 5]
    print(f"{'x':>4} | {'f(x|0)':>10} | {'f(x|1)':>10} | {'P(0|x)':>8} | Predicted")
    print("-" * 55)
    for x in test_points:
        f0 = class_0.pdf(x)
        f1 = class_1.pdf(x)
        p0 = f0 * priors[0] / (f0 * priors[0] + f1 * priors[1])
        pred = 0 if p0 > 0.5 else 1
        print(f"{x:>4} | {f0:>10.6f} | {f1:>10.6f} | {p0:>8.4f} | Class {pred}")
    print("→ Near the shared mean, the tighter class (0) is preferred")
    print("→ Far from the mean, the spread class (1) is preferred")

demonstrate_parameter_effects()
```

One of the most practically useful properties of the Gaussian distribution is the empirical rule (or 68-95-99.7 rule), which quantifies how much probability lies within standard deviation intervals.
For any Gaussian distribution $X \sim \mathcal{N}(\mu, \sigma^2)$:
$$P(\mu - \sigma \leq X \leq \mu + \sigma) \approx 0.6827 \quad (\approx 68\%)$$
$$P(\mu - 2\sigma \leq X \leq \mu + 2\sigma) \approx 0.9545 \quad (\approx 95\%)$$
$$P(\mu - 3\sigma \leq X \leq \mu + 3\sigma) \approx 0.9973 \quad (\approx 99.7\%)$$
In Gaussian Naive Bayes, the empirical rule helps us understand feature contributions:
Within 1σ of the mean: These are "typical" values for the class. High density, strong evidence for this class.
Between 1σ and 2σ: Moderately unusual values. Still reasonable for this class, but less common.
Between 2σ and 3σ: Unusual values. Low density, weak evidence for this class.
Beyond 3σ: Extreme values. Very low density, the observed value is surprising if this class is correct.
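These zones can be expressed as a tiny helper; `evidence_zone` is a name we introduce here for illustration, not a library function:

```python
def evidence_zone(x, mu, sigma):
    """Map an observation to the rough evidence zones described above."""
    z = abs(x - mu) / sigma
    if z <= 1:
        return "typical (within 1σ)"
    if z <= 2:
        return "moderately unusual (1σ-2σ)"
    if z <= 3:
        return "unusual (2σ-3σ)"
    return "extreme (beyond 3σ)"

# Illustrative class with μ = 50, σ = 10
for x in [52, 63, 78, 95]:
    print(x, "→", evidence_zone(x, 50, 10))
```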
The Gaussian has light tails compared to some other distributions: its density decays like $e^{-x^2/2}$, faster than any exponential, whereas heavy-tailed families such as the Student's t or Cauchy decay only polynomially.
This has important implications: if we observe a feature value many standard deviations from a class mean, the Gaussian model assigns extremely low likelihood to that class.
The light tails of the Gaussian make it sensitive to outliers. A single extreme observation (e.g., 10σ from the mean) gets nearly zero likelihood, effectively vetoing that class regardless of other features. This is both a strength (extreme values are informative) and a weakness (measurement errors can dominate). Robust alternatives like the Student's t-distribution have heavier tails.
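To see the difference concretely, here is a small comparison of the two-sided tail mass beyond a cutoff $k$ for the standard normal and a Student's t with 3 degrees of freedom (the cutoffs and degrees of freedom are illustrative):

```python
from scipy import stats

# Two-sided tail mass beyond a cutoff k: Gaussian vs Student's t (df=3).
# The heavier t tails assign far more probability to extreme values,
# which is why t-based likelihoods are less easily vetoed by outliers.
for k in [2, 3, 5]:
    gauss_tail = 2 * stats.norm.sf(k)
    t_tail = 2 * stats.t.sf(k, df=3)
    print(f"k = {k}: Gaussian {gauss_tail:.2e}  vs  t(3) {t_tail:.2e}")
```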
```python
import numpy as np
from scipy import stats

def demonstrate_empirical_rule():
    """
    Verify and explore the 68-95-99.7 rule.
    """
    print("=" * 60)
    print("THE 68-95-99.7 RULE")
    print("=" * 60)

    # Standard normal for exact calculations
    Z = stats.norm(0, 1)

    # Verify the rule
    print("Exact probabilities for standard normal Z ~ N(0,1):")
    for k in [1, 2, 3, 4, 5, 6]:
        prob_within = Z.cdf(k) - Z.cdf(-k)
        prob_beyond = 1 - prob_within
        one_in = 1 / prob_beyond if prob_beyond > 0 else float('inf')
        print(f"P(|Z| ≤ {k}) = {prob_within:.6f} ({prob_within*100:.2f}%)")
        print(f"P(|Z| > {k}) = {prob_beyond:.10f} (1 in {one_in:,.0f})")
        print()

    # Classification implications
    print("=" * 60)
    print("CLASSIFICATION IMPLICATIONS")
    print("=" * 60)

    # Two classes with different means
    class_0 = stats.norm(50, 10)  # Class 0: μ=50, σ=10
    class_1 = stats.norm(70, 10)  # Class 1: μ=70, σ=10

    print("Class 0: N(50, 10²)")
    print("Class 1: N(70, 10²)")
    print("Equal priors")

    # Test observations at various distances from each mean
    print("\nHow many σ away from each class mean?")
    observations = [50, 55, 60, 65, 70, 75, 80, 30, 90]
    print(f"{'x':>5} | {'σ from μ₀':>10} | {'σ from μ₁':>10} | {'f(x|0)':>12} | {'f(x|1)':>12} | {'Ratio':>10}")
    print("-" * 75)
    for x in observations:
        sigma_0 = abs(x - 50) / 10
        sigma_1 = abs(x - 70) / 10
        f0 = class_0.pdf(x)
        f1 = class_1.pdf(x)
        ratio = f0 / f1 if f1 > 0 else float('inf')
        print(f"{x:>5} | {sigma_0:>10.1f} | {sigma_1:>10.1f} | {f0:>12.8f} | {f1:>12.8f} | {ratio:>10.4f}")

    # Outlier example
    print("\n" + "=" * 60)
    print("OUTLIER SENSITIVITY")
    print("=" * 60)

    # Extreme observation
    x_extreme = 20  # Very far from both class means
    sigma_0_extreme = abs(x_extreme - 50) / 10
    sigma_1_extreme = abs(x_extreme - 70) / 10
    f0_extreme = class_0.pdf(x_extreme)
    f1_extreme = class_1.pdf(x_extreme)

    print(f"Extreme observation: x = {x_extreme}")
    print(f"Distance from class 0 mean: {sigma_0_extreme:.1f}σ")
    print(f"Distance from class 1 mean: {sigma_1_extreme:.1f}σ")
    print(f"Density under class 0: {f0_extreme:.2e}")
    print(f"Density under class 1: {f1_extreme:.2e}")
    print("→ Both densities are extremely small!")
    print("→ Neither class 'explains' this observation well")
    print("→ Could indicate error, outlier, or unknown class")

demonstrate_empirical_rule()
```

In practical implementations of Gaussian Naive Bayes, we almost always work with log-densities rather than raw densities. This section explains why and develops the log-space mathematics.
When multiplying many small probabilities or densities, the product can become smaller than the smallest representable floating-point number:
$$f(\mathbf{x} | y=k) = \prod_{j=1}^{d} f(x_j | y=k)$$
For $d = 100$ features, each with density $\approx 0.01$: $$\prod_{j=1}^{100} 0.01 = 10^{-200}$$
This is still above the smallest positive double-precision float ($\approx 10^{-308}$), but not by much: with two or three hundred such features the product drops below that threshold, the computer rounds it to exactly zero, and classification becomes impossible. Practical precision is lost much earlier.
Working in log-space converts products to sums: $$\log f(\mathbf{x} | y=k) = \sum_{j=1}^{d} \log f(x_j | y=k)$$
Sums of logs are numerically stable even when the underlying densities are tiny.
The log of the Gaussian PDF is: $$\log f(x | \mu, \sigma^2) = -\frac{(x-\mu)^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2)$$
Or equivalently: $$\log f(x | \mu, \sigma^2) = -\frac{(x-\mu)^2}{2\sigma^2} - \frac{1}{2}\log(2\pi) - \log(\sigma)$$
Notice: the log-density is a downward-opening quadratic in $x$. The first term penalizes squared distance from the mean, scaled by the variance; the remaining terms do not depend on $x$ at all, only on $\sigma$.
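The log-density formula above can be transcribed directly and checked against scipy's `logpdf` (the test values are illustrative):

```python
import numpy as np
from scipy import stats

def gaussian_logpdf(x, mu, sigma_sq):
    # Direct transcription of the formula:
    # -((x - μ)²) / (2σ²) - ½ log(2πσ²)
    return -((x - mu)**2) / (2 * sigma_sq) - 0.5 * np.log(2 * np.pi * sigma_sq)

x = np.array([-3.0, 0.0, 1.5, 10.0])
manual = gaussian_logpdf(x, mu=1.0, sigma_sq=4.0)
scipy_ref = stats.norm(1.0, 2.0).logpdf(x)   # scipy takes σ, not σ²
print(np.max(np.abs(manual - scipy_ref)))    # essentially zero
```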
The quadratic nature of log-Gaussian density is key to understanding Gaussian Naive Bayes decision boundaries. Classification involves comparing sums of quadratics across classes. The geometry of these quadratics determines whether boundaries are linear or curved.
The classification decision is based on: $$\hat{y} = \arg\max_k \left[ \log P(y=k) + \sum_{j=1}^{d} \log f(x_j | y=k) \right]$$
Substituting the Gaussian log-density: $$\hat{y} = \arg\max_k \left[ \log \pi_k - \frac{1}{2} \sum_{j=1}^{d} \left( \frac{(x_j - \mu_{jk})^2}{\sigma^2_{jk}} + \log(2\pi\sigma^2_{jk}) \right) \right]$$
Since $\log(2\pi)$ is constant across classes, we can simplify: $$\hat{y} = \arg\max_k \left[ \log \pi_k - \frac{1}{2} \sum_{j=1}^{d} \left( \frac{(x_j - \mu_{jk})^2}{\sigma^2_{jk}} + \log \sigma^2_{jk} \right) \right]$$
The term $\sum_{j=1}^{d} \frac{(x_j - \mu_{jk})^2}{\sigma^2_{jk}}$ is the squared Mahalanobis distance from $\mathbf{x}$ to the class $k$ mean, under the diagonal covariance assumption.
Lower Mahalanobis distance means the observation is closer to the class center (in standardized units), leading to higher posterior probability for that class.
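A minimal sketch of this decision rule with illustrative parameters; the helper `gnb_score` is our own name, not a library function:

```python
import numpy as np

def gnb_score(x, pi_k, mu_k, var_k):
    """Per-class score: log π_k - ½ Σ_j [ (x_j - μ_jk)²/σ²_jk + log σ²_jk ].
    The class-independent ½·d·log(2π) term is dropped."""
    maha_sq = np.sum((x - mu_k)**2 / var_k)   # squared Mahalanobis distance (diagonal case)
    return np.log(pi_k) - 0.5 * (maha_sq + np.sum(np.log(var_k)))

# Illustrative 3-feature, 2-class problem
x = np.array([1.0, 2.0, 0.5])
params = [
    (0.6, np.array([0.0, 0.0, 0.0]), np.array([1.0, 1.0, 1.0])),  # class 0: prior, means, variances
    (0.4, np.array([1.0, 2.0, 1.0]), np.array([1.0, 1.0, 1.0])),  # class 1
]
scores = [gnb_score(x, pi, mu, var) for pi, mu, var in params]
print(int(np.argmax(scores)))  # → 1, since x lies near class 1's mean
```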
```python
import numpy as np
from scipy import stats

def compare_raw_vs_log_space():
    """
    Demonstrate why log-space computation is essential.
    """
    print("=" * 60)
    print("RAW DENSITY VS LOG-DENSITY COMPUTATION")
    print("=" * 60)

    # Setup: 400 features with small densities (enough terms that the
    # raw product genuinely underflows double precision)
    n_features = 400
    np.random.seed(42)

    # Generate random densities (simulating Gaussian PDF values)
    densities = np.random.uniform(0.05, 0.2, n_features)

    print(f"Number of features: {n_features}")
    print(f"Individual densities range: [{densities.min():.4f}, {densities.max():.4f}]")

    # Raw multiplication (will underflow)
    raw_product = 1.0
    for i, d in enumerate(densities):
        raw_product *= d
        if i % 100 == 99:
            print(f"After {i+1} multiplications: {raw_product:.2e}")
    print(f"Final raw product: {raw_product}")
    print("→ Underflowed to zero!")

    # Log-space addition (stable)
    log_sum = 0.0
    for d in densities:
        log_sum += np.log(d)
    print(f"Log-space sum: {log_sum:.2f}")
    print(f"(e^({log_sum:.0f}) is far below the smallest positive double ≈ 1e-308)")
    print("→ Log-space computation gives a meaningful result!")

    # Practical Gaussian example
    print("\n" + "=" * 60)
    print("PRACTICAL GAUSSIAN NAIVE BAYES EXAMPLE")
    print("=" * 60)

    # 50-dimensional problem
    n_dim = 50
    np.random.seed(42)

    # Class parameters
    mu_0 = np.zeros(n_dim)
    mu_1 = np.ones(n_dim) * 2
    sigma = np.ones(n_dim)  # Unit variance

    # Test point
    x = np.random.randn(n_dim) + 1  # Roughly equidistant from both classes

    # Raw computation
    raw_lik_0 = 1.0
    raw_lik_1 = 1.0
    for j in range(n_dim):
        raw_lik_0 *= stats.norm(mu_0[j], sigma[j]).pdf(x[j])
        raw_lik_1 *= stats.norm(mu_1[j], sigma[j]).pdf(x[j])

    print(f"{n_dim}-dimensional classification problem")
    print("Raw likelihoods (vanishingly small; a few hundred features would underflow):")
    print(f"  Class 0: {raw_lik_0}")
    print(f"  Class 1: {raw_lik_1}")

    # Log-space computation
    log_lik_0 = 0.0
    log_lik_1 = 0.0
    for j in range(n_dim):
        log_lik_0 += stats.norm(mu_0[j], sigma[j]).logpdf(x[j])
        log_lik_1 += stats.norm(mu_1[j], sigma[j]).logpdf(x[j])

    print("Log-likelihoods (stable):")
    print(f"  Class 0: {log_lik_0:.2f}")
    print(f"  Class 1: {log_lik_1:.2f}")

    # Compare for classification
    if log_lik_0 > log_lik_1:
        print(f"Prediction: Class 0 (log-likelihood {log_lik_0 - log_lik_1:.2f} higher)")
    else:
        print(f"Prediction: Class 1 (log-likelihood {log_lik_1 - log_lik_0:.2f} higher)")

    # The log-sum-exp trick for posterior probabilities
    print("\n" + "=" * 60)
    print("LOG-SUM-EXP TRICK FOR POSTERIORS")
    print("=" * 60)

    # To get P(y=0|x), we need: exp(log_lik_0) / (exp(log_lik_0) + exp(log_lik_1))
    # But exponentiating large-magnitude logs can overflow/underflow!
    # Log-sum-exp trick: factor out the maximum before exponentiating.
    max_log = max(log_lik_0, log_lik_1)
    log_norm = max_log + np.log(np.exp(log_lik_0 - max_log) + np.exp(log_lik_1 - max_log))
    log_post_0 = log_lik_0 - log_norm
    log_post_1 = log_lik_1 - log_norm

    print("Log-posteriors:")
    print(f"  log P(0|x) = {log_post_0:.4f}")
    print(f"  log P(1|x) = {log_post_1:.4f}")
    print("Posteriors:")
    print(f"  P(0|x) = {np.exp(log_post_0):.4f}")
    print(f"  P(1|x) = {np.exp(log_post_1):.4f}")
    print(f"  Sum = {np.exp(log_post_0) + np.exp(log_post_1):.6f}")

compare_raw_vs_log_space()
```

The Gaussian assumption is a modeling choice, not a universal truth. Understanding when it's justified, and when it might fail, is crucial for practical application.
1. Central Limit Theorem Effects
When a feature is the sum or average of many independent contributions, it tends toward Gaussian. Measurement errors, aggregate scores, and averaged sensor readings are classic examples of the Central Limit Theorem at work.
2. Natural Symmetry
Symmetric distributions around a central value often appear approximately Gaussian; the heights of human populations and thermal noise in electronics are familiar examples.
3. Unbounded Range
Gaussian is defined on $(-\infty, +\infty)$. Features that naturally range over the whole real line, or effectively do (differences, residuals, log-transformed quantities), are better candidates than tightly bounded ones.
4. Unimodal Distribution
Gaussian has a single peak. Features with one dominant mode fit better than multimodal features.
Gaussian Naive Bayes doesn't require exact Gaussianity—it only requires that the Gaussian provides a reasonable approximation. Many classification tasks are robust to moderate violations of the assumption. The key is that the Gaussian captures the essential behavior: central tendency, spread, and relative likelihood of different values.
1. Bounded Features
Features with strict bounds are not Gaussian: proportions confined to $[0, 1]$, counts that cannot go below zero, percentages capped at 100.
Gaussian assigns non-zero probability to impossible values.
2. Heavy Tails
Some distributions produce extreme values far more often than the Gaussian predicts; financial returns and network latencies are well-known examples, better described by heavy-tailed families such as the Student's t.
3. Skewed Distributions
Asymmetric distributions don't fit the symmetric Gaussian; strictly positive quantities such as incomes, durations, and reaction times are often strongly right-skewed.
4. Multimodal Distributions
Within-class distributions with multiple peaks usually signal hidden subpopulations, which a single Gaussian per class cannot represent.
5. Discretization or Truncation
Data with artificial boundaries or discretization effects, such as rounded measurements, clipped sensor readings, or values piled up at a detection limit, violate the smooth continuous Gaussian form.
Before applying Gaussian Naive Bayes, it's prudent to check whether the Gaussian assumption is reasonable for your features. Several diagnostic techniques can help.
1. Histograms
Overlay the empirical histogram with the fitted Gaussian PDF. Large discrepancies indicate poor fit.
2. Q-Q Plots (Quantile-Quantile)
Plot the sample quantiles against theoretical Gaussian quantiles. Gaussianity implies a straight line; systematic curvature indicates skewness, while an S-shape indicates tails heavier or lighter than Gaussian.
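The Q-Q check can also be summarized numerically without a plot: `scipy.stats.probplot` fits a line to the quantile-quantile points and returns its correlation coefficient $r$, which sits near 1 for Gaussian data (the samples below are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# probplot returns ((theoretical_q, sample_q), (slope, intercept, r));
# r close to 1 means the sample quantiles track the Gaussian quantiles.
(_, (_, _, r_gauss)) = stats.probplot(rng.normal(0, 1, 2000), dist="norm")
(_, (_, _, r_skew)) = stats.probplot(rng.lognormal(0, 1, 2000), dist="norm")

print(f"Gaussian sample:   r = {r_gauss:.4f}")   # very close to 1
print(f"Log-normal sample: r = {r_skew:.4f}")    # noticeably lower
```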
3. Box Plots
A box plot flags points beyond 1.5×IQR from the quartiles; a true Gaussian puts only about 0.7% of its mass out there, so many flagged points suggest heavy tails. Asymmetric boxes indicate skewness.
1. Shapiro-Wilk Test
Most powerful test for normality for small to moderate samples. Tests null hypothesis that data is normally distributed.
2. D'Agostino-Pearson Test
Combines skewness and kurtosis into a single test statistic. Good for larger samples.
3. Kolmogorov-Smirnov Test
Compares the empirical CDF to the theoretical Gaussian CDF. Less powerful than Shapiro-Wilk but applicable to larger samples. Note that plugging in $\mu$ and $\sigma$ estimated from the same data biases the standard KS p-values; the Lilliefors variant corrects for this.
Rejecting normality doesn't mean Gaussian NB will perform poorly! The tests are very sensitive to small deviations, especially with large samples. What matters for classification is whether the Gaussian approximation captures enough structure to distinguish classes, not whether it's statistically perfect. Always evaluate by classification accuracy, not just normality tests.
```python
import numpy as np
from scipy import stats

def diagnose_gaussianity(data, name="Feature"):
    """
    Comprehensive diagnostics for checking Gaussianity.
    """
    n = len(data)
    print("=" * 60)
    print(f"GAUSSIANITY DIAGNOSTICS: {name}")
    print("=" * 60)

    # Basic statistics
    mean = np.mean(data)
    std = np.std(data, ddof=1)
    skewness = stats.skew(data)
    kurtosis = stats.kurtosis(data)  # Excess kurtosis (0 for Gaussian)

    print(f"Sample size: {n}")
    print(f"Mean: {mean:.4f}")
    print(f"Std: {std:.4f}")
    print(f"Skewness: {skewness:.4f} (Gaussian: 0)")
    print(f"Excess kurtosis: {kurtosis:.4f} (Gaussian: 0)")

    # Interpretation
    if abs(skewness) < 0.5:
        skew_interpretation = "approximately symmetric"
    elif skewness > 0:
        skew_interpretation = "right-skewed"
    else:
        skew_interpretation = "left-skewed"

    if abs(kurtosis) < 1:
        kurt_interpretation = "approximately mesokurtic (Gaussian-like tails)"
    elif kurtosis > 0:
        kurt_interpretation = "leptokurtic (heavy tails)"
    else:
        kurt_interpretation = "platykurtic (light tails)"

    print("\nInterpretation:")
    print(f"  Shape: {skew_interpretation}")
    print(f"  Tails: {kurt_interpretation}")

    # Statistical tests
    print("\nStatistical Tests:")

    # Shapiro-Wilk (best for n < 5000)
    p_sw = None
    if n <= 5000:
        stat_sw, p_sw = stats.shapiro(data)
        print(f"  Shapiro-Wilk: W = {stat_sw:.4f}, p-value = {p_sw:.4f}")
        if p_sw < 0.05:
            print("    → Reject normality at α=0.05")
        else:
            print("    → Cannot reject normality at α=0.05")
    else:
        print("  Shapiro-Wilk: Skipped (n > 5000)")

    # D'Agostino-Pearson
    stat_dp, p_dp = stats.normaltest(data)
    print(f"  D'Agostino-Pearson: K² = {stat_dp:.4f}, p-value = {p_dp:.4f}")
    if p_dp < 0.05:
        print("    → Reject normality at α=0.05")
    else:
        print("    → Cannot reject normality at α=0.05")

    # Kolmogorov-Smirnov
    stat_ks, p_ks = stats.kstest(data, 'norm', args=(mean, std))
    print(f"  Kolmogorov-Smirnov: D = {stat_ks:.4f}, p-value = {p_ks:.4f}")
    if p_ks < 0.05:
        print("    → Reject normality at α=0.05")
    else:
        print("    → Cannot reject normality at α=0.05")

    # Empirical rule check
    print("\nEmpirical Rule Check (68-95-99.7):")
    for k, expected in [(1, 0.6827), (2, 0.9545), (3, 0.9973)]:
        within = np.mean(np.abs(data - mean) <= k * std)
        deviation = (within - expected) / expected * 100
        print(f"  Within {k}σ: {within:.4f} (expected {expected:.4f}, {deviation:+.1f}% deviation)")

    return {
        'skewness': skewness,
        'kurtosis': kurtosis,
        'shapiro_p': p_sw,
        'dagostino_p': p_dp
    }

# Test on different distributions
np.random.seed(42)
n = 1000

print("\n" + "=" * 60)
print("TESTING DIFFERENT DISTRIBUTIONS")
print("=" * 60)

# 1. True Gaussian
print("\n" + "-" * 60)
diagnose_gaussianity(np.random.normal(0, 1, n), "True Gaussian N(0,1)")

# 2. Slightly skewed (log-normal)
print("\n" + "-" * 60)
diagnose_gaussianity(np.random.lognormal(0, 0.3, n), "Log-Normal (mild skew)")

# 3. Heavy-tailed (Student's t)
print("\n" + "-" * 60)
diagnose_gaussianity(stats.t.rvs(df=3, size=n), "Student's t (df=3, heavy tails)")

# 4. Uniform (bounded)
print("\n" + "-" * 60)
diagnose_gaussianity(np.random.uniform(-1, 1, n), "Uniform (bounded)")

# 5. Mixture (bimodal)
print("\n" + "-" * 60)
mixture = np.concatenate([
    np.random.normal(-2, 0.5, n//2),
    np.random.normal(2, 0.5, n//2)
])
diagnose_gaussianity(mixture, "Mixture of two Gaussians (bimodal)")
```

We have developed a deep understanding of the Gaussian distribution and its role in Gaussian Naive Bayes.
What's next:
With the Gaussian distribution understood, we now turn to the practical task of estimating its parameters from data. The next page covers parameter estimation—how to compute the maximum likelihood estimates of means and variances from training samples, and how these estimates are used in classification.
You now have a comprehensive understanding of the Gaussian distribution and when the Gaussian assumption is appropriate. The key insight is that while exact Gaussianity is rarely achieved in practice, the Gaussian often provides a sufficient approximation for effective classification. Next, we learn how to estimate Gaussian parameters from training data.