The Gaussian distribution—the iconic bell curve—appears with remarkable frequency in nature, science, and engineering. From the distribution of measurement errors to the heights of human populations, from thermal noise in electronics to test scores in educational assessment, the Gaussian emerges again and again as the natural description of continuous phenomena.
In Gaussian Naive Bayes, we make a specific assumption: within each class, each feature follows a Gaussian distribution. This assumption transforms the abstract problem of density estimation into a simple parameter estimation problem. But to use this assumption wisely, we must deeply understand what it means.
What does it mean for a feature to be Gaussian? How do the parameters $\mu$ (mean) and $\sigma^2$ (variance) shape the distribution? When is this assumption reasonable, and when might it lead us astray? This page answers these questions with mathematical rigor and practical insight.
By the end of this page, you will understand: (1) the mathematical form of the Gaussian PDF and its key properties, (2) how $\mu$ and $\sigma^2$ control location and spread, (3) the 68-95-99.7 rule and standard normal form, (4) when the Gaussian assumption is justified, and (5) diagnostic techniques for checking Gaussianity.
The univariate Gaussian distribution is defined by its probability density function:
$$f(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
Alternatively written as: $$f(x | \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
We denote this distribution as $\mathcal{N}(\mu, \sigma^2)$ or $X \sim \mathcal{N}(\mu, \sigma^2)$.
The exponential term: $\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
This is the heart of the Gaussian. It's a quadratic function of $(x - \mu)$ inside a negative exponential: the density falls off symmetrically, and increasingly fast, as $x$ moves away from $\mu$.
The normalization constant: $\frac{1}{\sqrt{2\pi\sigma^2}}$
This ensures the density integrates to 1: $$\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) dx = 1$$
The $\sqrt{2\pi}$ comes from the Gaussian integral $\int_{-\infty}^{\infty} e^{-x^2/2} dx = \sqrt{2\pi}$, one of the most beautiful results in mathematics.
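As a sanity check, this identity is easy to confirm numerically; a minimal sketch using `scipy.integrate.quad`:

```python
import numpy as np
from scipy.integrate import quad

# Numerically evaluate the Gaussian integral over the whole real line
value, abserr = quad(lambda t: np.exp(-t**2 / 2), -np.inf, np.inf)

print(value)               # ≈ 2.5066282746
print(np.sqrt(2 * np.pi))  # the closed-form answer √(2π)
```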
The negative quadratic exponent creates the characteristic bell shape: rapidly increasing toward the mean, then rapidly decreasing away from it. Any positive quadratic function in the exponent would explode to infinity; only negative quadratics give integrable, well-behaved densities. This is why quadratic loss functions and Gaussian distributions are intimately connected.
```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def gaussian_pdf(x, mu, sigma_sq):
    """
    Compute Gaussian PDF manually to understand each component.
    """
    normalization = 1.0 / np.sqrt(2 * np.pi * sigma_sq)
    exponent = -((x - mu) ** 2) / (2 * sigma_sq)
    return normalization * np.exp(exponent)

# Verify our implementation matches scipy
x = np.linspace(-5, 5, 1000)
mu, sigma = 0, 1
pdf_manual = gaussian_pdf(x, mu, sigma**2)
pdf_scipy = stats.norm(mu, sigma).pdf(x)

print("=" * 60)
print("GAUSSIAN PDF VERIFICATION")
print("=" * 60)
print(f"Max difference between manual and scipy: {np.max(np.abs(pdf_manual - pdf_scipy)):.2e}")

# Demonstrate PDF properties
print("\n" + "=" * 60)
print("KEY PDF PROPERTIES")
print("=" * 60)

# Property 1: Maximum at the mean
mu, sigma = 3.0, 2.0
x_vals = [mu - 2*sigma, mu - sigma, mu, mu + sigma, mu + 2*sigma]
for x_val in x_vals:
    density = gaussian_pdf(x_val, mu, sigma**2)
    print(f"f({x_val:.1f}) = {density:.6f}")
print(f"Maximum is at x = μ = {mu}")

# Property 2: Symmetry
print("\nSymmetry check:")
for delta in [0.5, 1.0, 2.0]:
    left = gaussian_pdf(mu - delta, mu, sigma**2)
    right = gaussian_pdf(mu + delta, mu, sigma**2)
    print(f"f(μ - {delta}) = {left:.6f}, f(μ + {delta}) = {right:.6f}, Equal: {np.isclose(left, right)}")

# Property 3: Integration to 1
integral, _ = quad(lambda x: gaussian_pdf(x, mu, sigma**2), -np.inf, np.inf)
print(f"\nIntegral of PDF: {integral:.10f} (should be 1.0)")

# Property 4: Density can exceed 1 for small variance
print("\nDensity values for different variances (at x = μ):")
for sigma_val in [2.0, 1.0, 0.5, 0.3, 0.1]:
    density_at_mean = gaussian_pdf(mu, mu, sigma_val**2)
    print(f"σ = {sigma_val}: f(μ) = {density_at_mean:.4f}")
print("Note: Density > 1 is perfectly valid! Only AREA must equal 1.")

# Property 5: Tail behavior (Gaussian tails are light)
print("\nTail probabilities (how much probability beyond k standard deviations):")
normal = stats.norm(0, 1)
for k in [1, 2, 3, 4, 5]:
    tail_prob = 2 * (1 - normal.cdf(k))  # Two-tailed
    print(f"P(|X| > {k}σ) = {tail_prob:.6f} = 1 in {1/tail_prob:.0f}")
```

The Gaussian distribution is fully specified by just two parameters: the mean $\mu$ and variance $\sigma^2$. Understanding their roles geometrically and statistically is essential for Gaussian Naive Bayes.
The mean determines where the distribution is centered:
$$E[X] = \int_{-\infty}^{\infty} x \cdot f(x|\mu, \sigma^2) dx = \mu$$
Geometric interpretation: $\mu$ is the location of the peak and the axis of symmetry; shifting $\mu$ slides the entire curve along the x-axis without changing its shape.

In Gaussian Naive Bayes: each feature-class pair gets its own mean $\mu_{jk}$, estimated from the training data, representing the typical value of feature $j$ within class $k$.
The variance determines how spread out the distribution is:
$$\text{Var}[X] = E[(X - \mu)^2] = \int_{-\infty}^{\infty} (x - \mu)^2 \cdot f(x|\mu, \sigma^2) dx = \sigma^2$$
Geometric interpretation: $\sigma^2$ controls the spread; larger variance flattens and widens the bell, smaller variance concentrates the density into a sharp, narrow peak around $\mu$.

In Gaussian Naive Bayes: each feature-class pair gets its own variance $\sigma^2_{jk}$, capturing how variable feature $j$ is within class $k$; low-variance features contribute sharply peaked likelihoods.
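The two moment integrals above are easy to confirm numerically; a small sketch with the illustrative parameters $\mu = 3$, $\sigma^2 = 4$:

```python
import numpy as np
from scipy.integrate import quad

mu, sigma_sq = 3.0, 4.0  # illustrative parameters
pdf = lambda x: np.exp(-(x - mu)**2 / (2 * sigma_sq)) / np.sqrt(2 * np.pi * sigma_sq)

# E[X] = ∫ x f(x) dx should equal μ
mean, _ = quad(lambda x: x * pdf(x), -np.inf, np.inf)
# Var[X] = ∫ (x - μ)² f(x) dx should equal σ²
var, _ = quad(lambda x: (x - mu)**2 * pdf(x), -np.inf, np.inf)

print(mean)  # close to 3.0
print(var)   # close to 4.0
```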
| Parameter Change | Effect on PDF | Effect on Classification |
|---|---|---|
| Increase $\mu$ | Shift curve rightward | Move class decision region rightward |
| Decrease $\mu$ | Shift curve leftward | Move class decision region leftward |
| Increase $\sigma^2$ | Flatten and widen curve | Broaden contribution to posterior |
| Decrease $\sigma^2$ | Sharpen and narrow curve | Concentrate probability near mean |
The standard deviation $\sigma = \sqrt{\sigma^2}$ is often more interpretable than variance because it has the same units as the data:
The standard deviation directly measures typical deviation from the mean: $$\sigma = \sqrt{E[(X - \mu)^2]}$$
The standard normal distribution has $\mu = 0$ and $\sigma^2 = 1$: $$\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$$
Any Gaussian can be standardized to the standard normal: $$\text{If } X \sim \mathcal{N}(\mu, \sigma^2), \text{ then } Z = \frac{X - \mu}{\sigma} \sim \mathcal{N}(0, 1)$$
This transformation centers the data (subtract mean) and scales it (divide by standard deviation).
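A quick numerical sketch of standardization (sample size and parameters are illustrative): standardized samples have mean near 0 and standard deviation near 1, and the densities relate by $f_X(x) = \phi\!\left(\frac{x-\mu}{\sigma}\right)/\sigma$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma = 10.0, 2.5                 # illustrative parameters
x = rng.normal(mu, sigma, 100_000)

# Standardize: subtract the mean, divide by the standard deviation
z = (x - mu) / sigma
print(z.mean(), z.std())              # close to 0 and 1

# Densities relate by a factor of 1/σ: f_X(x) = φ((x - μ)/σ) / σ
x0 = 12.0
lhs = stats.norm(mu, sigma).pdf(x0)
rhs = stats.norm(0, 1).pdf((x0 - mu) / sigma) / sigma
print(lhs, rhs)                       # agree to machine precision
```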
```python
import numpy as np
from scipy import stats

def demonstrate_parameter_effects():
    """
    Show how μ and σ² affect the Gaussian distribution.
    """
    print("=" * 60)
    print("EFFECT OF PARAMETERS ON GAUSSIAN PDF")
    print("=" * 60)

    x_test = 5.0  # Test point

    # Effect of mean (μ)
    print("\n--- Effect of Mean (μ) ---")
    print(f"Evaluating density at x = {x_test}")
    sigma = 2.0
    for mu in [0, 3, 5, 7, 10]:
        density = stats.norm(mu, sigma).pdf(x_test)
        distance = abs(x_test - mu)
        print(f"μ = {mu:2d}: f({x_test}) = {density:.6f}, distance from peak = {distance:.1f}")
    print("→ Density is highest when x is closest to μ")

    # Effect of variance (σ²)
    print("\n--- Effect of Variance (σ²) ---")
    print(f"Evaluating density at x = {x_test}, with μ = {x_test}")
    mu = 5.0
    for sigma in [0.5, 1.0, 2.0, 4.0, 8.0]:
        density = stats.norm(mu, sigma).pdf(x_test)
        print(f"σ = {sigma:.1f}: f({x_test}) = {density:.6f}")
    print("→ Smaller σ gives higher peak density (more concentrated)")

    # But what about points away from the mean?
    print("\n--- Density at x = 3.0 (below mean of 5.0) ---")
    x_off = 3.0
    for sigma in [0.5, 1.0, 2.0, 4.0, 8.0]:
        density = stats.norm(mu, sigma).pdf(x_off)
        print(f"σ = {sigma:.1f}: f({x_off}) = {density:.6f}")
    print("→ Larger σ assigns more density to points far from mean")

    # This has classification implications
    print("\n" + "=" * 60)
    print("CLASSIFICATION IMPLICATIONS")
    print("=" * 60)

    # Two classes with the same mean but different variances
    class_0 = stats.norm(0, 1)  # Class 0: μ=0, σ=1 (tight)
    class_1 = stats.norm(0, 3)  # Class 1: μ=0, σ=3 (spread out)
    priors = [0.5, 0.5]

    print("Class 0: N(0, 1²) - tight distribution")
    print("Class 1: N(0, 3²) - spread distribution")
    print("Equal priors")
    print()

    test_points = [0, 1, 2, 3, 4, 5]
    print(f"{'x':>4} | {'f(x|0)':>10} | {'f(x|1)':>10} | {'P(0|x)':>8} | Predicted")
    print("-" * 55)
    for x in test_points:
        f0 = class_0.pdf(x)
        f1 = class_1.pdf(x)
        p0 = f0 * priors[0] / (f0 * priors[0] + f1 * priors[1])
        pred = 0 if p0 > 0.5 else 1
        print(f"{x:>4} | {f0:>10.6f} | {f1:>10.6f} | {p0:>8.4f} | Class {pred}")
    print("→ Near the shared mean, the tighter class (0) is preferred")
    print("→ Far from the mean, the spread class (1) is preferred")

demonstrate_parameter_effects()
```

One of the most practically useful properties of the Gaussian distribution is the empirical rule (or 68-95-99.7 rule), which quantifies how much probability lies within standard deviation intervals.
For any Gaussian distribution $X \sim \mathcal{N}(\mu, \sigma^2)$:
$$P(\mu - \sigma \leq X \leq \mu + \sigma) \approx 0.6827 \quad (\approx 68\%)$$
$$P(\mu - 2\sigma \leq X \leq \mu + 2\sigma) \approx 0.9545 \quad (\approx 95\%)$$
$$P(\mu - 3\sigma \leq X \leq \mu + 3\sigma) \approx 0.9973 \quad (\approx 99.7\%)$$
In Gaussian Naive Bayes, the empirical rule helps us understand feature contributions:
Within 1σ of the mean: These are "typical" values for the class. High density, strong evidence for this class.
Between 1σ and 2σ: Moderately unusual values. Still reasonable for this class, but less common.
Between 2σ and 3σ: Unusual values. Low density, weak evidence for this class.
Beyond 3σ: Extreme values. Very low density, the observed value is surprising if this class is correct.
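These zones can be expressed as a tiny helper; `evidence_zone` is a name we introduce here for illustration, not a library function:

```python
def evidence_zone(x, mu, sigma):
    """Map an observation to the rough evidence zones described above."""
    z = abs(x - mu) / sigma
    if z <= 1:
        return "typical (within 1σ)"
    if z <= 2:
        return "moderately unusual (1σ-2σ)"
    if z <= 3:
        return "unusual (2σ-3σ)"
    return "extreme (beyond 3σ)"

# Illustrative class with μ = 50, σ = 10
for x in [52, 63, 78, 95]:
    print(x, "→", evidence_zone(x, 50, 10))
```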
The Gaussian has light tails compared to some other distributions: its density decays like $e^{-x^2/2}$, faster than any exponential, whereas heavy-tailed families such as the Student's t or Cauchy decay only polynomially.
This has important implications: if we observe a feature value many standard deviations from a class mean, the Gaussian model assigns extremely low likelihood to that class.
The light tails of the Gaussian make it sensitive to outliers. A single extreme observation (e.g., 10σ from the mean) gets nearly zero likelihood, effectively vetoing that class regardless of other features. This is both a strength (extreme values are informative) and a weakness (measurement errors can dominate). Robust alternatives like the Student's t-distribution have heavier tails.
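To see the difference concretely, here is a small comparison of the two-sided tail mass beyond a cutoff $k$ for the standard normal and a Student's t with 3 degrees of freedom (the cutoffs and degrees of freedom are illustrative):

```python
from scipy import stats

# Two-sided tail mass beyond a cutoff k: Gaussian vs Student's t (df=3).
# The heavier t tails assign far more probability to extreme values,
# which is why t-based likelihoods are less easily vetoed by outliers.
for k in [2, 3, 5]:
    gauss_tail = 2 * stats.norm.sf(k)
    t_tail = 2 * stats.t.sf(k, df=3)
    print(f"k = {k}: Gaussian {gauss_tail:.2e}  vs  t(3) {t_tail:.2e}")
```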
```python
import numpy as np
from scipy import stats

def demonstrate_empirical_rule():
    """
    Verify and explore the 68-95-99.7 rule.
    """
    print("=" * 60)
    print("THE 68-95-99.7 RULE")
    print("=" * 60)

    # Standard normal for exact calculations
    Z = stats.norm(0, 1)

    # Verify the rule
    print("Exact probabilities for standard normal Z ~ N(0,1):")
    for k in [1, 2, 3, 4, 5, 6]:
        prob_within = Z.cdf(k) - Z.cdf(-k)
        prob_beyond = 1 - prob_within
        one_in = 1 / prob_beyond if prob_beyond > 0 else float('inf')
        print(f"P(|Z| ≤ {k}) = {prob_within:.6f} ({prob_within*100:.2f}%)")
        print(f"P(|Z| > {k}) = {prob_beyond:.10f} (1 in {one_in:,.0f})")
        print()

    # Classification implications
    print("=" * 60)
    print("CLASSIFICATION IMPLICATIONS")
    print("=" * 60)

    # Two classes with different means
    class_0 = stats.norm(50, 10)  # Class 0: μ=50, σ=10
    class_1 = stats.norm(70, 10)  # Class 1: μ=70, σ=10

    print("Class 0: N(50, 10²)")
    print("Class 1: N(70, 10²)")
    print("Equal priors")

    # Test observations at various distances from each mean
    print("\nHow many σ away from each class mean?")
    observations = [50, 55, 60, 65, 70, 75, 80, 30, 90]
    print(f"{'x':>5} | {'σ from μ₀':>10} | {'σ from μ₁':>10} | {'f(x|0)':>12} | {'f(x|1)':>12} | {'Ratio':>10}")
    print("-" * 75)
    for x in observations:
        sigma_0 = abs(x - 50) / 10
        sigma_1 = abs(x - 70) / 10
        f0 = class_0.pdf(x)
        f1 = class_1.pdf(x)
        ratio = f0 / f1 if f1 > 0 else float('inf')
        print(f"{x:>5} | {sigma_0:>10.1f} | {sigma_1:>10.1f} | {f0:>12.8f} | {f1:>12.8f} | {ratio:>10.4f}")

    # Outlier example
    print("\n" + "=" * 60)
    print("OUTLIER SENSITIVITY")
    print("=" * 60)

    # Extreme observation
    x_extreme = 20  # Very far from both class means
    sigma_0_extreme = abs(x_extreme - 50) / 10
    sigma_1_extreme = abs(x_extreme - 70) / 10
    f0_extreme = class_0.pdf(x_extreme)
    f1_extreme = class_1.pdf(x_extreme)

    print(f"Extreme observation: x = {x_extreme}")
    print(f"Distance from class 0 mean: {sigma_0_extreme:.1f}σ")
    print(f"Distance from class 1 mean: {sigma_1_extreme:.1f}σ")
    print(f"Density under class 0: {f0_extreme:.2e}")
    print(f"Density under class 1: {f1_extreme:.2e}")
    print("→ Both densities are extremely small!")
    print("→ Neither class 'explains' this observation well")
    print("→ Could indicate error, outlier, or unknown class")

demonstrate_empirical_rule()
```

In practical implementations of Gaussian Naive Bayes, we almost always work with log-densities rather than raw densities. This section explains why and develops the log-space mathematics.
When multiplying many small probabilities or densities, the product can become smaller than the smallest representable floating-point number:
$$f(\mathbf{x} | y=k) = \prod_{j=1}^{d} f(x_j | y=k)$$
For $d = 100$ features, each with density $\approx 0.01$: $$\prod_{j=1}^{100} 0.01 = 10^{-200}$$
This is still above the smallest positive double-precision float ($\approx 10^{-308}$), but not by much: with two or three hundred such features the product drops below that threshold, the computer rounds it to exactly zero, and classification becomes impossible. Practical precision is lost much earlier.
Working in log-space converts products to sums: $$\log f(\mathbf{x} | y=k) = \sum_{j=1}^{d} \log f(x_j | y=k)$$
Sums of logs are numerically stable even when the underlying densities are tiny.
The log of the Gaussian PDF is: $$\log f(x | \mu, \sigma^2) = -\frac{(x-\mu)^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2)$$
Or equivalently: $$\log f(x | \mu, \sigma^2) = -\frac{(x-\mu)^2}{2\sigma^2} - \frac{1}{2}\log(2\pi) - \log(\sigma)$$
Notice: the log-density is a downward-opening quadratic in $x$. The first term penalizes squared distance from the mean, scaled by the variance; the remaining terms do not depend on $x$ at all, only on $\sigma$.
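The log-density formula above can be transcribed directly and checked against scipy's `logpdf` (the test values are illustrative):

```python
import numpy as np
from scipy import stats

def gaussian_logpdf(x, mu, sigma_sq):
    # Direct transcription of the formula:
    # -((x - μ)²) / (2σ²) - ½ log(2πσ²)
    return -((x - mu)**2) / (2 * sigma_sq) - 0.5 * np.log(2 * np.pi * sigma_sq)

x = np.array([-3.0, 0.0, 1.5, 10.0])
manual = gaussian_logpdf(x, mu=1.0, sigma_sq=4.0)
scipy_ref = stats.norm(1.0, 2.0).logpdf(x)   # scipy takes σ, not σ²
print(np.max(np.abs(manual - scipy_ref)))    # essentially zero
```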
The quadratic nature of log-Gaussian density is key to understanding Gaussian Naive Bayes decision boundaries. Classification involves comparing sums of quadratics across classes. The geometry of these quadratics determines whether boundaries are linear or curved.
The classification decision is based on: $$\hat{y} = \arg\max_k \left[ \log P(y=k) + \sum_{j=1}^{d} \log f(x_j | y=k) \right]$$
Substituting the Gaussian log-density: $$\hat{y} = \arg\max_k \left[ \log \pi_k - \frac{1}{2} \sum_{j=1}^{d} \left( \frac{(x_j - \mu_{jk})^2}{\sigma^2_{jk}} + \log(2\pi\sigma^2_{jk}) \right) \right]$$
Since $\log(2\pi)$ is constant across classes, we can simplify: $$\hat{y} = \arg\max_k \left[ \log \pi_k - \frac{1}{2} \sum_{j=1}^{d} \left( \frac{(x_j - \mu_{jk})^2}{\sigma^2_{jk}} + \log \sigma^2_{jk} \right) \right]$$
The term $\sum_{j=1}^{d} \frac{(x_j - \mu_{jk})^2}{\sigma^2_{jk}}$ is the squared Mahalanobis distance from $\mathbf{x}$ to the class $k$ mean, under the diagonal covariance assumption.
Lower Mahalanobis distance means the observation is closer to the class center (in standardized units), leading to higher posterior probability for that class.
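A minimal sketch of this decision rule with illustrative parameters; the helper `gnb_score` is our own name, not a library function:

```python
import numpy as np

def gnb_score(x, pi_k, mu_k, var_k):
    """Per-class score: log π_k - ½ Σ_j [ (x_j - μ_jk)²/σ²_jk + log σ²_jk ].
    The class-independent ½·d·log(2π) term is dropped."""
    maha_sq = np.sum((x - mu_k)**2 / var_k)   # squared Mahalanobis distance (diagonal case)
    return np.log(pi_k) - 0.5 * (maha_sq + np.sum(np.log(var_k)))

# Illustrative 3-feature, 2-class problem
x = np.array([1.0, 2.0, 0.5])
params = [
    (0.6, np.array([0.0, 0.0, 0.0]), np.array([1.0, 1.0, 1.0])),  # class 0: prior, means, variances
    (0.4, np.array([1.0, 2.0, 1.0]), np.array([1.0, 1.0, 1.0])),  # class 1
]
scores = [gnb_score(x, pi, mu, var) for pi, mu, var in params]
print(int(np.argmax(scores)))  # → 1, since x lies near class 1's mean
```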
```python
import numpy as np
from scipy import stats

def compare_raw_vs_log_space():
    """
    Demonstrate why log-space computation is essential.
    """
    print("=" * 60)
    print("RAW DENSITY VS LOG-DENSITY COMPUTATION")
    print("=" * 60)

    # Setup: 400 features with small densities (enough terms that the
    # raw product genuinely underflows double precision)
    n_features = 400
    np.random.seed(42)

    # Generate random densities (simulating Gaussian PDF values)
    densities = np.random.uniform(0.05, 0.2, n_features)

    print(f"Number of features: {n_features}")
    print(f"Individual densities range: [{densities.min():.4f}, {densities.max():.4f}]")

    # Raw multiplication (will underflow)
    raw_product = 1.0
    for i, d in enumerate(densities):
        raw_product *= d
        if i % 100 == 99:
            print(f"After {i+1} multiplications: {raw_product:.2e}")
    print(f"Final raw product: {raw_product}")
    print("→ Underflowed to zero!")

    # Log-space addition (stable)
    log_sum = 0.0
    for d in densities:
        log_sum += np.log(d)
    print(f"Log-space sum: {log_sum:.2f}")
    print(f"(e^({log_sum:.0f}) is far below the smallest positive double ≈ 1e-308)")
    print("→ Log-space computation gives a meaningful result!")

    # Practical Gaussian example
    print("\n" + "=" * 60)
    print("PRACTICAL GAUSSIAN NAIVE BAYES EXAMPLE")
    print("=" * 60)

    # 50-dimensional problem
    n_dim = 50
    np.random.seed(42)

    # Class parameters
    mu_0 = np.zeros(n_dim)
    mu_1 = np.ones(n_dim) * 2
    sigma = np.ones(n_dim)  # Unit variance

    # Test point
    x = np.random.randn(n_dim) + 1  # Roughly equidistant from both classes

    # Raw computation
    raw_lik_0 = 1.0
    raw_lik_1 = 1.0
    for j in range(n_dim):
        raw_lik_0 *= stats.norm(mu_0[j], sigma[j]).pdf(x[j])
        raw_lik_1 *= stats.norm(mu_1[j], sigma[j]).pdf(x[j])

    print(f"{n_dim}-dimensional classification problem")
    print("Raw likelihoods (vanishingly small; a few hundred features would underflow):")
    print(f"  Class 0: {raw_lik_0}")
    print(f"  Class 1: {raw_lik_1}")

    # Log-space computation
    log_lik_0 = 0.0
    log_lik_1 = 0.0
    for j in range(n_dim):
        log_lik_0 += stats.norm(mu_0[j], sigma[j]).logpdf(x[j])
        log_lik_1 += stats.norm(mu_1[j], sigma[j]).logpdf(x[j])

    print("Log-likelihoods (stable):")
    print(f"  Class 0: {log_lik_0:.2f}")
    print(f"  Class 1: {log_lik_1:.2f}")

    # Compare for classification
    if log_lik_0 > log_lik_1:
        print(f"Prediction: Class 0 (log-likelihood {log_lik_0 - log_lik_1:.2f} higher)")
    else:
        print(f"Prediction: Class 1 (log-likelihood {log_lik_1 - log_lik_0:.2f} higher)")

    # The log-sum-exp trick for posterior probabilities
    print("\n" + "=" * 60)
    print("LOG-SUM-EXP TRICK FOR POSTERIORS")
    print("=" * 60)

    # To get P(y=0|x), we need: exp(log_lik_0) / (exp(log_lik_0) + exp(log_lik_1))
    # But exponentiating large-magnitude logs can overflow/underflow!
    # Log-sum-exp trick: factor out the maximum before exponentiating.
    max_log = max(log_lik_0, log_lik_1)
    log_norm = max_log + np.log(np.exp(log_lik_0 - max_log) + np.exp(log_lik_1 - max_log))
    log_post_0 = log_lik_0 - log_norm
    log_post_1 = log_lik_1 - log_norm

    print("Log-posteriors:")
    print(f"  log P(0|x) = {log_post_0:.4f}")
    print(f"  log P(1|x) = {log_post_1:.4f}")
    print("Posteriors:")
    print(f"  P(0|x) = {np.exp(log_post_0):.4f}")
    print(f"  P(1|x) = {np.exp(log_post_1):.4f}")
    print(f"  Sum = {np.exp(log_post_0) + np.exp(log_post_1):.6f}")

compare_raw_vs_log_space()
```

The Gaussian assumption is a modeling choice, not a universal truth. Understanding when it's justified, and when it might fail, is crucial for practical application.
1. Central Limit Theorem Effects
When a feature is the sum or average of many independent contributions, it tends toward Gaussian. Measurement errors, aggregate scores, and averaged sensor readings are classic examples of the Central Limit Theorem at work.
2. Natural Symmetry
Symmetric distributions around a central value often appear approximately Gaussian; the heights of human populations and thermal noise in electronics are familiar examples.
3. Unbounded Range
Gaussian is defined on $(-\infty, +\infty)$. Features that naturally range over the whole real line, or effectively do (differences, residuals, log-transformed quantities), are better candidates than tightly bounded ones.
4. Unimodal Distribution
Gaussian has a single peak. Features with one dominant mode fit better than multimodal features.
Gaussian Naive Bayes doesn't require exact Gaussianity—it only requires that the Gaussian provides a reasonable approximation. Many classification tasks are robust to moderate violations of the assumption. The key is that the Gaussian captures the essential behavior: central tendency, spread, and relative likelihood of different values.
1. Bounded Features
Features with strict bounds are not Gaussian: proportions confined to $[0, 1]$, counts that cannot go below zero, percentages capped at 100.
Gaussian assigns non-zero probability to impossible values.
2. Heavy Tails
Some distributions produce extreme values far more often than the Gaussian predicts; financial returns and network latencies are well-known examples, better described by heavy-tailed families such as the Student's t.
3. Skewed Distributions
Asymmetric distributions don't fit the symmetric Gaussian; strictly positive quantities such as incomes, durations, and reaction times are often strongly right-skewed.
4. Multimodal Distributions
Within-class distributions with multiple peaks usually signal hidden subpopulations, which a single Gaussian per class cannot represent.
5. Discretization or Truncation
Data with artificial boundaries or discretization effects, such as rounded measurements, clipped sensor readings, or values piled up at a detection limit, violate the smooth continuous Gaussian form.
Before applying Gaussian Naive Bayes, it's prudent to check whether the Gaussian assumption is reasonable for your features. Several diagnostic techniques can help.
1. Histograms
Overlay the empirical histogram with the fitted Gaussian PDF. Large discrepancies indicate poor fit.
2. Q-Q Plots (Quantile-Quantile)
Plot the sample quantiles against theoretical Gaussian quantiles. Gaussianity implies a straight line; systematic curvature indicates skewness, while an S-shape indicates tails heavier or lighter than Gaussian.
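The Q-Q check can also be summarized numerically without a plot: `scipy.stats.probplot` fits a line to the quantile-quantile points and returns its correlation coefficient $r$, which sits near 1 for Gaussian data (the samples below are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# probplot returns ((theoretical_q, sample_q), (slope, intercept, r));
# r close to 1 means the sample quantiles track the Gaussian quantiles.
(_, (_, _, r_gauss)) = stats.probplot(rng.normal(0, 1, 2000), dist="norm")
(_, (_, _, r_skew)) = stats.probplot(rng.lognormal(0, 1, 2000), dist="norm")

print(f"Gaussian sample:   r = {r_gauss:.4f}")   # very close to 1
print(f"Log-normal sample: r = {r_skew:.4f}")    # noticeably lower
```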
3. Box Plots
A box plot flags points beyond 1.5×IQR from the quartiles; a true Gaussian puts only about 0.7% of its mass out there, so many flagged points suggest heavy tails. Asymmetric boxes indicate skewness.
1. Shapiro-Wilk Test
Most powerful test for normality for small to moderate samples. Tests null hypothesis that data is normally distributed.
2. D'Agostino-Pearson Test
Combines skewness and kurtosis into a single test statistic. Good for larger samples.
3. Kolmogorov-Smirnov Test
Compares the empirical CDF to the theoretical Gaussian CDF. Less powerful than Shapiro-Wilk but applicable to larger samples. Note that plugging in $\mu$ and $\sigma$ estimated from the same data biases the standard KS p-values; the Lilliefors variant corrects for this.
Rejecting normality doesn't mean Gaussian NB will perform poorly! The tests are very sensitive to small deviations, especially with large samples. What matters for classification is whether the Gaussian approximation captures enough structure to distinguish classes, not whether it's statistically perfect. Always evaluate by classification accuracy, not just normality tests.
```python
import numpy as np
from scipy import stats

def diagnose_gaussianity(data, name="Feature"):
    """
    Comprehensive diagnostics for checking Gaussianity.
    """
    n = len(data)
    print("=" * 60)
    print(f"GAUSSIANITY DIAGNOSTICS: {name}")
    print("=" * 60)

    # Basic statistics
    mean = np.mean(data)
    std = np.std(data, ddof=1)
    skewness = stats.skew(data)
    kurtosis = stats.kurtosis(data)  # Excess kurtosis (0 for Gaussian)

    print(f"Sample size: {n}")
    print(f"Mean: {mean:.4f}")
    print(f"Std: {std:.4f}")
    print(f"Skewness: {skewness:.4f} (Gaussian: 0)")
    print(f"Excess kurtosis: {kurtosis:.4f} (Gaussian: 0)")

    # Interpretation
    if abs(skewness) < 0.5:
        skew_interpretation = "approximately symmetric"
    elif skewness > 0:
        skew_interpretation = "right-skewed"
    else:
        skew_interpretation = "left-skewed"

    if abs(kurtosis) < 1:
        kurt_interpretation = "approximately mesokurtic (Gaussian-like tails)"
    elif kurtosis > 0:
        kurt_interpretation = "leptokurtic (heavy tails)"
    else:
        kurt_interpretation = "platykurtic (light tails)"

    print("\nInterpretation:")
    print(f"  Shape: {skew_interpretation}")
    print(f"  Tails: {kurt_interpretation}")

    # Statistical tests
    print("\nStatistical Tests:")

    # Shapiro-Wilk (best for n < 5000)
    p_sw = None
    if n <= 5000:
        stat_sw, p_sw = stats.shapiro(data)
        print(f"  Shapiro-Wilk: W = {stat_sw:.4f}, p-value = {p_sw:.4f}")
        if p_sw < 0.05:
            print("    → Reject normality at α=0.05")
        else:
            print("    → Cannot reject normality at α=0.05")
    else:
        print("  Shapiro-Wilk: Skipped (n > 5000)")

    # D'Agostino-Pearson
    stat_dp, p_dp = stats.normaltest(data)
    print(f"  D'Agostino-Pearson: K² = {stat_dp:.4f}, p-value = {p_dp:.4f}")
    if p_dp < 0.05:
        print("    → Reject normality at α=0.05")
    else:
        print("    → Cannot reject normality at α=0.05")

    # Kolmogorov-Smirnov
    stat_ks, p_ks = stats.kstest(data, 'norm', args=(mean, std))
    print(f"  Kolmogorov-Smirnov: D = {stat_ks:.4f}, p-value = {p_ks:.4f}")
    if p_ks < 0.05:
        print("    → Reject normality at α=0.05")
    else:
        print("    → Cannot reject normality at α=0.05")

    # Empirical rule check
    print("\nEmpirical Rule Check (68-95-99.7):")
    for k, expected in [(1, 0.6827), (2, 0.9545), (3, 0.9973)]:
        within = np.mean(np.abs(data - mean) <= k * std)
        deviation = (within - expected) / expected * 100
        print(f"  Within {k}σ: {within:.4f} (expected {expected:.4f}, {deviation:+.1f}% deviation)")

    return {
        'skewness': skewness,
        'kurtosis': kurtosis,
        'shapiro_p': p_sw,
        'dagostino_p': p_dp
    }

# Test on different distributions
np.random.seed(42)
n = 1000

print("\n" + "=" * 60)
print("TESTING DIFFERENT DISTRIBUTIONS")
print("=" * 60)

# 1. True Gaussian
print("\n" + "-" * 60)
diagnose_gaussianity(np.random.normal(0, 1, n), "True Gaussian N(0,1)")

# 2. Slightly skewed (log-normal)
print("\n" + "-" * 60)
diagnose_gaussianity(np.random.lognormal(0, 0.3, n), "Log-Normal (mild skew)")

# 3. Heavy-tailed (Student's t)
print("\n" + "-" * 60)
diagnose_gaussianity(stats.t.rvs(df=3, size=n), "Student's t (df=3, heavy tails)")

# 4. Uniform (bounded)
print("\n" + "-" * 60)
diagnose_gaussianity(np.random.uniform(-1, 1, n), "Uniform (bounded)")

# 5. Mixture (bimodal)
print("\n" + "-" * 60)
mixture = np.concatenate([
    np.random.normal(-2, 0.5, n//2),
    np.random.normal(2, 0.5, n//2)
])
diagnose_gaussianity(mixture, "Mixture of two Gaussians (bimodal)")
```

We have developed a deep understanding of the Gaussian distribution and its role in Gaussian Naive Bayes.
What's next:
With the Gaussian distribution understood, we now turn to the practical task of estimating its parameters from data. The next page covers parameter estimation—how to compute the maximum likelihood estimates of means and variances from training samples, and how these estimates are used in classification.
You now have a comprehensive understanding of the Gaussian distribution and when the Gaussian assumption is appropriate. The key insight is that while exact Gaussianity is rarely achieved in practice, the Gaussian often provides a sufficient approximation for effective classification. Next, we learn how to estimate Gaussian parameters from training data.