If you could learn only one probability distribution, it should be the Gaussian distribution—also known as the Normal distribution. Named after Carl Friedrich Gauss (1777–1855), who used it to analyze astronomical data, this distribution appears with remarkable ubiquity across science, engineering, and machine learning.
Why is the Gaussian so special? The answer lies in the Central Limit Theorem: the sum of many independent random variables (regardless of their original distributions) tends toward a Gaussian. This explains why measurement errors, biological traits, stock returns, and countless other phenomena follow bell-shaped curves—they are aggregates of many small, independent effects.
In machine learning, the Gaussian is foundational: it underlies least-squares regression, Gaussian naive Bayes, Gaussian processes, and the latent spaces of variational autoencoders.
Mastering the Gaussian distribution is essential for any serious machine learning practitioner.
By the end of this page, you will understand the Gaussian distribution from first principles: its mathematical definition, key properties, the standard normal and z-scores, the Central Limit Theorem, parameter estimation via MLE, and its pervasive applications throughout machine learning.
A continuous random variable X follows a Gaussian (Normal) distribution with mean μ and variance σ², written X ~ N(μ, σ²), if its probability density function (PDF) is:
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
Equivalently, using the standard notation:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$
where μ is the mean, σ² is the variance, and σ = √σ² is the standard deviation.
The Gaussian PDF has two parts: (1) The normalizing constant 1/(σ√(2π)) ensures the total area under the curve equals 1. (2) The exponential term exp(-(x-μ)²/(2σ²)) creates the characteristic bell shape, with the quadratic in the exponent producing symmetry around μ and decay as x moves away from μ.
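These two parts can be checked directly in code—a small sketch (the parameters μ = 2, σ = 1.5 are arbitrary) comparing the hand-written PDF against scipy and confirming unit area:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

mu, sigma = 2.0, 1.5

def gaussian_pdf(x, mu, sigma):
    # (1) normalizing constant ensures the total area equals 1
    norm_const = 1.0 / (sigma * np.sqrt(2.0 * np.pi))
    # (2) exponential term produces the bell shape around μ
    return norm_const * np.exp(-0.5 * ((x - mu) / sigma) ** 2)

# Agrees with scipy's implementation at an arbitrary point
assert np.isclose(gaussian_pdf(3.7, mu, sigma), stats.norm.pdf(3.7, mu, sigma))

# Total area under the curve is 1
area, _ = quad(gaussian_pdf, -np.inf, np.inf, args=(mu, sigma))
print(f"Total area under the PDF: {area:.6f}")
```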
The Gaussian is uniquely characterized by several mathematical properties:
1. Maximum Entropy: Among all distributions with specified mean μ and variance σ², the Gaussian has maximum entropy. It makes the fewest assumptions beyond the first two moments.
$$H(X) = \frac{1}{2} \ln(2\pi e \sigma^2)$$
2. Central Limit Theorem Limit: The sum of many independent random variables converges to a Gaussian.
3. Stability Under Linear Transformations: If X ~ N(μ, σ²), then aX + b ~ N(aμ + b, a²σ²).
4. Closure Under Convolution: The sum of independent Gaussians is Gaussian.
5. Zero Correlation Implies Independence: For jointly Gaussian variables, uncorrelated implies independent (unique among distributions).
These properties make the Gaussian mathematically tractable and theoretically justified as a default assumption when we know only the mean and variance.
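Property 3 is easy to verify by simulation. A quick sketch (parameter values arbitrary) that scales and shifts Gaussian samples and compares the empirical moments with the formula:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 5.0, 2.0
a, b = 3.0, -1.0

x = rng.normal(mu, sigma, 1_000_000)
y = a * x + b  # property 3: aX + b ~ N(aμ + b, a²σ²)

print(f"mean(Y) ≈ {y.mean():.3f} (theory: {a * mu + b})")
print(f"var(Y)  ≈ {y.var():.3f} (theory: {a**2 * sigma**2})")
```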
| Property | Formula/Value | Interpretation |
|---|---|---|
| Support | x ∈ (-∞, +∞) | Unbounded; assigns small probability to extreme values |
| Mean | E[X] = μ | Center of the distribution |
| Variance | Var(X) = σ² | Spread around the mean |
| Mode | μ | Most likely value (peak of PDF) |
| Median | μ | 50th percentile |
| Skewness | 0 | Perfectly symmetric around μ |
| Kurtosis | 3 (excess = 0) | Reference for 'normal' tail behavior |
| MGF | exp(μt + σ²t²/2) | Moment generating function |
| Entropy | ½ ln(2πeσ²) | Differential entropy (in nats) |
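Most of the tabulated values can be confirmed against scipy's frozen-distribution API; a brief check with arbitrary parameters:

```python
import numpy as np
from scipy import stats

mu, sigma = 10.0, 3.0
X = stats.norm(mu, sigma)

# Mean, variance, skewness, excess kurtosis
m, v, skew, kurt = X.stats(moments='mvsk')
assert np.isclose(m, mu) and np.isclose(v, sigma**2)
assert np.isclose(skew, 0.0) and np.isclose(kurt, 0.0)

# Mean, median, and mode all coincide at μ
assert np.isclose(X.median(), mu)

# Differential entropy: ½ ln(2πeσ²), in nats
assert np.isclose(X.entropy(), 0.5 * np.log(2 * np.pi * np.e * sigma**2))
print("All tabulated properties verified against scipy.")
```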
The Standard Normal Distribution is the special case with μ = 0 and σ² = 1, denoted Z ~ N(0, 1).
PDF of Standard Normal:
$$\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$$
CDF of Standard Normal:
$$\Phi(z) = \int_{-\infty}^{z} \phi(t) \, dt = \frac{1}{2}\left[1 + \text{erf}\left(\frac{z}{\sqrt{2}}\right)\right]$$
where erf is the error function.
Any Gaussian can be converted to the standard normal via standardization:
If X ~ N(μ, σ²), then:
$$Z = \frac{X - \mu}{\sigma} \sim N(0, 1)$$
Conversely, if Z ~ N(0, 1):
$$X = \mu + \sigma Z \sim N(\mu, \sigma^2)$$
The z-score tells us how many standard deviations a value is from the mean.
For any normal distribution:
| Interval | Probability | Interpretation |
|---|---|---|
| μ ± 1σ | 68.27% | About 2/3 of data falls within 1 SD |
| μ ± 2σ | 95.45% | About 95% within 2 SDs |
| μ ± 3σ | 99.73% | Nearly all within 3 SDs |
| μ ± 4σ | 99.994% | Extreme outliers beyond 4 SDs |
This rule provides quick mental estimates. If you see a value more than 3 standard deviations from the mean, it's exceptionally rare under normality assumptions.
```python
import numpy as np
from scipy import stats

# Standard normal distribution
standard_normal = stats.norm(loc=0, scale=1)

# Key quantiles
print("Standard Normal Distribution Z ~ N(0,1)")
print("=" * 50)

# Probability within k standard deviations
for k in [1, 2, 3, 4]:
    prob = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"P(-{k} < Z < {k}) = {prob:.6f} ({prob*100:.3f}%)")

print()

# Critical values for confidence intervals
confidence_levels = [0.90, 0.95, 0.99]
print("Critical z-values for two-tailed confidence intervals:")
for cl in confidence_levels:
    alpha = 1 - cl
    z_critical = standard_normal.ppf(1 - alpha/2)
    print(f"  {cl*100:.0f}% CI: z = ±{z_critical:.4f}")

print()

# Converting between X ~ N(μ,σ²) and Z ~ N(0,1)
mu, sigma = 100, 15  # Example: IQ scores
x = 130  # A specific IQ score

z = (x - mu) / sigma  # Standardize
prob_below = standard_normal.cdf(z)
print(f"IQ scores: X ~ N({mu}, {sigma}²)")
print(f"IQ = {x} → z-score = {z:.2f}")
print(f"P(IQ < {x}) = P(Z < {z:.2f}) = {prob_below:.4f}")
print(f"Percentile: {prob_below*100:.1f}th")
```

The standard normal serves as a universal reference. Any question about a Gaussian N(μ,σ²) can be answered by standardizing to N(0,1) and using precomputed tables or functions. This is why software libraries need only implement the standard normal—all other Gaussians follow from linear transformation.
The Central Limit Theorem (CLT) is one of the most profound results in probability theory, explaining why the Gaussian distribution is so prevalent. It states that the sum (or average) of many independent random variables tends toward a Gaussian, regardless of the original distributions.
Let X₁, X₂, ..., Xₙ be independent and identically distributed (i.i.d.) random variables with mean E[Xᵢ] = μ and finite variance Var(Xᵢ) = σ² < ∞.
Define the sample mean: $$\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$$
Then as n → ∞:
$$\sqrt{n}\left(\bar{X}_n - \mu\right) \stackrel{d}{\rightarrow} N(0, \sigma^2)$$
Equivalently, the standardized sum converges to a standard normal:
$$Z_n = \frac{\sum_{i=1}^n X_i - n\mu}{\sigma\sqrt{n}} = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \stackrel{d}{\rightarrow} N(0, 1)$$
The CLT is remarkable because it doesn't matter what distribution the Xᵢ come from—uniform, exponential, Poisson, Bernoulli, or any other with finite variance. The sum always tends toward Gaussian. This universality explains the Gaussian's ubiquity in natural phenomena, which are often the result of many small, independent contributions.
1. Finite Sample Approximation: For finite n, we approximate:
$$\bar{X}_n \approx N\left(\mu, \frac{\sigma^2}{n}\right)$$
The approximation improves as n grows. Rule of thumb: n ≥ 30 often suffices, but more skewed distributions need larger n.
2. Binomial to Normal: For X ~ Binomial(n, p) with large n:
$$X \approx N(np, np(1-p))$$
3. Sample Mean Distribution: Even if individual observations are non-normal, the sampling distribution of the mean is approximately normal for large samples.
4. Confidence Intervals: The CLT justifies using z-intervals and t-intervals for population means.
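Implication 2 can be checked numerically. A sketch comparing exact Binomial probabilities with the normal approximation (n = 100 and p = 0.3 are arbitrary; the +0.5 continuity correction is a standard refinement):

```python
import numpy as np
from scipy import stats

# X ~ Binomial(n=100, p=0.3) vs. its normal approximation N(np, np(1-p))
n, p = 100, 0.3
binom = stats.binom(n, p)
approx = stats.norm(n * p, np.sqrt(n * p * (1 - p)))

# Compare P(X ≤ k); the +0.5 continuity correction sharpens the approximation
for k in [20, 30, 40]:
    print(f"P(X ≤ {k}): exact = {binom.cdf(k):.4f}, "
          f"normal approx = {approx.cdf(k + 0.5):.4f}")
```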
The Berry-Esseen theorem quantifies how fast the CLT convergence occurs:
$$\sup_z |P(Z_n \leq z) - \Phi(z)| \leq \frac{C \cdot \rho}{\sigma^3 \sqrt{n}}$$
where ρ = E[|X - μ|³] is the third absolute moment and C ≈ 0.4748. The convergence is O(1/√n).
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def demonstrate_clt(original_dist, dist_name, n_samples=10000):
    """
    Demonstrate CLT by showing how sample means converge to Gaussian.
    """
    sample_sizes = [1, 2, 5, 10, 30, 100]
    fig, axes = plt.subplots(2, 3, figsize=(14, 8))

    for ax, n in zip(axes.flatten(), sample_sizes):
        # Generate n_samples sets of n observations each
        data = original_dist.rvs(size=(n_samples, n))
        sample_means = data.mean(axis=1)

        # True parameters of sample mean distribution
        true_mean = original_dist.mean()
        true_std = original_dist.std() / np.sqrt(n)

        # Plot histogram of sample means
        ax.hist(sample_means, bins=50, density=True, alpha=0.7,
                color='steelblue', edgecolor='navy')

        # Overlay the normal approximation
        x = np.linspace(sample_means.min(), sample_means.max(), 200)
        normal_pdf = stats.norm.pdf(x, true_mean, true_std)
        ax.plot(x, normal_pdf, 'r-', linewidth=2, label='Normal approx.')

        ax.set_title(f'n = {n}')
        ax.set_xlabel('Sample Mean')
        ax.legend()

    plt.suptitle(f'CLT Demonstration: {dist_name}', fontsize=14)
    plt.tight_layout()
    plt.show()

# Demonstrate with highly non-normal distributions

# 1. Exponential (skewed right)
print("Exponential Distribution (λ=1):")
exponential = stats.expon(scale=1)
print(f"  Mean: {exponential.mean():.2f}, Variance: {exponential.var():.2f}")
print(f"  Skewness: {float(exponential.stats(moments='s')):.2f} (highly skewed)")
# demonstrate_clt(exponential, "Exponential(λ=1)")

# 2. Uniform (symmetric, non-normal)
print("Uniform[0,1] Distribution:")
uniform = stats.uniform(0, 1)
print(f"  Mean: {uniform.mean():.2f}, Variance: {uniform.var():.4f}")
# demonstrate_clt(uniform, "Uniform[0,1]")

# 3. Bernoulli (discrete)
print("Bernoulli(p=0.3) Distribution:")
bernoulli = stats.bernoulli(0.3)
print(f"  Mean: {bernoulli.mean():.2f}, Variance: {bernoulli.var():.4f}")

# Manual CLT demonstration
np.random.seed(42)
n = 50  # Sample size
n_experiments = 10000

# Draw n_experiments sample means, each from n observations
bernoulli_means = np.random.binomial(n, 0.3, n_experiments) / n

# Compare to normal
mu = 0.3
sigma = np.sqrt(0.3 * 0.7 / n)
print("For n=50 Bernoulli samples:")
print(f"  μ_X̄ = {mu:.3f}, σ_X̄ = {sigma:.4f}")
print(f"  Observed: mean = {bernoulli_means.mean():.4f}, std = {bernoulli_means.std():.4f}")
```

Given observations x₁, x₂, ..., xₙ from a Gaussian distribution, we estimate the unknown parameters μ and σ² using Maximum Likelihood Estimation (MLE).
The likelihood of n i.i.d. Gaussian observations is:
$$L(\mu, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)$$
The log-likelihood is:
$$\ell(\mu, \sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2$$
Taking the derivative with respect to μ:
$$\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^n(x_i-\mu) = 0$$
Solving: $$\sum_{i=1}^n x_i - n\mu = 0 \implies \hat{\mu}_{MLE} = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x}$$
The MLE for μ is the sample mean.
Taking the derivative with respect to σ²:
$$\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n(x_i-\mu)^2 = 0$$
Solving: $$\hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^n(x_i-\bar{x})^2$$
The MLE for σ² is the sample variance (with n denominator).
Interestingly, the MLE for variance is biased:
$$E[\hat{\sigma}^2_{MLE}] = \frac{n-1}{n}\sigma^2 \neq \sigma^2$$
The unbiased estimator divides by (n-1) instead:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^n(x_i-\bar{x})^2$$
This is called Bessel's correction. The intuition: we use one degree of freedom to estimate μ, leaving only n-1 for estimating spread around it.
For large n, the difference is negligible.
| Parameter | MLE Estimator | Unbiased Estimator | Properties |
|---|---|---|---|
| μ (mean) | x̄ = (1/n)Σxᵢ | Same (unbiased) | Var(x̄) = σ²/n |
| σ² (variance) | (1/n)Σ(xᵢ-x̄)² | s² = (1/(n-1))Σ(xᵢ-x̄)² | MLE biased by factor (n-1)/n |
| σ (std dev) | √[(1/n)Σ(xᵢ-x̄)²] | √s² (still slightly biased) | Unbiasing σ is more complex |
```python
import numpy as np
from scipy import stats

def gaussian_mle(samples: np.ndarray) -> dict:
    """
    Maximum Likelihood Estimation for Gaussian parameters.
    Returns both MLE estimates and unbiased estimates with confidence intervals.
    """
    n = len(samples)

    # MLE estimates
    mu_mle = np.mean(samples)
    var_mle = np.var(samples, ddof=0)  # ddof=0 for MLE (divide by n)

    # Unbiased estimates
    var_unbiased = np.var(samples, ddof=1)  # ddof=1 for unbiased (divide by n-1)
    sigma_unbiased = np.std(samples, ddof=1)

    # Standard error of the mean
    se_mean = sigma_unbiased / np.sqrt(n)

    # Confidence intervals for μ using t-distribution
    t_critical = stats.t.ppf(0.975, df=n-1)  # 95% CI
    ci_mean = (mu_mle - t_critical * se_mean, mu_mle + t_critical * se_mean)

    # Confidence interval for σ² using chi-squared distribution
    # (n-1)s²/σ² ~ χ²(n-1)
    chi2_lower = stats.chi2.ppf(0.025, df=n-1)
    chi2_upper = stats.chi2.ppf(0.975, df=n-1)
    ci_var = ((n-1) * var_unbiased / chi2_upper,
              (n-1) * var_unbiased / chi2_lower)

    return {
        'n': n,
        'mu_mle': mu_mle,
        'sigma2_mle': var_mle,
        'sigma2_unbiased': var_unbiased,
        'sigma_unbiased': sigma_unbiased,
        'se_mean': se_mean,
        'ci_mean_95': ci_mean,
        'ci_var_95': ci_var
    }

# Example: Estimate parameters from data
np.random.seed(42)
true_mu, true_sigma = 50, 10
n = 100
data = np.random.normal(true_mu, true_sigma, n)

results = gaussian_mle(data)
print(f"True parameters: μ = {true_mu}, σ = {true_sigma}, σ² = {true_sigma**2}")
print("MLE estimates:")
print(f"  μ̂ = {results['mu_mle']:.4f}")
print(f"  σ̂² (MLE) = {results['sigma2_mle']:.4f}")
print(f"  σ̂² (unbiased) = {results['sigma2_unbiased']:.4f}")
print(f"95% CI for μ: [{results['ci_mean_95'][0]:.4f}, {results['ci_mean_95'][1]:.4f}]")
print(f"95% CI for σ²: [{results['ci_var_95'][0]:.4f}, {results['ci_var_95'][1]:.4f}]")

# Demonstrate bias in variance estimation
print("--- Bias Demonstration ---")
n_simulations = 10000
mle_vars = []
unbiased_vars = []

for _ in range(n_simulations):
    sample = np.random.normal(true_mu, true_sigma, n)
    mle_vars.append(np.var(sample, ddof=0))
    unbiased_vars.append(np.var(sample, ddof=1))

print(f"True σ² = {true_sigma**2}")
print(f"E[σ̂²_MLE] ≈ {np.mean(mle_vars):.4f} (biased)")
print(f"E[s²] ≈ {np.mean(unbiased_vars):.4f} (unbiased)")
print(f"Expected MLE bias: {(n-1)/n * true_sigma**2:.4f}")
```

The Gaussian distribution has remarkable closure properties that make it incredibly tractable for mathematical analysis and practical applications.
If X ~ N(μ, σ²) and Y = aX + b, then:
$$Y \sim N(a\mu + b, a^2\sigma^2)$$
This means scaling and shifting a Gaussian yields another Gaussian. This is fundamental to standardization and z-scores.
If X₁ ~ N(μ₁, σ₁²) and X₂ ~ N(μ₂, σ₂²) are independent, then:
$$X_1 + X_2 \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$$
More generally, for independent Xᵢ ~ N(μᵢ, σᵢ²):
$$\sum_{i=1}^n a_i X_i \sim N\left(\sum_{i=1}^n a_i \mu_i, \sum_{i=1}^n a_i^2 \sigma_i^2\right)$$
Key insight: Variances add, standard deviations do not!
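A quick simulation of this insight (the parameter values are arbitrary): summing N(1, 9) and N(2, 16) samples yields a standard deviation near √25 = 5, not 3 + 4 = 7.

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(1.0, 3.0, 1_000_000)  # X1 ~ N(1, 9)
x2 = rng.normal(2.0, 4.0, 1_000_000)  # X2 ~ N(2, 16)
s = x1 + x2                           # X1 + X2 ~ N(3, 25)

print(f"mean ≈ {s.mean():.3f} (theory: 3)")
print(f"std  ≈ {s.std():.3f} (theory: √25 = 5, not 3 + 4 = 7)")
```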
The product of two Gaussian PDFs (up to normalization) is another Gaussian. This is crucial for Bayesian inference:
If we have a prior N(μ₁, σ₁²) and likelihood N(μ₂, σ₂²), the posterior is N(μₚ, σₚ²) where:
$$\sigma_p^2 = \left(\frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2}\right)^{-1}$$
$$\mu_p = \sigma_p^2 \left(\frac{\mu_1}{\sigma_1^2} + \frac{\mu_2}{\sigma_2^2}\right)$$
This is the precision-weighted average of the means, where precision = 1/variance.
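These two formulas translate directly into code. A minimal sketch (the prior and likelihood values are made up for illustration):

```python
def gaussian_posterior(mu1, var1, mu2, var2):
    """Combine prior N(mu1, var1) with likelihood N(mu2, var2)."""
    var_p = 1.0 / (1.0 / var1 + 1.0 / var2)   # posterior variance
    mu_p = var_p * (mu1 / var1 + mu2 / var2)  # precision-weighted mean
    return mu_p, var_p

# Hypothetical example: prior belief N(0, 4), measurement likelihood N(2, 1)
mu_p, var_p = gaussian_posterior(0.0, 4.0, 2.0, 1.0)
print(f"Posterior: N({mu_p:.2f}, {var_p:.2f})")
```

The posterior mean lands closer to the likelihood's mean because the likelihood has the smaller variance (higher precision).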
For jointly Gaussian variables (X, Y), the conditional distribution Y|X is also Gaussian:
$$Y | X = x \sim N\left(\mu_Y + \rho \frac{\sigma_Y}{\sigma_X}(x - \mu_X), \sigma_Y^2(1-\rho^2)\right)$$
where ρ is the correlation coefficient. This property is fundamental to Gaussian processes and linear regression.
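This formula can be verified by sampling the joint distribution and conditioning on a thin slice around a fixed x—a simulation sketch with arbitrary parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
mu_x, mu_y = 0.0, 0.0
sigma_x, sigma_y, rho = 1.0, 2.0, 0.8

# Sample from the joint Gaussian
cov = [[sigma_x**2, rho * sigma_x * sigma_y],
       [rho * sigma_x * sigma_y, sigma_y**2]]
xy = rng.multivariate_normal([mu_x, mu_y], cov, size=1_000_000)
x, y = xy[:, 0], xy[:, 1]

# Condition on X ≈ 1 by keeping samples in a thin slice
mask = np.abs(x - 1.0) < 0.01
cond_mean_theory = mu_y + rho * (sigma_y / sigma_x) * (1.0 - mu_x)  # = 1.6
cond_var_theory = sigma_y**2 * (1 - rho**2)                         # = 1.44
print(f"E[Y|X=1]:   empirical {y[mask].mean():.3f}, theory {cond_mean_theory}")
print(f"Var[Y|X=1]: empirical {y[mask].var():.3f}, theory {cond_var_theory}")
```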
| Operation | Result | Formula |
|---|---|---|
| Linear transform Y = aX + b | Gaussian | N(aμ + b, a²σ²) |
| Sum of independent Gaussians | Gaussian | Means add, variances add |
| Product of PDFs | Gaussian (unnormalized) | Precision-weighted combination |
| Conditional (joint Gaussian) | Gaussian | Linear regression formula |
| Marginal (joint Gaussian) | Gaussian | Extract relevant parameters |
| Convolution | Gaussian | Variances add |
The Gaussian family is closed under almost every operation we care about: linear combinations, conditioning, marginalization, convolution, and conjugate Bayesian updating. This mathematical closure is why Gaussian assumptions are so common—they lead to analytically tractable solutions. When exact solutions aren't possible, Gaussian approximations (like the Laplace approximation) are often the first resort.
The Gaussian distribution is woven into the fabric of machine learning. Here we explore its key applications in depth.
The standard linear regression model assumes:
$$y = \mathbf{w}^T \mathbf{x} + \epsilon, \quad \epsilon \sim N(0, \sigma^2)$$
This implies: $$y | \mathbf{x} \sim N(\mathbf{w}^T \mathbf{x}, \sigma^2)$$
The log-likelihood for n observations is:
$$\ell(\mathbf{w}, \sigma^2) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n(y_i - \mathbf{w}^T\mathbf{x}_i)^2$$
Maximizing with respect to w is equivalent to minimizing the sum of squared errors—the famous least squares criterion emerges naturally from Gaussian noise assumptions!
```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp

# 1. LINEAR REGRESSION FROM GAUSSIAN LIKELIHOOD
def linear_regression_mle(X, y):
    """
    Derive linear regression from Gaussian noise assumption.

    y = X @ w + ε, where ε ~ N(0, σ²I)
    MLE for w: minimize ||y - Xw||²
    Solution: w = (X'X)^{-1} X'y
    """
    # Add bias term
    X_bias = np.column_stack([np.ones(len(X)), X])

    # MLE (equivalent to OLS)
    w_mle = np.linalg.lstsq(X_bias, y, rcond=None)[0]

    # Predicted values
    y_pred = X_bias @ w_mle

    # MLE for σ²
    residuals = y - y_pred
    sigma2_mle = np.mean(residuals ** 2)

    # Log-likelihood at the MLE
    n = len(y)
    log_likelihood = -n/2 * np.log(2 * np.pi * sigma2_mle) - n/2

    return {
        'weights': w_mle,
        'sigma2': sigma2_mle,
        'log_likelihood': log_likelihood,
        'predictions': y_pred
    }

# 2. GAUSSIAN NAIVE BAYES
class GaussianNaiveBayes:
    """
    Gaussian Naive Bayes classifier.
    Assumes features are Gaussian given class: P(x_j | y=c) = N(μ_jc, σ²_jc)
    """
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.n_features = X.shape[1]

        # Store class priors and parameters
        self.priors = {}
        self.means = {}
        self.variances = {}

        for c in self.classes:
            X_c = X[y == c]
            self.priors[c] = len(X_c) / len(X)
            self.means[c] = X_c.mean(axis=0)
            self.variances[c] = X_c.var(axis=0) + 1e-9  # Smoothing

    def _log_likelihood(self, x, c):
        """Compute log P(x | y=c) assuming Gaussian features."""
        log_prob = 0
        for j in range(self.n_features):
            log_prob += stats.norm.logpdf(
                x[j], self.means[c][j], np.sqrt(self.variances[c][j])
            )
        return log_prob

    def predict_proba(self, X):
        """Compute posterior probabilities P(y=c | x)."""
        log_posteriors = np.zeros((len(X), len(self.classes)))
        for i, x in enumerate(X):
            for j, c in enumerate(self.classes):
                log_posteriors[i, j] = (np.log(self.priors[c])
                                        + self._log_likelihood(x, c))

        # Normalize in log space using log-sum-exp for numerical stability
        log_norm = logsumexp(log_posteriors, axis=1, keepdims=True)
        posteriors = np.exp(log_posteriors - log_norm)
        return posteriors

    def predict(self, X):
        return self.classes[self.predict_proba(X).argmax(axis=1)]

# Example usage
np.random.seed(42)
X = np.vstack([
    np.random.normal([0, 0], [1, 1], (50, 2)),
    np.random.normal([3, 3], [1, 1], (50, 2))
])
y = np.array([0]*50 + [1]*50)

gnb = GaussianNaiveBayes()
gnb.fit(X, y)
accuracy = (gnb.predict(X) == y).mean()
print(f"Gaussian Naive Bayes accuracy: {accuracy:.2%}")
```

Gaussian Processes (GPs) define distributions over functions, where any finite collection of function values is jointly Gaussian:
$$[f(x_1), f(x_2), \ldots, f(x_n)]^T \sim N(\mathbf{m}, \mathbf{K})$$
where m is the mean vector with entries mᵢ = m(xᵢ), and K is the covariance (kernel) matrix with entries Kᵢⱼ = k(xᵢ, xⱼ).
GPs provide nonparametric regression with closed-form posteriors and principled uncertainty estimates for every prediction.
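A minimal sketch of this definition: pick a kernel (the squared-exponential below is one common choice, assumed here), build the covariance matrix over a grid of inputs, and draw function samples from the resulting multivariate Gaussian.

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance: k(x, x') = σ² exp(-(x-x')²/2ℓ²)."""
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)

# Any finite set of inputs induces a joint Gaussian over function values
x = np.linspace(0, 5, 50)
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))  # jitter for numerical stability

rng = np.random.default_rng(3)
f_samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
print(f"Drew {f_samples.shape[0]} function samples at {f_samples.shape[1]} inputs")
```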
VAEs use Gaussian distributions in the latent space: the prior over latent codes is typically N(0, I), and the encoder outputs the mean and variance of an approximate Gaussian posterior.
The "reparameterization trick" enables backpropagation through Gaussian sampling:
$$z = \mu + \sigma \cdot \epsilon, \quad \epsilon \sim N(0, I)$$
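The trick can be illustrated outside any deep learning framework. A NumPy sketch (the encoder outputs μ and log σ² are hypothetical values) showing that z = μ + σ·ε has the intended distribution:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical encoder outputs for one data point
mu, log_var = 0.5, np.log(0.25)  # σ² = 0.25, so σ = 0.5
sigma = np.exp(0.5 * log_var)

# z = μ + σ·ε: randomness is isolated in ε ~ N(0, 1), so gradients
# can flow through μ and σ in an autodiff framework
eps = rng.standard_normal(100_000)
z = mu + sigma * eps

print(f"mean(z) ≈ {z.mean():.3f} (target μ = {mu})")
print(f"std(z)  ≈ {z.std():.3f} (target σ = {sigma})")
```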
Many statistical methods assume normality. Before applying such methods, we should verify whether the assumption is reasonable. Here are the main approaches:
1. Histogram: Compare the empirical distribution to the theoretical Gaussian bell curve.
2. Q-Q Plot (Quantile-Quantile): Plot sample quantiles against theoretical Gaussian quantiles. If data is Gaussian, points lie on a straight line.
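A Q-Q plot sketch using scipy's `probplot` (sample sizes and distributions chosen arbitrarily) that contrasts Gaussian data with skewed data:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
normal_data = rng.normal(0, 1, 300)
skewed_data = rng.exponential(1, 300)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
stats.probplot(normal_data, dist='norm', plot=axes[0])
axes[0].set_title('Gaussian data: points hug the line')
stats.probplot(skewed_data, dist='norm', plot=axes[1])
axes[1].set_title('Exponential data: curved tails')
plt.tight_layout()
plt.show()
```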
Shapiro-Wilk Test: Most powerful for small to moderate samples (n < 5000).
Kolmogorov-Smirnov Test: Compares empirical CDF to theoretical normal CDF.
Anderson-Darling Test: Similar to K-S but gives more weight to tails.
D'Agostino-Pearson Test: Combines skewness and kurtosis tests.
```python
import numpy as np
from scipy import stats

def comprehensive_normality_test(data, alpha=0.05):
    """
    Perform comprehensive statistical normality testing.
    """
    n = len(data)
    results = {'n': n, 'alpha': alpha}

    # Descriptive statistics
    results['mean'] = np.mean(data)
    results['std'] = np.std(data, ddof=1)
    results['skewness'] = stats.skew(data)
    results['kurtosis'] = stats.kurtosis(data)  # Excess kurtosis (0 for normal)

    # Statistical tests
    # 1. Shapiro-Wilk (best for n < 5000)
    if n <= 5000:
        stat, p = stats.shapiro(data)
        results['shapiro_wilk'] = {'statistic': stat, 'p_value': p, 'normal': p > alpha}

    # 2. Kolmogorov-Smirnov (against standard normal after standardization)
    standardized = (data - np.mean(data)) / np.std(data, ddof=1)
    stat, p = stats.kstest(standardized, 'norm')
    results['kolmogorov_smirnov'] = {'statistic': stat, 'p_value': p, 'normal': p > alpha}

    # 3. Anderson-Darling
    result = stats.anderson(data, dist='norm')
    # Use 5% significance level (index 2)
    results['anderson_darling'] = {
        'statistic': result.statistic,
        'critical_value_5%': result.critical_values[2],
        'normal': result.statistic < result.critical_values[2]
    }

    # 4. D'Agostino-Pearson (requires n >= 20)
    if n >= 20:
        stat, p = stats.normaltest(data)
        results['dagostino_pearson'] = {'statistic': stat, 'p_value': p, 'normal': p > alpha}

    return results

def print_normality_results(results):
    print(f"Normality Test Results (n = {results['n']}, α = {results['alpha']})")
    print("=" * 60)
    print(f"Skewness: {results['skewness']:.4f} (0 for normal)")
    print(f"Excess Kurtosis: {results['kurtosis']:.4f} (0 for normal)")
    print("-" * 60)

    if 'shapiro_wilk' in results:
        r = results['shapiro_wilk']
        print(f"Shapiro-Wilk: W={r['statistic']:.4f}, p={r['p_value']:.4f} "
              f"→ {'Normal' if r['normal'] else 'NOT Normal'}")

    r = results['kolmogorov_smirnov']
    print(f"Kolmogorov-Smirnov: D={r['statistic']:.4f}, p={r['p_value']:.4f} "
          f"→ {'Normal' if r['normal'] else 'NOT Normal'}")

    r = results['anderson_darling']
    print(f"Anderson-Darling: A²={r['statistic']:.4f}, crit={r['critical_value_5%']:.4f} "
          f"→ {'Normal' if r['normal'] else 'NOT Normal'}")

    if 'dagostino_pearson' in results:
        r = results['dagostino_pearson']
        print(f"D'Agostino-Pearson: K²={r['statistic']:.4f}, p={r['p_value']:.4f} "
              f"→ {'Normal' if r['normal'] else 'NOT Normal'}")

# Test with different distributions
np.random.seed(42)

print("\n" + "="*60)
print("Testing NORMAL data:")
normal_data = np.random.normal(50, 10, 200)
print_normality_results(comprehensive_normality_test(normal_data))

print("\n" + "="*60)
print("Testing EXPONENTIAL data (should reject normality):")
exp_data = np.random.exponential(10, 200)
print_normality_results(comprehensive_normality_test(exp_data))
```

With large samples, normality tests will reject almost any real data (which is never exactly Gaussian). Focus on whether departures from normality are practically significant. The CLT often makes the normal approximation work even with non-normal data. Visual inspection via Q-Q plots is often more informative than p-values for large n.
The Gaussian distribution is the most important continuous distribution in machine learning and statistics.
What's Next:
We've covered discrete (Bernoulli/Binomial) and continuous (Gaussian) distributions. Next, we explore the Poisson distribution—the natural model for counting rare events over time or space. The Poisson connects to the Binomial (as a limit) and is fundamental to event modeling, queuing theory, and count-based machine learning problems.
You now have a comprehensive understanding of the Gaussian distribution—from its mathematical definition through the Central Limit Theorem to its foundational role in machine learning. The Gaussian's mathematical tractability and theoretical justification make it the default choice for modeling continuous uncertainty.