If you could learn only one probability distribution, it should be the Gaussian distribution—also known as the Normal distribution. Named after Carl Friedrich Gauss (1777–1855), who used it to analyze astronomical data, this distribution appears with remarkable ubiquity across science, engineering, and machine learning.

Why is the Gaussian so special? The answer lies in the Central Limit Theorem: the sum of many independent random variables (regardless of their original distributions) tends toward a Gaussian. This explains why measurement errors, biological traits, stock returns, and countless other phenomena follow bell-shaped curves—they are aggregates of many small, independent effects.

In machine learning, the Gaussian is foundational:

- Linear regression assumes Gaussian noise
- Gaussian Processes define distributions over functions
- Variational Autoencoders use Gaussian latent spaces
- Batch Normalization transforms activations toward Gaussian
- Maximum likelihood estimation often leads to least-squares when noise is Gaussian

Mastering the Gaussian distribution is essential for any serious machine learning practitioner.
By the end of this page, you will understand the Gaussian distribution from first principles: its mathematical definition, key properties, the standard normal and z-scores, the Central Limit Theorem, parameter estimation via MLE, and its pervasive applications throughout machine learning.
A continuous random variable X follows a Gaussian (Normal) distribution with mean μ and variance σ², written X ~ N(μ, σ²), if its probability density function (PDF) is:

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Equivalently, using the standard notation:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$

where:

- μ (mu) is the mean or location parameter—the center of the distribution
- σ (sigma) is the standard deviation—controls the spread
- σ² is the variance
- The support is the entire real line: x ∈ (-∞, +∞)
The Gaussian PDF has two parts: (1) The normalizing constant 1/(σ√(2π)) ensures the total area under the curve equals 1. (2) The exponential term exp(-(x-μ)²/(2σ²)) creates the characteristic bell shape, with the quadratic in the exponent producing symmetry around μ and decay as x moves away from μ.
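To connect the formula to code, here is a short sketch (illustrative only; the parameters μ = 2, σ = 1.5 are arbitrary choices) that evaluates the PDF directly, checks it against `scipy.stats.norm.pdf`, and verifies numerically that the normalizing constant makes the total area equal 1:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

mu, sigma = 2.0, 1.5   # arbitrary example parameters

def gaussian_pdf(x, mu, sigma):
    """Evaluate the Gaussian PDF directly from its formula."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Agreement with SciPy's implementation
x = np.linspace(mu - 5 * sigma, mu + 5 * sigma, 7)
assert np.allclose(gaussian_pdf(x, mu, sigma), stats.norm.pdf(x, mu, sigma))

# The normalizing constant makes the total area equal 1
area, _ = quad(gaussian_pdf, -np.inf, np.inf, args=(mu, sigma))
print(f"Total area under the PDF: {area:.6f}")   # ≈ 1.000000

# The peak sits at x = μ, with height 1/(σ√(2π))
print(f"f(μ) = {gaussian_pdf(mu, mu, sigma):.4f}, "
      f"1/(σ√2π) = {1 / (sigma * np.sqrt(2 * np.pi)):.4f}")
```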
| Property | Formula/Value | Interpretation |
|---|---|---|
| Support | x ∈ (-∞, +∞) | Unbounded; assigns small probability to extreme values |
| Mean | E[X] = μ | Center of the distribution |
| Variance | Var(X) = σ² | Spread around the mean |
| Mode | μ | Most likely value (peak of PDF) |
| Median | μ | 50th percentile |
| Skewness | 0 | Perfectly symmetric around μ |
| Kurtosis | 3 (excess = 0) | Reference for 'normal' tail behavior |
| MGF | exp(μt + σ²t²/2) | Moment generating function |
| Entropy | ½ ln(2πeσ²) | Differential entropy in nats (natural log) |
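As a quick sanity check on this table, the following sketch (illustrative only; μ = 3, σ = 2 are arbitrary) compares SciPy's reported moments and entropy with the closed-form values above, and spot-checks the MGF at one point via Monte Carlo. Note that `scipy.stats.norm.entropy()` reports differential entropy in nats, matching ½ ln(2πeσ²):

```python
import numpy as np
from scipy import stats

mu, sigma = 3.0, 2.0                     # arbitrary example parameters
X = stats.norm(loc=mu, scale=sigma)

# First four moments reported by SciPy vs. the table
mean, var, skew, kurt = (float(v) for v in X.stats(moments='mvsk'))
print(f"Mean            : {mean:.4f}  (μ = {mu})")
print(f"Variance        : {var:.4f}  (σ² = {sigma**2})")
print(f"Skewness        : {skew:.4f}  (0 for any Gaussian)")
print(f"Excess kurtosis : {kurt:.4f}  (0 for any Gaussian)")

# Differential entropy: ½ ln(2πeσ²), in nats
entropy_formula = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
print(f"Entropy         : {float(X.entropy()):.4f} nats  (formula: {entropy_formula:.4f})")

# MGF at a test point t: E[exp(tX)] = exp(μt + σ²t²/2)
t = 0.7
mgf_formula = np.exp(mu * t + sigma**2 * t**2 / 2)
mgf_monte_carlo = np.exp(t * X.rvs(size=200_000, random_state=0)).mean()
print(f"MGF at t = {t}  : formula {mgf_formula:.3f}, Monte Carlo ≈ {mgf_monte_carlo:.3f}")
```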
The Standard Normal Distribution is the special case with μ = 0 and σ² = 1, denoted Z ~ N(0, 1).

PDF of the Standard Normal:

$$\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$$

CDF of the Standard Normal:

$$\Phi(z) = \int_{-\infty}^{z} \phi(t) \, dt = \frac{1}{2}\left[1 + \text{erf}\left(\frac{z}{\sqrt{2}}\right)\right]$$

where erf is the error function.

### Standardization (Z-Scores)

Any Gaussian can be converted to the standard normal via standardization:

If X ~ N(μ, σ²), then:

$$Z = \frac{X - \mu}{\sigma} \sim N(0, 1)$$

Conversely, if Z ~ N(0, 1):

$$X = \mu + \sigma Z \sim N(\mu, \sigma^2)$$

The z-score tells us how many standard deviations a value is from the mean.
```python
import numpy as np
from scipy import stats

# Standard normal distribution
standard_normal = stats.norm(loc=0, scale=1)

# Key quantiles
print("Standard Normal Distribution Z ~ N(0,1)")
print("=" * 50)

# Probability within k standard deviations
for k in [1, 2, 3, 4]:
    prob = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"P(-{k} < Z < {k}) = {prob:.6f} ({prob*100:.3f}%)")

print()

# Critical values for confidence intervals
confidence_levels = [0.90, 0.95, 0.99]
print("Critical z-values for two-tailed confidence intervals:")
for cl in confidence_levels:
    alpha = 1 - cl
    z_critical = standard_normal.ppf(1 - alpha / 2)
    print(f"  {cl*100:.0f}% CI: z = ±{z_critical:.4f}")

print()

# Converting between X ~ N(μ,σ²) and Z ~ N(0,1)
mu, sigma = 100, 15   # Example: IQ scores
x = 130               # A specific IQ score

z = (x - mu) / sigma  # Standardize
prob_below = standard_normal.cdf(z)
print(f"IQ scores: X ~ N({mu}, {sigma}²)")
print(f"IQ = {x} → z-score = {z:.2f}")
print(f"P(IQ < {x}) = P(Z < {z:.2f}) = {prob_below:.4f}")
print(f"Percentile: {prob_below*100:.1f}th")
```

The standard normal serves as a universal reference. Any question about a Gaussian N(μ,σ²) can be answered by standardizing to N(0,1) and using precomputed tables or functions. This is why software libraries need only implement the standard normal—all other Gaussians follow from linear transformation.
The Central Limit Theorem (CLT) is one of the most profound results in probability theory, explaining why the Gaussian distribution is so prevalent. It states that the sum (or average) of many independent random variables tends toward a Gaussian, regardless of the original distributions.

### Formal Statement

Let X₁, X₂, ..., Xₙ be independent and identically distributed (i.i.d.) random variables with:

- Mean: E[Xᵢ] = μ
- Variance: Var(Xᵢ) = σ² < ∞

Define the sample mean:

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$$

Then as n → ∞:

$$\sqrt{n}\left(\bar{X}_n - \mu\right) \stackrel{d}{\rightarrow} N(0, \sigma^2)$$

Equivalently, the standardized sum converges to a standard normal:

$$Z_n = \frac{\sum_{i=1}^n X_i - n\mu}{\sigma\sqrt{n}} = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \stackrel{d}{\rightarrow} N(0, 1)$$
The CLT is remarkable because it doesn't matter what distribution the Xᵢ come from—uniform, exponential, Poisson, Bernoulli, or any other with finite variance. The sum always tends toward Gaussian. This universality explains the Gaussian's ubiquity in natural phenomena, which are often the result of many small, independent contributions.
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def demonstrate_clt(original_dist, dist_name, n_samples=10000):
    """
    Demonstrate the CLT by showing how sample means converge to a Gaussian.
    """
    sample_sizes = [1, 2, 5, 10, 30, 100]
    fig, axes = plt.subplots(2, 3, figsize=(14, 8))

    for ax, n in zip(axes.flatten(), sample_sizes):
        # Generate n_samples sets of n observations each
        data = original_dist.rvs(size=(n_samples, n))
        sample_means = data.mean(axis=1)

        # True parameters of the sample-mean distribution
        true_mean = original_dist.mean()
        true_std = original_dist.std() / np.sqrt(n)

        # Plot histogram of sample means
        ax.hist(sample_means, bins=50, density=True, alpha=0.7,
                color='steelblue', edgecolor='navy')

        # Overlay the normal approximation
        x = np.linspace(sample_means.min(), sample_means.max(), 200)
        normal_pdf = stats.norm.pdf(x, true_mean, true_std)
        ax.plot(x, normal_pdf, 'r-', linewidth=2, label='Normal approx.')

        ax.set_title(f'n = {n}')
        ax.set_xlabel('Sample Mean')
        ax.legend()

    plt.suptitle(f'CLT Demonstration: {dist_name}', fontsize=14)
    plt.tight_layout()
    plt.show()

# Demonstrate with highly non-normal distributions

# 1. Exponential (skewed right)
print("Exponential Distribution (λ=1):")
exponential = stats.expon(scale=1)
print(f"  Mean: {exponential.mean():.2f}, Variance: {exponential.var():.2f}")
print(f"  Skewness: {float(exponential.stats(moments='s')):.2f} (highly skewed)")
# demonstrate_clt(exponential, "Exponential(λ=1)")

# 2. Uniform (symmetric, non-normal)
print("\nUniform[0,1] Distribution:")
uniform = stats.uniform(0, 1)
print(f"  Mean: {uniform.mean():.2f}, Variance: {uniform.var():.4f}")
# demonstrate_clt(uniform, "Uniform[0,1]")

# 3. Bernoulli (discrete)
print("\nBernoulli(p=0.3) Distribution:")
bernoulli = stats.bernoulli(0.3)
print(f"  Mean: {bernoulli.mean():.2f}, Variance: {bernoulli.var():.4f}")

# Manual CLT demonstration
np.random.seed(42)
n = 50                  # Sample size
n_experiments = 10000

# Draw n_experiments sample means, each from n observations
bernoulli_means = np.random.binomial(n, 0.3, n_experiments) / n

# Compare to the normal approximation
mu = 0.3
sigma = np.sqrt(0.3 * 0.7 / n)
print(f"\nFor n=50 Bernoulli samples:")
print(f"  μ_X̄ = {mu:.3f}, σ_X̄ = {sigma:.4f}")
print(f"  Observed: mean = {bernoulli_means.mean():.4f}, std = {bernoulli_means.std():.4f}")
```

Given observations x₁, x₂, ..., xₙ from a Gaussian distribution, we estimate the unknown parameters μ and σ² using Maximum Likelihood Estimation (MLE).

### Deriving the MLE

The likelihood of n i.i.d. Gaussian observations is:

$$L(\mu, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)$$

The log-likelihood is:

$$\ell(\mu, \sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2$$

### Solving for μ

Taking the derivative with respect to μ:

$$\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^n(x_i-\mu) = 0$$

Solving:

$$\sum_{i=1}^n x_i - n\mu = 0 \implies \hat{\mu}_{MLE} = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x}$$

The MLE for μ is the sample mean.
| Parameter | MLE Estimator | Unbiased Estimator | Properties |
|---|---|---|---|
| μ (mean) | x̄ = (1/n)Σxᵢ | Same (unbiased) | Var(x̄) = σ²/n |
| σ² (variance) | (1/n)Σ(xᵢ-x̄)² | s² = (1/(n-1))Σ(xᵢ-x̄)² | MLE biased by factor (n-1)/n |
| σ (std dev) | √[(1/n)Σ(xᵢ-x̄)²] | √s² (still slightly biased) | Unbiasing σ is more complex |
```python
import numpy as np
from scipy import stats

def gaussian_mle(samples: np.ndarray) -> dict:
    """
    Maximum Likelihood Estimation for Gaussian parameters.

    Returns both MLE estimates and unbiased estimates,
    with confidence intervals.
    """
    n = len(samples)

    # MLE estimates
    mu_mle = np.mean(samples)
    var_mle = np.var(samples, ddof=0)        # ddof=0 for MLE (divide by n)

    # Unbiased estimates
    var_unbiased = np.var(samples, ddof=1)   # ddof=1 for unbiased (divide by n-1)
    sigma_unbiased = np.std(samples, ddof=1)

    # Standard error of the mean
    se_mean = sigma_unbiased / np.sqrt(n)

    # Confidence interval for μ using the t-distribution
    t_critical = stats.t.ppf(0.975, df=n-1)  # 95% CI
    ci_mean = (mu_mle - t_critical * se_mean,
               mu_mle + t_critical * se_mean)

    # Confidence interval for σ² using the chi-squared distribution
    # (n-1)s²/σ² ~ χ²(n-1)
    chi2_lower = stats.chi2.ppf(0.025, df=n-1)
    chi2_upper = stats.chi2.ppf(0.975, df=n-1)
    ci_var = ((n-1) * var_unbiased / chi2_upper,
              (n-1) * var_unbiased / chi2_lower)

    return {
        'n': n,
        'mu_mle': mu_mle,
        'sigma2_mle': var_mle,
        'sigma2_unbiased': var_unbiased,
        'sigma_unbiased': sigma_unbiased,
        'se_mean': se_mean,
        'ci_mean_95': ci_mean,
        'ci_var_95': ci_var
    }

# Example: Estimate parameters from data
np.random.seed(42)
true_mu, true_sigma = 50, 10
n = 100
data = np.random.normal(true_mu, true_sigma, n)

results = gaussian_mle(data)
print(f"True parameters: μ = {true_mu}, σ = {true_sigma}, σ² = {true_sigma**2}")
print(f"\nMLE estimates:")
print(f"  μ̂ = {results['mu_mle']:.4f}")
print(f"  σ̂² (MLE) = {results['sigma2_mle']:.4f}")
print(f"  σ̂² (unbiased) = {results['sigma2_unbiased']:.4f}")
print(f"\n95% CI for μ: [{results['ci_mean_95'][0]:.4f}, {results['ci_mean_95'][1]:.4f}]")
print(f"95% CI for σ²: [{results['ci_var_95'][0]:.4f}, {results['ci_var_95'][1]:.4f}]")

# Demonstrate bias in variance estimation
print("\n--- Bias Demonstration ---")
n_simulations = 10000
mle_vars = []
unbiased_vars = []

for _ in range(n_simulations):
    sample = np.random.normal(true_mu, true_sigma, n)
    mle_vars.append(np.var(sample, ddof=0))
    unbiased_vars.append(np.var(sample, ddof=1))

print(f"True σ² = {true_sigma**2}")
print(f"E[σ̂²_MLE] ≈ {np.mean(mle_vars):.4f} (biased)")
print(f"E[s²] ≈ {np.mean(unbiased_vars):.4f} (unbiased)")
print(f"Theoretical E[σ̂²_MLE] = (n-1)/n · σ² = {(n-1)/n * true_sigma**2:.4f}")
```

The Gaussian distribution has remarkable closure properties that make it incredibly tractable for mathematical analysis and practical applications.

### Linear Transformation

If X ~ N(μ, σ²) and Y = aX + b, then:

$$Y \sim N(a\mu + b, a^2\sigma^2)$$

This means scaling and shifting a Gaussian yields another Gaussian. This is fundamental to standardization and z-scores.

### Sum of Independent Gaussians

If X₁ ~ N(μ₁, σ₁²) and X₂ ~ N(μ₂, σ₂²) are independent, then:

$$X_1 + X_2 \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$$

More generally, for independent Xᵢ ~ N(μᵢ, σᵢ²):

$$\sum_{i=1}^n a_i X_i \sim N\left(\sum_{i=1}^n a_i \mu_i, \sum_{i=1}^n a_i^2 \sigma_i^2\right)$$

Key insight: Variances add, standard deviations do not!
| Operation | Result | Formula |
|---|---|---|
| Linear transform Y = aX + b | Gaussian | N(aμ + b, a²σ²) |
| Sum of independent Gaussians | Gaussian | Means add, variances add |
| Product of PDFs | Gaussian (unnormalized) | Precision-weighted combination |
| Conditional (joint Gaussian) | Gaussian | Linear regression formula |
| Marginal (joint Gaussian) | Gaussian | Extract relevant parameters |
| Convolution | Gaussian | Variances add |
The Gaussian family is closed under almost every operation we care about: linear combinations, conditioning, marginalization, convolution, and conjugate Bayesian updating. This mathematical closure is why Gaussian assumptions are so common—they lead to analytically tractable solutions. When exact solutions aren't possible, Gaussian approximations (like the Laplace approximation) are often the first resort.
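These closure properties are easy to check by simulation. The sketch below (illustrative only; all parameters are arbitrary choices) verifies the linear-transformation and sum rules empirically, and confirms that the sum of two independent Gaussians is itself Gaussian:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N = 1_000_000

# Linear transformation: Y = aX + b with X ~ N(μ, σ²)  →  N(aμ + b, a²σ²)
mu, sigma, a, b = 5.0, 2.0, 3.0, -1.0
X = rng.normal(mu, sigma, N)
Y = a * X + b
print(f"Y = {a}X + {b}: mean {Y.mean():.3f} (theory {a*mu + b}), "
      f"var {Y.var():.3f} (theory {a**2 * sigma**2})")

# Sum of independent Gaussians: means add, VARIANCES add (not standard deviations)
mu1, s1, mu2, s2 = 1.0, 2.0, 4.0, 3.0
S = rng.normal(mu1, s1, N) + rng.normal(mu2, s2, N)
print(f"X1 + X2: mean {S.mean():.3f} (theory {mu1 + mu2}), "
      f"var {S.var():.3f} (theory {s1**2 + s2**2}), "
      f"std {S.std():.3f} (not {s1 + s2})")

# The sum is still Gaussian: compare its standardized values to N(0, 1)
Z = (S - (mu1 + mu2)) / np.sqrt(s1**2 + s2**2)
ks_stat, p_value = stats.kstest(Z[:5000], 'norm')   # subsample for a meaningful p-value
print(f"KS test of standardized sum vs N(0,1): D = {ks_stat:.4f}, p = {p_value:.3f}")
```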
The Gaussian distribution is woven into the fabric of machine learning. Here we explore its key applications in depth.

### Linear Regression

The standard linear regression model assumes:

$$y = \mathbf{w}^T \mathbf{x} + \epsilon, \quad \epsilon \sim N(0, \sigma^2)$$

This implies:

$$y | \mathbf{x} \sim N(\mathbf{w}^T \mathbf{x}, \sigma^2)$$

The log-likelihood for n observations is:

$$\ell(\mathbf{w}, \sigma^2) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n(y_i - \mathbf{w}^T\mathbf{x}_i)^2$$

Maximizing with respect to w is equivalent to minimizing the sum of squared errors—the famous least squares criterion emerges naturally from Gaussian noise assumptions!
```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp

# 1. LINEAR REGRESSION FROM GAUSSIAN LIKELIHOOD
def linear_regression_mle(X, y):
    """
    Derive linear regression from the Gaussian noise assumption.

    y = X @ w + ε, where ε ~ N(0, σ²I)

    MLE for w: minimize ||y - Xw||²
    Solution: w = (X'X)^{-1} X'y
    """
    # Add bias term
    X_bias = np.column_stack([np.ones(len(X)), X])

    # MLE (equivalent to OLS)
    w_mle = np.linalg.lstsq(X_bias, y, rcond=None)[0]

    # Predicted values
    y_pred = X_bias @ w_mle

    # MLE for σ²
    residuals = y - y_pred
    sigma2_mle = np.mean(residuals ** 2)

    # Log-likelihood at the MLE
    n = len(y)
    log_likelihood = -n/2 * np.log(2 * np.pi * sigma2_mle) - n/2

    return {
        'weights': w_mle,
        'sigma2': sigma2_mle,
        'log_likelihood': log_likelihood,
        'predictions': y_pred
    }

# 2. GAUSSIAN NAIVE BAYES
class GaussianNaiveBayes:
    """
    Gaussian Naive Bayes classifier.

    Assumes features are Gaussian given the class:
    P(x_j | y=c) = N(μ_jc, σ²_jc)
    """

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.n_features = X.shape[1]

        # Store class priors and per-class Gaussian parameters
        self.priors = {}
        self.means = {}
        self.variances = {}

        for c in self.classes:
            X_c = X[y == c]
            self.priors[c] = len(X_c) / len(X)
            self.means[c] = X_c.mean(axis=0)
            self.variances[c] = X_c.var(axis=0) + 1e-9   # Smoothing

    def _log_likelihood(self, x, c):
        """Compute log P(x | y=c) assuming Gaussian features."""
        log_prob = 0
        for j in range(self.n_features):
            log_prob += stats.norm.logpdf(
                x[j], self.means[c][j], np.sqrt(self.variances[c][j])
            )
        return log_prob

    def predict_proba(self, X):
        """Compute posterior probabilities P(y=c | x)."""
        log_posteriors = np.zeros((len(X), len(self.classes)))

        for i, x in enumerate(X):
            for j, c in enumerate(self.classes):
                log_posteriors[i, j] = (np.log(self.priors[c])
                                        + self._log_likelihood(x, c))

        # Normalize using log-sum-exp for numerical stability
        log_norm = logsumexp(log_posteriors, axis=1, keepdims=True)
        posteriors = np.exp(log_posteriors - log_norm)
        return posteriors

    def predict(self, X):
        return self.classes[self.predict_proba(X).argmax(axis=1)]

# Example usage
np.random.seed(42)
X = np.vstack([
    np.random.normal([0, 0], [1, 1], (50, 2)),
    np.random.normal([3, 3], [1, 1], (50, 2))
])
y = np.array([0]*50 + [1]*50)

gnb = GaussianNaiveBayes()
gnb.fit(X, y)
accuracy = (gnb.predict(X) == y).mean()
print(f"Gaussian Naive Bayes accuracy: {accuracy:.2%}")
```

Many statistical methods assume normality. Before applying such methods, we should verify whether the assumption is reasonable. Here are the main approaches:

### Visual Methods

1. Histogram: Compare the empirical distribution to the theoretical Gaussian bell curve.

2. Q-Q Plot (Quantile-Quantile): Plot sample quantiles against theoretical Gaussian quantiles. If the data is Gaussian, the points lie on a straight line.
   - Points curving up on the right → right skew
   - Points curving down on the left → left skew
   - S-shape → heavy tails
   - Inverted S → light tails

### Statistical Tests

Shapiro-Wilk Test: Most powerful for small to moderate samples (n < 5000).
- H₀: the data comes from a normal distribution
- Small p-value → reject normality

Kolmogorov-Smirnov Test: Compares the empirical CDF to the theoretical normal CDF.
- Less powerful than Shapiro-Wilk
- Sensitive to any departure from normality

Anderson-Darling Test: Similar to K-S but gives more weight to the tails.
- Good for detecting tail departures

D'Agostino-Pearson Test: Combines skewness and kurtosis tests.
- Tests based on the third and fourth moments
```python
import numpy as np
from scipy import stats

def comprehensive_normality_test(data, alpha=0.05):
    """
    Run a battery of statistical normality tests plus summary statistics.
    """
    n = len(data)
    results = {'n': n, 'alpha': alpha}

    # Descriptive statistics
    results['mean'] = np.mean(data)
    results['std'] = np.std(data, ddof=1)
    results['skewness'] = stats.skew(data)
    results['kurtosis'] = stats.kurtosis(data)   # Excess kurtosis (0 for normal)

    # Statistical tests
    # 1. Shapiro-Wilk (best for n < 5000)
    if n <= 5000:
        stat, p = stats.shapiro(data)
        results['shapiro_wilk'] = {'statistic': stat, 'p_value': p, 'normal': p > alpha}

    # 2. Kolmogorov-Smirnov (against the standard normal after standardization)
    standardized = (data - np.mean(data)) / np.std(data, ddof=1)
    stat, p = stats.kstest(standardized, 'norm')
    results['kolmogorov_smirnov'] = {'statistic': stat, 'p_value': p, 'normal': p > alpha}

    # 3. Anderson-Darling
    result = stats.anderson(data, dist='norm')
    # Use the 5% significance level (index 2)
    results['anderson_darling'] = {
        'statistic': result.statistic,
        'critical_value_5%': result.critical_values[2],
        'normal': result.statistic < result.critical_values[2]
    }

    # 4. D'Agostino-Pearson (requires n >= 20)
    if n >= 20:
        stat, p = stats.normaltest(data)
        results['dagostino_pearson'] = {'statistic': stat, 'p_value': p, 'normal': p > alpha}

    return results

def print_normality_results(results):
    print(f"\nNormality Test Results (n = {results['n']}, α = {results['alpha']})")
    print("=" * 60)
    print(f"Skewness: {results['skewness']:.4f} (0 for normal)")
    print(f"Excess Kurtosis: {results['kurtosis']:.4f} (0 for normal)")
    print("-" * 60)

    if 'shapiro_wilk' in results:
        r = results['shapiro_wilk']
        print(f"Shapiro-Wilk: W={r['statistic']:.4f}, p={r['p_value']:.4f} "
              f"→ {'Normal' if r['normal'] else 'NOT Normal'}")

    r = results['kolmogorov_smirnov']
    print(f"Kolmogorov-Smirnov: D={r['statistic']:.4f}, p={r['p_value']:.4f} "
          f"→ {'Normal' if r['normal'] else 'NOT Normal'}")

    r = results['anderson_darling']
    print(f"Anderson-Darling: A²={r['statistic']:.4f}, crit={r['critical_value_5%']:.4f} "
          f"→ {'Normal' if r['normal'] else 'NOT Normal'}")

    if 'dagostino_pearson' in results:
        r = results['dagostino_pearson']
        print(f"D'Agostino-Pearson: K²={r['statistic']:.4f}, p={r['p_value']:.4f} "
              f"→ {'Normal' if r['normal'] else 'NOT Normal'}")

# Test with different distributions
np.random.seed(42)

print("\n" + "="*60)
print("Testing NORMAL data:")
normal_data = np.random.normal(50, 10, 200)
print_normality_results(comprehensive_normality_test(normal_data))

print("\n" + "="*60)
print("Testing EXPONENTIAL data (should reject normality):")
exp_data = np.random.exponential(10, 200)
print_normality_results(comprehensive_normality_test(exp_data))
```

With large samples, normality tests will reject almost any real data (which is never exactly Gaussian). Focus on whether departures from normality are practically significant. The CLT often makes the normal approximation work even with non-normal data. Visual inspection via Q-Q plots is often more informative than p-values for large n.
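Since visual inspection is recommended for large samples, here is a minimal Q-Q plot sketch using `scipy.stats.probplot` (the sample sizes and parameters mirror the example above and are otherwise arbitrary):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

np.random.seed(42)
normal_data = np.random.normal(50, 10, 200)   # roughly Gaussian sample
exp_data = np.random.exponential(10, 200)     # right-skewed sample

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Q-Q plot: sample quantiles vs. theoretical N(0,1) quantiles.
# Gaussian data should hug the reference line; skewed data bends away from it.
stats.probplot(normal_data, dist='norm', plot=axes[0])
axes[0].set_title('Q-Q plot: Normal(50, 10²) sample')

stats.probplot(exp_data, dist='norm', plot=axes[1])
axes[1].set_title('Q-Q plot: Exponential(scale=10) sample')

plt.tight_layout()
plt.show()
```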
The Gaussian distribution is the most important continuous distribution in machine learning and statistics. Let's consolidate our understanding:
What's Next:

We've covered discrete (Bernoulli/Binomial) and continuous (Gaussian) distributions. Next, we explore the Poisson distribution—the natural model for counting rare events over time or space. The Poisson connects to the Binomial (as a limit) and is fundamental to event modeling, queuing theory, and count-based machine learning problems.
You now have a comprehensive understanding of the Gaussian distribution—from its mathematical definition through the Central Limit Theorem to its foundational role in machine learning. The Gaussian's mathematical tractability and theoretical justification make it the default choice for modeling continuous uncertainty.