If you could learn only one probability distribution, it should be the Gaussian distribution—also known as the Normal distribution. Named after Carl Friedrich Gauss (1777–1855), who used it to analyze astronomical data, this distribution appears with remarkable ubiquity across science, engineering, and machine learning.
Why is the Gaussian so special? The answer lies in the Central Limit Theorem: the sum of many independent random variables (regardless of their original distributions) tends toward a Gaussian. This explains why measurement errors, biological traits, stock returns, and countless other phenomena follow bell-shaped curves—they are aggregates of many small, independent effects.
In machine learning, the Gaussian is foundational: it underlies least-squares regression, Gaussian naive Bayes, Gaussian processes, and the latent spaces of variational autoencoders.
Mastering the Gaussian distribution is essential for any serious machine learning practitioner.
By the end of this page, you will understand the Gaussian distribution from first principles: its mathematical definition, key properties, the standard normal and z-scores, the Central Limit Theorem, parameter estimation via MLE, and its pervasive applications throughout machine learning.
A continuous random variable X follows a Gaussian (Normal) distribution with mean μ and variance σ², written X ~ N(μ, σ²), if its probability density function (PDF) is:
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
Equivalently, using the standard notation:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$
where μ is the mean, σ² is the variance, and σ = √σ² is the standard deviation.
The Gaussian PDF has two parts: (1) The normalizing constant 1/(σ√(2π)) ensures the total area under the curve equals 1. (2) The exponential term exp(-(x-μ)²/(2σ²)) creates the characteristic bell shape, with the quadratic in the exponent producing symmetry around μ and decay as x moves away from μ.
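These two parts can be checked directly in code—a small sketch (the parameters μ = 2, σ = 1.5 are arbitrary) comparing the hand-written PDF against scipy and confirming unit area:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

mu, sigma = 2.0, 1.5

def gaussian_pdf(x, mu, sigma):
    # (1) normalizing constant ensures the total area equals 1
    norm_const = 1.0 / (sigma * np.sqrt(2.0 * np.pi))
    # (2) exponential term produces the bell shape around μ
    return norm_const * np.exp(-0.5 * ((x - mu) / sigma) ** 2)

# Agrees with scipy's implementation at an arbitrary point
assert np.isclose(gaussian_pdf(3.7, mu, sigma), stats.norm.pdf(3.7, mu, sigma))

# Total area under the curve is 1
area, _ = quad(gaussian_pdf, -np.inf, np.inf, args=(mu, sigma))
print(f"Total area under the PDF: {area:.6f}")
```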
The Gaussian is uniquely characterized by several mathematical properties:
1. Maximum Entropy: Among all distributions with specified mean μ and variance σ², the Gaussian has maximum entropy. It makes the fewest assumptions beyond the first two moments.
$$H(X) = \frac{1}{2} \ln(2\pi e \sigma^2)$$
2. Central Limit Theorem Limit: The sum of many independent random variables converges to a Gaussian.
3. Stability Under Linear Transformations: If X ~ N(μ, σ²), then aX + b ~ N(aμ + b, a²σ²).
4. Closure Under Convolution: The sum of independent Gaussians is Gaussian.
5. Zero Correlation Implies Independence: For jointly Gaussian variables, uncorrelated implies independent (unique among distributions).
These properties make the Gaussian mathematically tractable and theoretically justified as a default assumption when we know only the mean and variance.
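Property 3 is easy to verify by simulation. A quick sketch (parameter values arbitrary) that scales and shifts Gaussian samples and compares the empirical moments with the formula:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 5.0, 2.0
a, b = 3.0, -1.0

x = rng.normal(mu, sigma, 1_000_000)
y = a * x + b  # property 3: aX + b ~ N(aμ + b, a²σ²)

print(f"mean(Y) ≈ {y.mean():.3f} (theory: {a * mu + b})")
print(f"var(Y)  ≈ {y.var():.3f} (theory: {a**2 * sigma**2})")
```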
| Property | Formula/Value | Interpretation |
|---|---|---|
| Support | x ∈ (-∞, +∞) | Unbounded; assigns small probability to extreme values |
| Mean | E[X] = μ | Center of the distribution |
| Variance | Var(X) = σ² | Spread around the mean |
| Mode | μ | Most likely value (peak of PDF) |
| Median | μ | 50th percentile |
| Skewness | 0 | Perfectly symmetric around μ |
| Kurtosis | 3 (excess = 0) | Reference for 'normal' tail behavior |
| MGF | exp(μt + σ²t²/2) | Moment generating function |
| Entropy | ½ ln(2πeσ²) | Differential entropy (in nats) |
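Most of the tabulated values can be confirmed against scipy's frozen-distribution API; a brief check with arbitrary parameters:

```python
import numpy as np
from scipy import stats

mu, sigma = 10.0, 3.0
X = stats.norm(mu, sigma)

# Mean, variance, skewness, excess kurtosis
m, v, skew, kurt = X.stats(moments='mvsk')
assert np.isclose(m, mu) and np.isclose(v, sigma**2)
assert np.isclose(skew, 0.0) and np.isclose(kurt, 0.0)

# Mean, median, and mode all coincide at μ
assert np.isclose(X.median(), mu)

# Differential entropy: ½ ln(2πeσ²), in nats
assert np.isclose(X.entropy(), 0.5 * np.log(2 * np.pi * np.e * sigma**2))
print("All tabulated properties verified against scipy.")
```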
The Standard Normal Distribution is the special case with μ = 0 and σ² = 1, denoted Z ~ N(0, 1).
PDF of Standard Normal:
$$\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$$
CDF of Standard Normal:
$$\Phi(z) = \int_{-\infty}^{z} \phi(t) \, dt = \frac{1}{2}\left[1 + \text{erf}\left(\frac{z}{\sqrt{2}}\right)\right]$$
where erf is the error function.
Any Gaussian can be converted to the standard normal via standardization:
If X ~ N(μ, σ²), then:
$$Z = \frac{X - \mu}{\sigma} \sim N(0, 1)$$
Conversely, if Z ~ N(0, 1):
$$X = \mu + \sigma Z \sim N(\mu, \sigma^2)$$
The z-score tells us how many standard deviations a value is from the mean.
For any normal distribution:
| Interval | Probability | Interpretation |
|---|---|---|
| μ ± 1σ | 68.27% | About 2/3 of data falls within 1 SD |
| μ ± 2σ | 95.45% | About 95% within 2 SDs |
| μ ± 3σ | 99.73% | Nearly all within 3 SDs |
| μ ± 4σ | 99.994% | Extreme outliers beyond 4 SDs |
This rule provides quick mental estimates. If you see a value more than 3 standard deviations from the mean, it's exceptionally rare under normality assumptions.
```python
import numpy as np
from scipy import stats

# Standard normal distribution
standard_normal = stats.norm(loc=0, scale=1)

# Key quantiles
print("Standard Normal Distribution Z ~ N(0,1)")
print("=" * 50)

# Probability within k standard deviations
for k in [1, 2, 3, 4]:
    prob = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"P(-{k} < Z < {k}) = {prob:.6f} ({prob*100:.3f}%)")

print()

# Critical values for confidence intervals
confidence_levels = [0.90, 0.95, 0.99]
print("Critical z-values for two-tailed confidence intervals:")
for cl in confidence_levels:
    alpha = 1 - cl
    z_critical = standard_normal.ppf(1 - alpha/2)
    print(f"  {cl*100:.0f}% CI: z = ±{z_critical:.4f}")

print()

# Converting between X ~ N(μ,σ²) and Z ~ N(0,1)
mu, sigma = 100, 15  # Example: IQ scores
x = 130  # A specific IQ score

z = (x - mu) / sigma  # Standardize
prob_below = standard_normal.cdf(z)
print(f"IQ scores: X ~ N({mu}, {sigma}²)")
print(f"IQ = {x} → z-score = {z:.2f}")
print(f"P(IQ < {x}) = P(Z < {z:.2f}) = {prob_below:.4f}")
print(f"Percentile: {prob_below*100:.1f}th")
```

The standard normal serves as a universal reference. Any question about a Gaussian N(μ,σ²) can be answered by standardizing to N(0,1) and using precomputed tables or functions. This is why software libraries need only implement the standard normal—all other Gaussians follow from linear transformation.
The Central Limit Theorem (CLT) is one of the most profound results in probability theory, explaining why the Gaussian distribution is so prevalent. It states that the sum (or average) of many independent random variables tends toward a Gaussian, regardless of the original distributions.
Let X₁, X₂, ..., Xₙ be independent and identically distributed (i.i.d.) random variables with mean E[Xᵢ] = μ and finite variance Var(Xᵢ) = σ² < ∞.
Define the sample mean: $$\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$$
Then as n → ∞:
$$\sqrt{n}\left(\bar{X}_n - \mu\right) \stackrel{d}{\rightarrow} N(0, \sigma^2)$$
Equivalently, the standardized sum converges to a standard normal:
$$Z_n = \frac{\sum_{i=1}^n X_i - n\mu}{\sigma\sqrt{n}} = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \stackrel{d}{\rightarrow} N(0, 1)$$
The CLT is remarkable because it doesn't matter what distribution the Xᵢ come from—uniform, exponential, Poisson, Bernoulli, or any other with finite variance. The sum always tends toward Gaussian. This universality explains the Gaussian's ubiquity in natural phenomena, which are often the result of many small, independent contributions.
1. Finite Sample Approximation: For finite n, we approximate:
$$\bar{X}_n \approx N\left(\mu, \frac{\sigma^2}{n}\right)$$
The approximation improves as n grows. Rule of thumb: n ≥ 30 often suffices, but more skewed distributions need larger n.
2. Binomial to Normal: For X ~ Binomial(n, p) with large n:
$$X \approx N(np, np(1-p))$$
3. Sample Mean Distribution: Even if individual observations are non-normal, the sampling distribution of the mean is approximately normal for large samples.
4. Confidence Intervals: The CLT justifies using z-intervals and t-intervals for population means.
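Implication 2 can be checked numerically. A sketch comparing exact Binomial probabilities with the normal approximation (n = 100 and p = 0.3 are arbitrary; the +0.5 continuity correction is a standard refinement):

```python
import numpy as np
from scipy import stats

# X ~ Binomial(n=100, p=0.3) vs. its normal approximation N(np, np(1-p))
n, p = 100, 0.3
binom = stats.binom(n, p)
approx = stats.norm(n * p, np.sqrt(n * p * (1 - p)))

# Compare P(X ≤ k); the +0.5 continuity correction sharpens the approximation
for k in [20, 30, 40]:
    print(f"P(X ≤ {k}): exact = {binom.cdf(k):.4f}, "
          f"normal approx = {approx.cdf(k + 0.5):.4f}")
```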
The Berry-Esseen theorem quantifies how fast the CLT convergence occurs:
$$\sup_z |P(Z_n \leq z) - \Phi(z)| \leq \frac{C \cdot \rho}{\sigma^3 \sqrt{n}}$$
where ρ = E[|X - μ|³] is the third absolute moment and C ≈ 0.4748. The convergence is O(1/√n).
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def demonstrate_clt(original_dist, dist_name, n_samples=10000):
    """
    Demonstrate CLT by showing how sample means converge to Gaussian.
    """
    sample_sizes = [1, 2, 5, 10, 30, 100]
    fig, axes = plt.subplots(2, 3, figsize=(14, 8))

    for ax, n in zip(axes.flatten(), sample_sizes):
        # Generate n_samples sets of n observations each
        data = original_dist.rvs(size=(n_samples, n))
        sample_means = data.mean(axis=1)

        # True parameters of sample mean distribution
        true_mean = original_dist.mean()
        true_std = original_dist.std() / np.sqrt(n)

        # Plot histogram of sample means
        ax.hist(sample_means, bins=50, density=True, alpha=0.7,
                color='steelblue', edgecolor='navy')

        # Overlay the normal approximation
        x = np.linspace(sample_means.min(), sample_means.max(), 200)
        normal_pdf = stats.norm.pdf(x, true_mean, true_std)
        ax.plot(x, normal_pdf, 'r-', linewidth=2, label='Normal approx.')

        ax.set_title(f'n = {n}')
        ax.set_xlabel('Sample Mean')
        ax.legend()

    plt.suptitle(f'CLT Demonstration: {dist_name}', fontsize=14)
    plt.tight_layout()
    plt.show()

# Demonstrate with highly non-normal distributions

# 1. Exponential (skewed right)
print("Exponential Distribution (λ=1):")
exponential = stats.expon(scale=1)
print(f"  Mean: {exponential.mean():.2f}, Variance: {exponential.var():.2f}")
print(f"  Skewness: {float(exponential.stats(moments='s')):.2f} (highly skewed)")
# demonstrate_clt(exponential, "Exponential(λ=1)")

# 2. Uniform (symmetric, non-normal)
print("Uniform[0,1] Distribution:")
uniform = stats.uniform(0, 1)
print(f"  Mean: {uniform.mean():.2f}, Variance: {uniform.var():.4f}")
# demonstrate_clt(uniform, "Uniform[0,1]")

# 3. Bernoulli (discrete)
print("Bernoulli(p=0.3) Distribution:")
bernoulli = stats.bernoulli(0.3)
print(f"  Mean: {bernoulli.mean():.2f}, Variance: {bernoulli.var():.4f}")

# Manual CLT demonstration
np.random.seed(42)
n = 50  # Sample size
n_experiments = 10000

# Draw n_experiments sample means, each from n observations
bernoulli_means = np.random.binomial(n, 0.3, n_experiments) / n

# Compare to normal
mu = 0.3
sigma = np.sqrt(0.3 * 0.7 / n)
print("For n=50 Bernoulli samples:")
print(f"  μ_X̄ = {mu:.3f}, σ_X̄ = {sigma:.4f}")
print(f"  Observed: mean = {bernoulli_means.mean():.4f}, std = {bernoulli_means.std():.4f}")
```

Given observations x₁, x₂, ..., xₙ from a Gaussian distribution, we estimate the unknown parameters μ and σ² using Maximum Likelihood Estimation (MLE).
The likelihood of n i.i.d. Gaussian observations is:
$$L(\mu, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)$$
The log-likelihood is:
$$\ell(\mu, \sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2$$
Taking the derivative with respect to μ:
$$\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^n(x_i-\mu) = 0$$
Solving: $$\sum_{i=1}^n x_i - n\mu = 0 \implies \hat{\mu}_{MLE} = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x}$$
The MLE for μ is the sample mean.
Taking the derivative with respect to σ²:
$$\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n(x_i-\mu)^2 = 0$$
Solving: $$\hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^n(x_i-\bar{x})^2$$
The MLE for σ² is the sample variance (with n denominator).
Interestingly, the MLE for variance is biased:
$$E[\hat{\sigma}^2_{MLE}] = \frac{n-1}{n}\sigma^2 \neq \sigma^2$$
The unbiased estimator divides by (n-1) instead:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^n(x_i-\bar{x})^2$$
This is called Bessel's correction. The intuition: we use one degree of freedom to estimate μ, leaving only n-1 for estimating spread around it.
For large n, the difference is negligible.
| Parameter | MLE Estimator | Unbiased Estimator | Properties |
|---|---|---|---|
| μ (mean) | x̄ = (1/n)Σxᵢ | Same (unbiased) | Var(x̄) = σ²/n |
| σ² (variance) | (1/n)Σ(xᵢ-x̄)² | s² = (1/(n-1))Σ(xᵢ-x̄)² | MLE biased by factor (n-1)/n |
| σ (std dev) | √[(1/n)Σ(xᵢ-x̄)²] | √s² (still slightly biased) | Unbiasing σ is more complex |
```python
import numpy as np
from scipy import stats

def gaussian_mle(samples: np.ndarray) -> dict:
    """
    Maximum Likelihood Estimation for Gaussian parameters.
    Returns both MLE estimates and unbiased estimates with confidence intervals.
    """
    n = len(samples)

    # MLE estimates
    mu_mle = np.mean(samples)
    var_mle = np.var(samples, ddof=0)  # ddof=0 for MLE (divide by n)

    # Unbiased estimates
    var_unbiased = np.var(samples, ddof=1)  # ddof=1 for unbiased (divide by n-1)
    sigma_unbiased = np.std(samples, ddof=1)

    # Standard error of the mean
    se_mean = sigma_unbiased / np.sqrt(n)

    # Confidence intervals for μ using t-distribution
    t_critical = stats.t.ppf(0.975, df=n-1)  # 95% CI
    ci_mean = (mu_mle - t_critical * se_mean, mu_mle + t_critical * se_mean)

    # Confidence interval for σ² using chi-squared distribution
    # (n-1)s²/σ² ~ χ²(n-1)
    chi2_lower = stats.chi2.ppf(0.025, df=n-1)
    chi2_upper = stats.chi2.ppf(0.975, df=n-1)
    ci_var = ((n-1) * var_unbiased / chi2_upper,
              (n-1) * var_unbiased / chi2_lower)

    return {
        'n': n,
        'mu_mle': mu_mle,
        'sigma2_mle': var_mle,
        'sigma2_unbiased': var_unbiased,
        'sigma_unbiased': sigma_unbiased,
        'se_mean': se_mean,
        'ci_mean_95': ci_mean,
        'ci_var_95': ci_var
    }

# Example: Estimate parameters from data
np.random.seed(42)
true_mu, true_sigma = 50, 10
n = 100
data = np.random.normal(true_mu, true_sigma, n)

results = gaussian_mle(data)
print(f"True parameters: μ = {true_mu}, σ = {true_sigma}, σ² = {true_sigma**2}")
print("MLE estimates:")
print(f"  μ̂ = {results['mu_mle']:.4f}")
print(f"  σ̂² (MLE) = {results['sigma2_mle']:.4f}")
print(f"  σ̂² (unbiased) = {results['sigma2_unbiased']:.4f}")
print(f"95% CI for μ: [{results['ci_mean_95'][0]:.4f}, {results['ci_mean_95'][1]:.4f}]")
print(f"95% CI for σ²: [{results['ci_var_95'][0]:.4f}, {results['ci_var_95'][1]:.4f}]")

# Demonstrate bias in variance estimation
print("--- Bias Demonstration ---")
n_simulations = 10000
mle_vars = []
unbiased_vars = []

for _ in range(n_simulations):
    sample = np.random.normal(true_mu, true_sigma, n)
    mle_vars.append(np.var(sample, ddof=0))
    unbiased_vars.append(np.var(sample, ddof=1))

print(f"True σ² = {true_sigma**2}")
print(f"E[σ̂²_MLE] ≈ {np.mean(mle_vars):.4f} (biased)")
print(f"E[s²] ≈ {np.mean(unbiased_vars):.4f} (unbiased)")
print(f"Expected MLE bias: {(n-1)/n * true_sigma**2:.4f}")
```

The Gaussian distribution has remarkable closure properties that make it incredibly tractable for mathematical analysis and practical applications.
If X ~ N(μ, σ²) and Y = aX + b, then:
$$Y \sim N(a\mu + b, a^2\sigma^2)$$
This means scaling and shifting a Gaussian yields another Gaussian. This is fundamental to standardization and z-scores.
If X₁ ~ N(μ₁, σ₁²) and X₂ ~ N(μ₂, σ₂²) are independent, then:
$$X_1 + X_2 \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$$
More generally, for independent Xᵢ ~ N(μᵢ, σᵢ²):
$$\sum_{i=1}^n a_i X_i \sim N\left(\sum_{i=1}^n a_i \mu_i, \sum_{i=1}^n a_i^2 \sigma_i^2\right)$$
Key insight: Variances add, standard deviations do not!
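A quick simulation of this insight (the parameter values are arbitrary): summing N(1, 9) and N(2, 16) samples yields a standard deviation near √25 = 5, not 3 + 4 = 7.

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(1.0, 3.0, 1_000_000)  # X1 ~ N(1, 9)
x2 = rng.normal(2.0, 4.0, 1_000_000)  # X2 ~ N(2, 16)
s = x1 + x2                           # X1 + X2 ~ N(3, 25)

print(f"mean ≈ {s.mean():.3f} (theory: 3)")
print(f"std  ≈ {s.std():.3f} (theory: √25 = 5, not 3 + 4 = 7)")
```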
The product of two Gaussian PDFs (up to normalization) is another Gaussian. This is crucial for Bayesian inference:
If we have a prior N(μ₁, σ₁²) and likelihood N(μ₂, σ₂²), the posterior is N(μₚ, σₚ²) where:
$$\sigma_p^2 = \left(\frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2}\right)^{-1}$$
$$\mu_p = \sigma_p^2 \left(\frac{\mu_1}{\sigma_1^2} + \frac{\mu_2}{\sigma_2^2}\right)$$
This is the precision-weighted average of the means, where precision = 1/variance.
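These two formulas translate directly into code. A minimal sketch (the prior and likelihood values are made up for illustration):

```python
def gaussian_posterior(mu1, var1, mu2, var2):
    """Combine prior N(mu1, var1) with likelihood N(mu2, var2)."""
    var_p = 1.0 / (1.0 / var1 + 1.0 / var2)   # posterior variance
    mu_p = var_p * (mu1 / var1 + mu2 / var2)  # precision-weighted mean
    return mu_p, var_p

# Hypothetical example: prior belief N(0, 4), measurement likelihood N(2, 1)
mu_p, var_p = gaussian_posterior(0.0, 4.0, 2.0, 1.0)
print(f"Posterior: N({mu_p:.2f}, {var_p:.2f})")
```

The posterior mean lands closer to the likelihood's mean because the likelihood has the smaller variance (higher precision).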
For jointly Gaussian variables (X, Y), the conditional distribution Y|X is also Gaussian:
$$Y | X = x \sim N\left(\mu_Y + \rho \frac{\sigma_Y}{\sigma_X}(x - \mu_X), \sigma_Y^2(1-\rho^2)\right)$$
where ρ is the correlation coefficient. This property is fundamental to Gaussian processes and linear regression.
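This formula can be verified by sampling the joint distribution and conditioning on a thin slice around a fixed x—a simulation sketch with arbitrary parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
mu_x, mu_y = 0.0, 0.0
sigma_x, sigma_y, rho = 1.0, 2.0, 0.8

# Sample from the joint Gaussian
cov = [[sigma_x**2, rho * sigma_x * sigma_y],
       [rho * sigma_x * sigma_y, sigma_y**2]]
xy = rng.multivariate_normal([mu_x, mu_y], cov, size=1_000_000)
x, y = xy[:, 0], xy[:, 1]

# Condition on X ≈ 1 by keeping samples in a thin slice
mask = np.abs(x - 1.0) < 0.01
cond_mean_theory = mu_y + rho * (sigma_y / sigma_x) * (1.0 - mu_x)  # = 1.6
cond_var_theory = sigma_y**2 * (1 - rho**2)                         # = 1.44
print(f"E[Y|X=1]:   empirical {y[mask].mean():.3f}, theory {cond_mean_theory}")
print(f"Var[Y|X=1]: empirical {y[mask].var():.3f}, theory {cond_var_theory}")
```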
| Operation | Result | Formula |
|---|---|---|
| Linear transform Y = aX + b | Gaussian | N(aμ + b, a²σ²) |
| Sum of independent Gaussians | Gaussian | Means add, variances add |
| Product of PDFs | Gaussian (unnormalized) | Precision-weighted combination |
| Conditional (joint Gaussian) | Gaussian | Linear regression formula |
| Marginal (joint Gaussian) | Gaussian | Extract relevant parameters |
| Convolution | Gaussian | Variances add |
The Gaussian family is closed under almost every operation we care about: linear combinations, conditioning, marginalization, convolution, and conjugate Bayesian updating. This mathematical closure is why Gaussian assumptions are so common—they lead to analytically tractable solutions. When exact solutions aren't possible, Gaussian approximations (like the Laplace approximation) are often the first resort.
The Gaussian distribution is woven into the fabric of machine learning. Here we explore its key applications in depth.
The standard linear regression model assumes:
$$y = \mathbf{w}^T \mathbf{x} + \epsilon, \quad \epsilon \sim N(0, \sigma^2)$$
This implies: $$y | \mathbf{x} \sim N(\mathbf{w}^T \mathbf{x}, \sigma^2)$$
The log-likelihood for n observations is:
$$\ell(\mathbf{w}, \sigma^2) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n(y_i - \mathbf{w}^T\mathbf{x}_i)^2$$
Maximizing with respect to w is equivalent to minimizing the sum of squared errors—the famous least squares criterion emerges naturally from Gaussian noise assumptions!
```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp

# 1. LINEAR REGRESSION FROM GAUSSIAN LIKELIHOOD
def linear_regression_mle(X, y):
    """
    Derive linear regression from Gaussian noise assumption.

    y = X @ w + ε, where ε ~ N(0, σ²I)
    MLE for w: minimize ||y - Xw||²
    Solution: w = (X'X)^{-1} X'y
    """
    # Add bias term
    X_bias = np.column_stack([np.ones(len(X)), X])

    # MLE (equivalent to OLS)
    w_mle = np.linalg.lstsq(X_bias, y, rcond=None)[0]

    # Predicted values
    y_pred = X_bias @ w_mle

    # MLE for σ²
    residuals = y - y_pred
    sigma2_mle = np.mean(residuals ** 2)

    # Log-likelihood at the MLE
    n = len(y)
    log_likelihood = -n/2 * np.log(2 * np.pi * sigma2_mle) - n/2

    return {
        'weights': w_mle,
        'sigma2': sigma2_mle,
        'log_likelihood': log_likelihood,
        'predictions': y_pred
    }

# 2. GAUSSIAN NAIVE BAYES
class GaussianNaiveBayes:
    """
    Gaussian Naive Bayes classifier.
    Assumes features are Gaussian given class: P(x_j | y=c) = N(μ_jc, σ²_jc)
    """
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.n_features = X.shape[1]

        # Store class priors and parameters
        self.priors = {}
        self.means = {}
        self.variances = {}

        for c in self.classes:
            X_c = X[y == c]
            self.priors[c] = len(X_c) / len(X)
            self.means[c] = X_c.mean(axis=0)
            self.variances[c] = X_c.var(axis=0) + 1e-9  # Smoothing

    def _log_likelihood(self, x, c):
        """Compute log P(x | y=c) assuming Gaussian features."""
        log_prob = 0
        for j in range(self.n_features):
            log_prob += stats.norm.logpdf(
                x[j], self.means[c][j], np.sqrt(self.variances[c][j])
            )
        return log_prob

    def predict_proba(self, X):
        """Compute posterior probabilities P(y=c | x)."""
        log_posteriors = np.zeros((len(X), len(self.classes)))
        for i, x in enumerate(X):
            for j, c in enumerate(self.classes):
                log_posteriors[i, j] = (np.log(self.priors[c])
                                        + self._log_likelihood(x, c))

        # Normalize in log space using log-sum-exp for numerical stability
        log_norm = logsumexp(log_posteriors, axis=1, keepdims=True)
        posteriors = np.exp(log_posteriors - log_norm)
        return posteriors

    def predict(self, X):
        return self.classes[self.predict_proba(X).argmax(axis=1)]

# Example usage
np.random.seed(42)
X = np.vstack([
    np.random.normal([0, 0], [1, 1], (50, 2)),
    np.random.normal([3, 3], [1, 1], (50, 2))
])
y = np.array([0]*50 + [1]*50)

gnb = GaussianNaiveBayes()
gnb.fit(X, y)
accuracy = (gnb.predict(X) == y).mean()
print(f"Gaussian Naive Bayes accuracy: {accuracy:.2%}")
```

Gaussian Processes (GPs) define distributions over functions, where any finite collection of function values is jointly Gaussian:
$$[f(x_1), f(x_2), \ldots, f(x_n)]^T \sim N(\mathbf{m}, \mathbf{K})$$
where m is the mean vector with entries mᵢ = m(xᵢ), and K is the covariance (kernel) matrix with entries Kᵢⱼ = k(xᵢ, xⱼ).
GPs provide nonparametric regression with closed-form posteriors and principled uncertainty estimates for every prediction.
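A minimal sketch of this definition: pick a kernel (the squared-exponential below is one common choice, assumed here), build the covariance matrix over a grid of inputs, and draw function samples from the resulting multivariate Gaussian.

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance: k(x, x') = σ² exp(-(x-x')²/2ℓ²)."""
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)

# Any finite set of inputs induces a joint Gaussian over function values
x = np.linspace(0, 5, 50)
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))  # jitter for numerical stability

rng = np.random.default_rng(3)
f_samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
print(f"Drew {f_samples.shape[0]} function samples at {f_samples.shape[1]} inputs")
```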
VAEs use Gaussian distributions in the latent space: the prior over latent codes is typically N(0, I), and the encoder outputs the mean and variance of an approximate Gaussian posterior.
The "reparameterization trick" enables backpropagation through Gaussian sampling:
$$z = \mu + \sigma \cdot \epsilon, \quad \epsilon \sim N(0, I)$$
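The trick can be illustrated outside any deep learning framework. A NumPy sketch (the encoder outputs μ and log σ² are hypothetical values) showing that z = μ + σ·ε has the intended distribution:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical encoder outputs for one data point
mu, log_var = 0.5, np.log(0.25)  # σ² = 0.25, so σ = 0.5
sigma = np.exp(0.5 * log_var)

# z = μ + σ·ε: randomness is isolated in ε ~ N(0, 1), so gradients
# can flow through μ and σ in an autodiff framework
eps = rng.standard_normal(100_000)
z = mu + sigma * eps

print(f"mean(z) ≈ {z.mean():.3f} (target μ = {mu})")
print(f"std(z)  ≈ {z.std():.3f} (target σ = {sigma})")
```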
Many statistical methods assume normality. Before applying such methods, we should verify whether the assumption is reasonable. Here are the main approaches:
1. Histogram: Compare the empirical distribution to the theoretical Gaussian bell curve.
2. Q-Q Plot (Quantile-Quantile): Plot sample quantiles against theoretical Gaussian quantiles. If data is Gaussian, points lie on a straight line.
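A Q-Q plot sketch using scipy's `probplot` (sample sizes and distributions chosen arbitrarily) that contrasts Gaussian data with skewed data:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
normal_data = rng.normal(0, 1, 300)
skewed_data = rng.exponential(1, 300)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
stats.probplot(normal_data, dist='norm', plot=axes[0])
axes[0].set_title('Gaussian data: points hug the line')
stats.probplot(skewed_data, dist='norm', plot=axes[1])
axes[1].set_title('Exponential data: curved tails')
plt.tight_layout()
plt.show()
```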
Shapiro-Wilk Test: Most powerful for small to moderate samples (n < 5000).
Kolmogorov-Smirnov Test: Compares empirical CDF to theoretical normal CDF.
Anderson-Darling Test: Similar to K-S but gives more weight to tails.
D'Agostino-Pearson Test: Combines skewness and kurtosis tests.
```python
import numpy as np
from scipy import stats

def comprehensive_normality_test(data, alpha=0.05):
    """
    Perform comprehensive statistical normality testing.
    """
    n = len(data)
    results = {'n': n, 'alpha': alpha}

    # Descriptive statistics
    results['mean'] = np.mean(data)
    results['std'] = np.std(data, ddof=1)
    results['skewness'] = stats.skew(data)
    results['kurtosis'] = stats.kurtosis(data)  # Excess kurtosis (0 for normal)

    # Statistical tests
    # 1. Shapiro-Wilk (best for n < 5000)
    if n <= 5000:
        stat, p = stats.shapiro(data)
        results['shapiro_wilk'] = {'statistic': stat, 'p_value': p, 'normal': p > alpha}

    # 2. Kolmogorov-Smirnov (against standard normal after standardization)
    standardized = (data - np.mean(data)) / np.std(data, ddof=1)
    stat, p = stats.kstest(standardized, 'norm')
    results['kolmogorov_smirnov'] = {'statistic': stat, 'p_value': p, 'normal': p > alpha}

    # 3. Anderson-Darling
    result = stats.anderson(data, dist='norm')
    # Use 5% significance level (index 2)
    results['anderson_darling'] = {
        'statistic': result.statistic,
        'critical_value_5%': result.critical_values[2],
        'normal': result.statistic < result.critical_values[2]
    }

    # 4. D'Agostino-Pearson (requires n >= 20)
    if n >= 20:
        stat, p = stats.normaltest(data)
        results['dagostino_pearson'] = {'statistic': stat, 'p_value': p, 'normal': p > alpha}

    return results

def print_normality_results(results):
    print(f"Normality Test Results (n = {results['n']}, α = {results['alpha']})")
    print("=" * 60)
    print(f"Skewness: {results['skewness']:.4f} (0 for normal)")
    print(f"Excess Kurtosis: {results['kurtosis']:.4f} (0 for normal)")
    print("-" * 60)

    if 'shapiro_wilk' in results:
        r = results['shapiro_wilk']
        print(f"Shapiro-Wilk: W={r['statistic']:.4f}, p={r['p_value']:.4f} "
              f"→ {'Normal' if r['normal'] else 'NOT Normal'}")

    r = results['kolmogorov_smirnov']
    print(f"Kolmogorov-Smirnov: D={r['statistic']:.4f}, p={r['p_value']:.4f} "
          f"→ {'Normal' if r['normal'] else 'NOT Normal'}")

    r = results['anderson_darling']
    print(f"Anderson-Darling: A²={r['statistic']:.4f}, crit={r['critical_value_5%']:.4f} "
          f"→ {'Normal' if r['normal'] else 'NOT Normal'}")

    if 'dagostino_pearson' in results:
        r = results['dagostino_pearson']
        print(f"D'Agostino-Pearson: K²={r['statistic']:.4f}, p={r['p_value']:.4f} "
              f"→ {'Normal' if r['normal'] else 'NOT Normal'}")

# Test with different distributions
np.random.seed(42)

print("\n" + "="*60)
print("Testing NORMAL data:")
normal_data = np.random.normal(50, 10, 200)
print_normality_results(comprehensive_normality_test(normal_data))

print("\n" + "="*60)
print("Testing EXPONENTIAL data (should reject normality):")
exp_data = np.random.exponential(10, 200)
print_normality_results(comprehensive_normality_test(exp_data))
```

With large samples, normality tests will reject almost any real data (which is never exactly Gaussian). Focus on whether departures from normality are practically significant. The CLT often makes the normal approximation work even with non-normal data. Visual inspection via Q-Q plots is often more informative than p-values for large n.
The Gaussian distribution is the most important continuous distribution in machine learning and statistics.
What's Next:
We've covered discrete (Bernoulli/Binomial) and continuous (Gaussian) distributions. Next, we explore the Poisson distribution—the natural model for counting rare events over time or space. The Poisson connects to the Binomial (as a limit) and is fundamental to event modeling, queuing theory, and count-based machine learning problems.
You now have a comprehensive understanding of the Gaussian distribution—from its mathematical definition through the Central Limit Theorem to its foundational role in machine learning. The Gaussian's mathematical tractability and theoretical justification make it the default choice for modeling continuous uncertainty.