If you could learn only one probability distribution, it should be the Gaussian distribution—also known as the Normal distribution. Named after Carl Friedrich Gauss (1777–1855), who used it to analyze astronomical data, this distribution appears with remarkable ubiquity across science, engineering, and machine learning.

Why is the Gaussian so special? The answer lies in the Central Limit Theorem: the sum of many independent random variables (regardless of their original distributions) tends toward a Gaussian. This explains why measurement errors, biological traits, stock returns, and countless other phenomena follow bell-shaped curves—they are aggregates of many small, independent effects.

In machine learning, the Gaussian is foundational:

- Linear regression assumes Gaussian noise
- Gaussian Processes define distributions over functions
- Variational Autoencoders use Gaussian latent spaces
- Batch Normalization transforms activations toward Gaussian
- Maximum likelihood estimation often leads to least-squares when noise is Gaussian

Mastering the Gaussian distribution is essential for any serious machine learning practitioner.
By the end of this page, you will understand the Gaussian distribution from first principles: its mathematical definition, key properties, the standard normal and z-scores, the Central Limit Theorem, parameter estimation via MLE, and its pervasive applications throughout machine learning.
A continuous random variable X follows a Gaussian (Normal) distribution with mean μ and variance σ², written X ~ N(μ, σ²), if its probability density function (PDF) is:

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Equivalently, using the standard notation:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$

where:

- μ (mu) is the mean or location parameter—the center of the distribution
- σ (sigma) is the standard deviation—controls the spread
- σ² is the variance
- The support is the entire real line: x ∈ (-∞, +∞)
The Gaussian PDF has two parts: (1) The normalizing constant 1/(σ√(2π)) ensures the total area under the curve equals 1. (2) The exponential term exp(-(x-μ)²/(2σ²)) creates the characteristic bell shape, with the quadratic in the exponent producing symmetry around μ and decay as x moves away from μ.
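To connect the formula to code, here is a short sketch (illustrative only; the parameters μ = 2, σ = 1.5 are arbitrary choices) that evaluates the PDF directly, checks it against `scipy.stats.norm.pdf`, and verifies numerically that the normalizing constant makes the total area equal 1:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

mu, sigma = 2.0, 1.5   # arbitrary example parameters

def gaussian_pdf(x, mu, sigma):
    """Evaluate the Gaussian PDF directly from its formula."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Agreement with SciPy's implementation
x = np.linspace(mu - 5 * sigma, mu + 5 * sigma, 7)
assert np.allclose(gaussian_pdf(x, mu, sigma), stats.norm.pdf(x, mu, sigma))

# The normalizing constant makes the total area equal 1
area, _ = quad(gaussian_pdf, -np.inf, np.inf, args=(mu, sigma))
print(f"Total area under the PDF: {area:.6f}")   # ≈ 1.000000

# The peak sits at x = μ, with height 1/(σ√(2π))
print(f"f(μ) = {gaussian_pdf(mu, mu, sigma):.4f}, "
      f"1/(σ√2π) = {1 / (sigma * np.sqrt(2 * np.pi)):.4f}")
```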
| Property | Formula/Value | Interpretation |
|---|---|---|
| Support | x ∈ (-∞, +∞) | Unbounded; assigns small probability to extreme values |
| Mean | E[X] = μ | Center of the distribution |
| Variance | Var(X) = σ² | Spread around the mean |
| Mode | μ | Most likely value (peak of PDF) |
| Median | μ | 50th percentile |
| Skewness | 0 | Perfectly symmetric around μ |
| Kurtosis | 3 (excess = 0) | Reference for 'normal' tail behavior |
| MGF | exp(μt + σ²t²/2) | Moment generating function |
| Entropy | ½ ln(2πeσ²) | Differential entropy in nats (natural log) |
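As a quick sanity check on this table, the following sketch (illustrative only; μ = 3, σ = 2 are arbitrary) compares SciPy's reported moments and entropy with the closed-form values above, and spot-checks the MGF at one point via Monte Carlo. Note that `scipy.stats.norm.entropy()` reports differential entropy in nats, matching ½ ln(2πeσ²):

```python
import numpy as np
from scipy import stats

mu, sigma = 3.0, 2.0                     # arbitrary example parameters
X = stats.norm(loc=mu, scale=sigma)

# First four moments reported by SciPy vs. the table
mean, var, skew, kurt = (float(v) for v in X.stats(moments='mvsk'))
print(f"Mean            : {mean:.4f}  (μ = {mu})")
print(f"Variance        : {var:.4f}  (σ² = {sigma**2})")
print(f"Skewness        : {skew:.4f}  (0 for any Gaussian)")
print(f"Excess kurtosis : {kurt:.4f}  (0 for any Gaussian)")

# Differential entropy: ½ ln(2πeσ²), in nats
entropy_formula = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
print(f"Entropy         : {float(X.entropy()):.4f} nats  (formula: {entropy_formula:.4f})")

# MGF at a test point t: E[exp(tX)] = exp(μt + σ²t²/2)
t = 0.7
mgf_formula = np.exp(mu * t + sigma**2 * t**2 / 2)
mgf_monte_carlo = np.exp(t * X.rvs(size=200_000, random_state=0)).mean()
print(f"MGF at t = {t}  : formula {mgf_formula:.3f}, Monte Carlo ≈ {mgf_monte_carlo:.3f}")
```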
The Standard Normal Distribution is the special case with μ = 0 and σ² = 1, denoted Z ~ N(0, 1).

PDF of the Standard Normal:

$$\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$$

CDF of the Standard Normal:

$$\Phi(z) = \int_{-\infty}^{z} \phi(t) \, dt = \frac{1}{2}\left[1 + \text{erf}\left(\frac{z}{\sqrt{2}}\right)\right]$$

where erf is the error function.

### Standardization (Z-Scores)

Any Gaussian can be converted to the standard normal via standardization:

If X ~ N(μ, σ²), then:

$$Z = \frac{X - \mu}{\sigma} \sim N(0, 1)$$

Conversely, if Z ~ N(0, 1):

$$X = \mu + \sigma Z \sim N(\mu, \sigma^2)$$

The z-score tells us how many standard deviations a value is from the mean.
```python
import numpy as np
from scipy import stats

# Standard normal distribution
standard_normal = stats.norm(loc=0, scale=1)

# Key quantiles
print("Standard Normal Distribution Z ~ N(0,1)")
print("=" * 50)

# Probability within k standard deviations
for k in [1, 2, 3, 4]:
    prob = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"P(-{k} < Z < {k}) = {prob:.6f} ({prob*100:.3f}%)")

print()

# Critical values for confidence intervals
confidence_levels = [0.90, 0.95, 0.99]
print("Critical z-values for two-tailed confidence intervals:")
for cl in confidence_levels:
    alpha = 1 - cl
    z_critical = standard_normal.ppf(1 - alpha / 2)
    print(f"  {cl*100:.0f}% CI: z = ±{z_critical:.4f}")

print()

# Converting between X ~ N(μ,σ²) and Z ~ N(0,1)
mu, sigma = 100, 15   # Example: IQ scores
x = 130               # A specific IQ score

z = (x - mu) / sigma  # Standardize
prob_below = standard_normal.cdf(z)
print(f"IQ scores: X ~ N({mu}, {sigma}²)")
print(f"IQ = {x} → z-score = {z:.2f}")
print(f"P(IQ < {x}) = P(Z < {z:.2f}) = {prob_below:.4f}")
print(f"Percentile: {prob_below*100:.1f}th")
```

The standard normal serves as a universal reference. Any question about a Gaussian N(μ,σ²) can be answered by standardizing to N(0,1) and using precomputed tables or functions. This is why software libraries need only implement the standard normal—all other Gaussians follow from linear transformation.
The Central Limit Theorem (CLT) is one of the most profound results in probability theory, explaining why the Gaussian distribution is so prevalent. It states that the sum (or average) of many independent random variables tends toward a Gaussian, regardless of the original distributions.

### Formal Statement

Let X₁, X₂, ..., Xₙ be independent and identically distributed (i.i.d.) random variables with:

- Mean: E[Xᵢ] = μ
- Variance: Var(Xᵢ) = σ² < ∞

Define the sample mean:

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$$

Then as n → ∞:

$$\sqrt{n}\left(\bar{X}_n - \mu\right) \stackrel{d}{\rightarrow} N(0, \sigma^2)$$

Equivalently, the standardized sum converges to a standard normal:

$$Z_n = \frac{\sum_{i=1}^n X_i - n\mu}{\sigma\sqrt{n}} = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \stackrel{d}{\rightarrow} N(0, 1)$$
The CLT is remarkable because it doesn't matter what distribution the Xᵢ come from—uniform, exponential, Poisson, Bernoulli, or any other with finite variance. The sum always tends toward Gaussian. This universality explains the Gaussian's ubiquity in natural phenomena, which are often the result of many small, independent contributions.
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def demonstrate_clt(original_dist, dist_name, n_samples=10000):
    """
    Demonstrate the CLT by showing how sample means converge to a Gaussian.
    """
    sample_sizes = [1, 2, 5, 10, 30, 100]
    fig, axes = plt.subplots(2, 3, figsize=(14, 8))

    for ax, n in zip(axes.flatten(), sample_sizes):
        # Generate n_samples sets of n observations each
        data = original_dist.rvs(size=(n_samples, n))
        sample_means = data.mean(axis=1)

        # True parameters of the sample-mean distribution
        true_mean = original_dist.mean()
        true_std = original_dist.std() / np.sqrt(n)

        # Plot histogram of sample means
        ax.hist(sample_means, bins=50, density=True, alpha=0.7,
                color='steelblue', edgecolor='navy')

        # Overlay the normal approximation
        x = np.linspace(sample_means.min(), sample_means.max(), 200)
        normal_pdf = stats.norm.pdf(x, true_mean, true_std)
        ax.plot(x, normal_pdf, 'r-', linewidth=2, label='Normal approx.')

        ax.set_title(f'n = {n}')
        ax.set_xlabel('Sample Mean')
        ax.legend()

    plt.suptitle(f'CLT Demonstration: {dist_name}', fontsize=14)
    plt.tight_layout()
    plt.show()

# Demonstrate with highly non-normal distributions

# 1. Exponential (skewed right)
print("Exponential Distribution (λ=1):")
exponential = stats.expon(scale=1)
print(f"  Mean: {exponential.mean():.2f}, Variance: {exponential.var():.2f}")
print(f"  Skewness: {float(exponential.stats(moments='s')):.2f} (highly skewed)")
# demonstrate_clt(exponential, "Exponential(λ=1)")

# 2. Uniform (symmetric, non-normal)
print("\nUniform[0,1] Distribution:")
uniform = stats.uniform(0, 1)
print(f"  Mean: {uniform.mean():.2f}, Variance: {uniform.var():.4f}")
# demonstrate_clt(uniform, "Uniform[0,1]")

# 3. Bernoulli (discrete)
print("\nBernoulli(p=0.3) Distribution:")
bernoulli = stats.bernoulli(0.3)
print(f"  Mean: {bernoulli.mean():.2f}, Variance: {bernoulli.var():.4f}")

# Manual CLT demonstration
np.random.seed(42)
n = 50                  # Sample size
n_experiments = 10000

# Draw n_experiments sample means, each from n observations
bernoulli_means = np.random.binomial(n, 0.3, n_experiments) / n

# Compare to the normal approximation
mu = 0.3
sigma = np.sqrt(0.3 * 0.7 / n)
print(f"\nFor n=50 Bernoulli samples:")
print(f"  μ_X̄ = {mu:.3f}, σ_X̄ = {sigma:.4f}")
print(f"  Observed: mean = {bernoulli_means.mean():.4f}, std = {bernoulli_means.std():.4f}")
```

Given observations x₁, x₂, ..., xₙ from a Gaussian distribution, we estimate the unknown parameters μ and σ² using Maximum Likelihood Estimation (MLE).

### Deriving the MLE

The likelihood of n i.i.d. Gaussian observations is:

$$L(\mu, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)$$

The log-likelihood is:

$$\ell(\mu, \sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2$$

### Solving for μ

Taking the derivative with respect to μ:

$$\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^n(x_i-\mu) = 0$$

Solving:

$$\sum_{i=1}^n x_i - n\mu = 0 \implies \hat{\mu}_{MLE} = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x}$$

The MLE for μ is the sample mean.
| Parameter | MLE Estimator | Unbiased Estimator | Properties |
|---|---|---|---|
| μ (mean) | x̄ = (1/n)Σxᵢ | Same (unbiased) | Var(x̄) = σ²/n |
| σ² (variance) | (1/n)Σ(xᵢ-x̄)² | s² = (1/(n-1))Σ(xᵢ-x̄)² | MLE biased by factor (n-1)/n |
| σ (std dev) | √[(1/n)Σ(xᵢ-x̄)²] | √s² (still slightly biased) | Unbiasing σ is more complex |
```python
import numpy as np
from scipy import stats

def gaussian_mle(samples: np.ndarray) -> dict:
    """
    Maximum Likelihood Estimation for Gaussian parameters.

    Returns both MLE estimates and unbiased estimates,
    with confidence intervals.
    """
    n = len(samples)

    # MLE estimates
    mu_mle = np.mean(samples)
    var_mle = np.var(samples, ddof=0)        # ddof=0 for MLE (divide by n)

    # Unbiased estimates
    var_unbiased = np.var(samples, ddof=1)   # ddof=1 for unbiased (divide by n-1)
    sigma_unbiased = np.std(samples, ddof=1)

    # Standard error of the mean
    se_mean = sigma_unbiased / np.sqrt(n)

    # Confidence interval for μ using the t-distribution
    t_critical = stats.t.ppf(0.975, df=n-1)  # 95% CI
    ci_mean = (mu_mle - t_critical * se_mean,
               mu_mle + t_critical * se_mean)

    # Confidence interval for σ² using the chi-squared distribution
    # (n-1)s²/σ² ~ χ²(n-1)
    chi2_lower = stats.chi2.ppf(0.025, df=n-1)
    chi2_upper = stats.chi2.ppf(0.975, df=n-1)
    ci_var = ((n-1) * var_unbiased / chi2_upper,
              (n-1) * var_unbiased / chi2_lower)

    return {
        'n': n,
        'mu_mle': mu_mle,
        'sigma2_mle': var_mle,
        'sigma2_unbiased': var_unbiased,
        'sigma_unbiased': sigma_unbiased,
        'se_mean': se_mean,
        'ci_mean_95': ci_mean,
        'ci_var_95': ci_var
    }

# Example: Estimate parameters from data
np.random.seed(42)
true_mu, true_sigma = 50, 10
n = 100
data = np.random.normal(true_mu, true_sigma, n)

results = gaussian_mle(data)
print(f"True parameters: μ = {true_mu}, σ = {true_sigma}, σ² = {true_sigma**2}")
print(f"\nMLE estimates:")
print(f"  μ̂ = {results['mu_mle']:.4f}")
print(f"  σ̂² (MLE) = {results['sigma2_mle']:.4f}")
print(f"  σ̂² (unbiased) = {results['sigma2_unbiased']:.4f}")
print(f"\n95% CI for μ: [{results['ci_mean_95'][0]:.4f}, {results['ci_mean_95'][1]:.4f}]")
print(f"95% CI for σ²: [{results['ci_var_95'][0]:.4f}, {results['ci_var_95'][1]:.4f}]")

# Demonstrate bias in variance estimation
print("\n--- Bias Demonstration ---")
n_simulations = 10000
mle_vars = []
unbiased_vars = []

for _ in range(n_simulations):
    sample = np.random.normal(true_mu, true_sigma, n)
    mle_vars.append(np.var(sample, ddof=0))
    unbiased_vars.append(np.var(sample, ddof=1))

print(f"True σ² = {true_sigma**2}")
print(f"E[σ̂²_MLE] ≈ {np.mean(mle_vars):.4f} (biased)")
print(f"E[s²] ≈ {np.mean(unbiased_vars):.4f} (unbiased)")
print(f"Theoretical E[σ̂²_MLE] = (n-1)/n · σ² = {(n-1)/n * true_sigma**2:.4f}")
```

The Gaussian distribution has remarkable closure properties that make it incredibly tractable for mathematical analysis and practical applications.

### Linear Transformation

If X ~ N(μ, σ²) and Y = aX + b, then:

$$Y \sim N(a\mu + b, a^2\sigma^2)$$

This means scaling and shifting a Gaussian yields another Gaussian. This is fundamental to standardization and z-scores.

### Sum of Independent Gaussians

If X₁ ~ N(μ₁, σ₁²) and X₂ ~ N(μ₂, σ₂²) are independent, then:

$$X_1 + X_2 \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$$

More generally, for independent Xᵢ ~ N(μᵢ, σᵢ²):

$$\sum_{i=1}^n a_i X_i \sim N\left(\sum_{i=1}^n a_i \mu_i, \sum_{i=1}^n a_i^2 \sigma_i^2\right)$$

Key insight: Variances add, standard deviations do not!
| Operation | Result | Formula |
|---|---|---|
| Linear transform Y = aX + b | Gaussian | N(aμ + b, a²σ²) |
| Sum of independent Gaussians | Gaussian | Means add, variances add |
| Product of PDFs | Gaussian (unnormalized) | Precision-weighted combination |
| Conditional (joint Gaussian) | Gaussian | Linear regression formula |
| Marginal (joint Gaussian) | Gaussian | Extract relevant parameters |
| Convolution | Gaussian | Variances add |
The Gaussian family is closed under almost every operation we care about: linear combinations, conditioning, marginalization, convolution, and conjugate Bayesian updating. This mathematical closure is why Gaussian assumptions are so common—they lead to analytically tractable solutions. When exact solutions aren't possible, Gaussian approximations (like the Laplace approximation) are often the first resort.
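These closure properties are easy to check by simulation. The sketch below (illustrative only; all parameters are arbitrary choices) verifies the linear-transformation and sum rules empirically, and confirms that the sum of two independent Gaussians is itself Gaussian:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N = 1_000_000

# Linear transformation: Y = aX + b with X ~ N(μ, σ²)  →  N(aμ + b, a²σ²)
mu, sigma, a, b = 5.0, 2.0, 3.0, -1.0
X = rng.normal(mu, sigma, N)
Y = a * X + b
print(f"Y = {a}X + {b}: mean {Y.mean():.3f} (theory {a*mu + b}), "
      f"var {Y.var():.3f} (theory {a**2 * sigma**2})")

# Sum of independent Gaussians: means add, VARIANCES add (not standard deviations)
mu1, s1, mu2, s2 = 1.0, 2.0, 4.0, 3.0
S = rng.normal(mu1, s1, N) + rng.normal(mu2, s2, N)
print(f"X1 + X2: mean {S.mean():.3f} (theory {mu1 + mu2}), "
      f"var {S.var():.3f} (theory {s1**2 + s2**2}), "
      f"std {S.std():.3f} (not {s1 + s2})")

# The sum is still Gaussian: compare its standardized values to N(0, 1)
Z = (S - (mu1 + mu2)) / np.sqrt(s1**2 + s2**2)
ks_stat, p_value = stats.kstest(Z[:5000], 'norm')   # subsample for a meaningful p-value
print(f"KS test of standardized sum vs N(0,1): D = {ks_stat:.4f}, p = {p_value:.3f}")
```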
The Gaussian distribution is woven into the fabric of machine learning. Here we explore its key applications in depth.

### Linear Regression

The standard linear regression model assumes:

$$y = \mathbf{w}^T \mathbf{x} + \epsilon, \quad \epsilon \sim N(0, \sigma^2)$$

This implies:

$$y | \mathbf{x} \sim N(\mathbf{w}^T \mathbf{x}, \sigma^2)$$

The log-likelihood for n observations is:

$$\ell(\mathbf{w}, \sigma^2) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n(y_i - \mathbf{w}^T\mathbf{x}_i)^2$$

Maximizing with respect to w is equivalent to minimizing the sum of squared errors—the famous least squares criterion emerges naturally from Gaussian noise assumptions!
```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp

# 1. LINEAR REGRESSION FROM GAUSSIAN LIKELIHOOD
def linear_regression_mle(X, y):
    """
    Derive linear regression from the Gaussian noise assumption.

    y = X @ w + ε, where ε ~ N(0, σ²I)

    MLE for w: minimize ||y - Xw||²
    Solution: w = (X'X)^{-1} X'y
    """
    # Add bias term
    X_bias = np.column_stack([np.ones(len(X)), X])

    # MLE (equivalent to OLS)
    w_mle = np.linalg.lstsq(X_bias, y, rcond=None)[0]

    # Predicted values
    y_pred = X_bias @ w_mle

    # MLE for σ²
    residuals = y - y_pred
    sigma2_mle = np.mean(residuals ** 2)

    # Log-likelihood at the MLE
    n = len(y)
    log_likelihood = -n/2 * np.log(2 * np.pi * sigma2_mle) - n/2

    return {
        'weights': w_mle,
        'sigma2': sigma2_mle,
        'log_likelihood': log_likelihood,
        'predictions': y_pred
    }

# 2. GAUSSIAN NAIVE BAYES
class GaussianNaiveBayes:
    """
    Gaussian Naive Bayes classifier.

    Assumes features are Gaussian given the class:
    P(x_j | y=c) = N(μ_jc, σ²_jc)
    """

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.n_features = X.shape[1]

        # Store class priors and per-class Gaussian parameters
        self.priors = {}
        self.means = {}
        self.variances = {}

        for c in self.classes:
            X_c = X[y == c]
            self.priors[c] = len(X_c) / len(X)
            self.means[c] = X_c.mean(axis=0)
            self.variances[c] = X_c.var(axis=0) + 1e-9   # Smoothing

    def _log_likelihood(self, x, c):
        """Compute log P(x | y=c) assuming Gaussian features."""
        log_prob = 0
        for j in range(self.n_features):
            log_prob += stats.norm.logpdf(
                x[j], self.means[c][j], np.sqrt(self.variances[c][j])
            )
        return log_prob

    def predict_proba(self, X):
        """Compute posterior probabilities P(y=c | x)."""
        log_posteriors = np.zeros((len(X), len(self.classes)))

        for i, x in enumerate(X):
            for j, c in enumerate(self.classes):
                log_posteriors[i, j] = (np.log(self.priors[c])
                                        + self._log_likelihood(x, c))

        # Normalize using log-sum-exp for numerical stability
        log_norm = logsumexp(log_posteriors, axis=1, keepdims=True)
        posteriors = np.exp(log_posteriors - log_norm)
        return posteriors

    def predict(self, X):
        return self.classes[self.predict_proba(X).argmax(axis=1)]

# Example usage
np.random.seed(42)
X = np.vstack([
    np.random.normal([0, 0], [1, 1], (50, 2)),
    np.random.normal([3, 3], [1, 1], (50, 2))
])
y = np.array([0]*50 + [1]*50)

gnb = GaussianNaiveBayes()
gnb.fit(X, y)
accuracy = (gnb.predict(X) == y).mean()
print(f"Gaussian Naive Bayes accuracy: {accuracy:.2%}")
```

Many statistical methods assume normality. Before applying such methods, we should verify whether the assumption is reasonable. Here are the main approaches:

### Visual Methods

1. Histogram: Compare the empirical distribution to the theoretical Gaussian bell curve.

2. Q-Q Plot (Quantile-Quantile): Plot sample quantiles against theoretical Gaussian quantiles. If the data is Gaussian, the points lie on a straight line.
   - Points curving up on the right → right skew
   - Points curving down on the left → left skew
   - S-shape → heavy tails
   - Inverted S → light tails

### Statistical Tests

Shapiro-Wilk Test: Most powerful for small to moderate samples (n < 5000).
- H₀: the data comes from a normal distribution
- Small p-value → reject normality

Kolmogorov-Smirnov Test: Compares the empirical CDF to the theoretical normal CDF.
- Less powerful than Shapiro-Wilk
- Sensitive to any departure from normality

Anderson-Darling Test: Similar to K-S but gives more weight to the tails.
- Good for detecting tail departures

D'Agostino-Pearson Test: Combines skewness and kurtosis tests.
- Tests based on the third and fourth moments
```python
import numpy as np
from scipy import stats

def comprehensive_normality_test(data, alpha=0.05):
    """
    Run a battery of statistical normality tests plus summary statistics.
    """
    n = len(data)
    results = {'n': n, 'alpha': alpha}

    # Descriptive statistics
    results['mean'] = np.mean(data)
    results['std'] = np.std(data, ddof=1)
    results['skewness'] = stats.skew(data)
    results['kurtosis'] = stats.kurtosis(data)   # Excess kurtosis (0 for normal)

    # Statistical tests
    # 1. Shapiro-Wilk (best for n < 5000)
    if n <= 5000:
        stat, p = stats.shapiro(data)
        results['shapiro_wilk'] = {'statistic': stat, 'p_value': p, 'normal': p > alpha}

    # 2. Kolmogorov-Smirnov (against the standard normal after standardization)
    standardized = (data - np.mean(data)) / np.std(data, ddof=1)
    stat, p = stats.kstest(standardized, 'norm')
    results['kolmogorov_smirnov'] = {'statistic': stat, 'p_value': p, 'normal': p > alpha}

    # 3. Anderson-Darling
    result = stats.anderson(data, dist='norm')
    # Use the 5% significance level (index 2)
    results['anderson_darling'] = {
        'statistic': result.statistic,
        'critical_value_5%': result.critical_values[2],
        'normal': result.statistic < result.critical_values[2]
    }

    # 4. D'Agostino-Pearson (requires n >= 20)
    if n >= 20:
        stat, p = stats.normaltest(data)
        results['dagostino_pearson'] = {'statistic': stat, 'p_value': p, 'normal': p > alpha}

    return results

def print_normality_results(results):
    print(f"\nNormality Test Results (n = {results['n']}, α = {results['alpha']})")
    print("=" * 60)
    print(f"Skewness: {results['skewness']:.4f} (0 for normal)")
    print(f"Excess Kurtosis: {results['kurtosis']:.4f} (0 for normal)")
    print("-" * 60)

    if 'shapiro_wilk' in results:
        r = results['shapiro_wilk']
        print(f"Shapiro-Wilk: W={r['statistic']:.4f}, p={r['p_value']:.4f} "
              f"→ {'Normal' if r['normal'] else 'NOT Normal'}")

    r = results['kolmogorov_smirnov']
    print(f"Kolmogorov-Smirnov: D={r['statistic']:.4f}, p={r['p_value']:.4f} "
          f"→ {'Normal' if r['normal'] else 'NOT Normal'}")

    r = results['anderson_darling']
    print(f"Anderson-Darling: A²={r['statistic']:.4f}, crit={r['critical_value_5%']:.4f} "
          f"→ {'Normal' if r['normal'] else 'NOT Normal'}")

    if 'dagostino_pearson' in results:
        r = results['dagostino_pearson']
        print(f"D'Agostino-Pearson: K²={r['statistic']:.4f}, p={r['p_value']:.4f} "
              f"→ {'Normal' if r['normal'] else 'NOT Normal'}")

# Test with different distributions
np.random.seed(42)

print("\n" + "="*60)
print("Testing NORMAL data:")
normal_data = np.random.normal(50, 10, 200)
print_normality_results(comprehensive_normality_test(normal_data))

print("\n" + "="*60)
print("Testing EXPONENTIAL data (should reject normality):")
exp_data = np.random.exponential(10, 200)
print_normality_results(comprehensive_normality_test(exp_data))
```

With large samples, normality tests will reject almost any real data (which is never exactly Gaussian). Focus on whether departures from normality are practically significant. The CLT often makes the normal approximation work even with non-normal data. Visual inspection via Q-Q plots is often more informative than p-values for large n.
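Since visual inspection is recommended for large samples, here is a minimal Q-Q plot sketch using `scipy.stats.probplot` (the sample sizes and parameters mirror the example above and are otherwise arbitrary):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

np.random.seed(42)
normal_data = np.random.normal(50, 10, 200)   # roughly Gaussian sample
exp_data = np.random.exponential(10, 200)     # right-skewed sample

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Q-Q plot: sample quantiles vs. theoretical N(0,1) quantiles.
# Gaussian data should hug the reference line; skewed data bends away from it.
stats.probplot(normal_data, dist='norm', plot=axes[0])
axes[0].set_title('Q-Q plot: Normal(50, 10²) sample')

stats.probplot(exp_data, dist='norm', plot=axes[1])
axes[1].set_title('Q-Q plot: Exponential(scale=10) sample')

plt.tight_layout()
plt.show()
```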
The Gaussian distribution is the most important continuous distribution in machine learning and statistics. Let's consolidate our understanding:
What's Next:

We've covered discrete (Bernoulli/Binomial) and continuous (Gaussian) distributions. Next, we explore the Poisson distribution—the natural model for counting rare events over time or space. The Poisson connects to the Binomial (as a limit) and is fundamental to event modeling, queuing theory, and count-based machine learning problems.
You now have a comprehensive understanding of the Gaussian distribution—from its mathematical definition through the Central Limit Theorem to its foundational role in machine learning. The Gaussian's mathematical tractability and theoretical justification make it the default choice for modeling continuous uncertainty.