While PMFs, PDFs, and CDFs completely characterize a distribution, they're often more detail than we need. Much of probability theory and machine learning relies on summary statistics—single numbers that capture key aspects of a distribution.
The two most important summaries are the expectation, which measures a distribution's center, and the variance, which measures its spread around that center.
These concepts are far from abstract. The expectation underlies virtually every loss function in ML—mean squared error, cross-entropy loss, and expected risk are all expectations. Variance underlies model uncertainty, the bias-variance tradeoff, and regularization theory. Mastering these concepts is mastering the mathematical language of machine learning.
By the end of this page, you will understand expectation and variance formally, know their key properties, compute them for discrete and continuous distributions, understand their role in ML loss functions and model analysis, and appreciate the computational perspective essential for practical applications.
The expectation (or expected value or mean) of a random variable is its long-run average value—the center of mass of its probability distribution.
Definition (Expectation):
For a discrete random variable $X$ with PMF $p$: $$E[X] = \sum_{x \in \mathcal{X}} x \cdot p(x)$$
For a continuous random variable $X$ with PDF $f$: $$E[X] = \int_{-\infty}^{\infty} x \cdot f(x) \, dx$$
The expectation is often denoted $\mu$, $\mu_X$, or $\bar{X}$ (the last typically for sample means).
Intuition: Weighted Average
The expectation is a weighted average where each value is weighted by its probability. Values that are more likely contribute more to the mean.
| Scenario | Weighted Average | Probability Weight |
|---|---|---|
| Die roll | (1+2+3+4+5+6)/6 = 3.5 | Each face has prob 1/6 |
| Loaded die (6 favored) | Higher than 3.5 | 6 has higher weight |
| Binary outcome {0, 1} | P(X = 1) | Mean equals probability of 1 |
Physical Interpretation:
If you cut out the PDF curve from a piece of cardboard, the expectation is the $x$-coordinate of the balance point (center of mass). The distribution 'balances' on its mean.
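As a quick numerical check of this picture, the short sketch below (assuming SciPy is available; the Gamma distribution is an arbitrary choice to make the PDF skewed) computes the balance point $\int x \cdot f(x)\,dx$ and confirms it equals the distribution's mean, which for a skewed PDF is not the same as the peak.

```python
import numpy as np
from scipy import stats, integrate

# A skewed PDF, so the balance point is not at the peak (arbitrary illustrative choice)
dist = stats.gamma(a=2, scale=1.5)  # mean = a * scale = 3.0, mode = (a - 1) * scale = 1.5

# Balance point (center of mass) of the PDF: ∫ x · f(x) dx
balance_point, _ = integrate.quad(lambda x: x * dist.pdf(x), 0, np.inf)

print(f"Balance point ∫x·f(x)dx : {balance_point:.4f}")
print(f"Distribution mean E[X]  : {dist.mean():.4f}")
print(f"Mode (peak of the PDF)  : {1.5:.4f}  # mean ≠ mode for a skewed PDF")
```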
```python
import numpy as np
from scipy import stats, integrate

def compute_expectations():
    """
    Compute expectations for various distributions.
    """
    print("Computing Expectations")
    print("=" * 60)

    # Discrete: Fair die
    print("\nDiscrete: Fair 6-sided die")
    support = np.arange(1, 7)
    pmf = np.ones(6) / 6
    expected = np.sum(support * pmf)
    print(f"  E[X] = Σ x·p(x) = (1+2+3+4+5+6)/6 = {expected:.4f}")

    # Discrete: Loaded die (6 has prob 0.5, others equal)
    print("\nDiscrete: Loaded die (P(6)=0.5)")
    loaded_pmf = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.5])
    expected_loaded = np.sum(support * loaded_pmf)
    print(f"  E[X] = {expected_loaded:.4f} (higher due to loaded 6)")

    # Discrete: Binomial
    print("\nDiscrete: Binomial(n=10, p=0.3)")
    n, p = 10, 0.3
    binomial = stats.binom(n, p)
    print(f"  E[X] = n·p = {n}·{p} = {n*p:.4f}")
    print(f"  Computed: {binomial.mean():.4f}")

    # Continuous: Normal
    print("\nContinuous: Normal(μ=5, σ²=4)")
    mu, sigma = 5, 2
    normal = stats.norm(mu, sigma)
    print(f"  E[X] = μ = {mu}")
    print(f"  Computed via integration: {normal.mean():.4f}")

    # Continuous: Exponential
    print("\nContinuous: Exponential(λ=2)")
    lam = 2
    exponential = stats.expon(scale=1/lam)
    print(f"  E[X] = 1/λ = {1/lam:.4f}")
    print(f"  Computed: {exponential.mean():.4f}")

    # Verify by numerical integration
    integral, _ = integrate.quad(lambda x: x * exponential.pdf(x), 0, np.inf)
    print(f"  Via ∫x·f(x)dx: {integral:.4f}")

def law_of_large_numbers_demo():
    """
    Demonstrate the Law of Large Numbers.
    """
    print("\n\nLaw of Large Numbers")
    print("=" * 60)
    print("\nThe sample mean converges to E[X] as n → ∞")
    print("  X̄_n = (1/n) Σ Xᵢ → E[X]")

    # True expectation for Exp(λ=2)
    true_mean = 0.5
    np.random.seed(42)
    print(f"\nExponential(λ=2), true E[X] = {true_mean}")
    print(f"\n{'n':>10} {'Sample Mean':>15} {'|Error|':>12}")
    print("-" * 40)

    for n in [10, 100, 1000, 10000, 100000]:
        samples = np.random.exponential(scale=true_mean, size=n)
        sample_mean = np.mean(samples)
        error = abs(sample_mean - true_mean)
        print(f"{n:>10} {sample_mean:>15.6f} {error:>12.6f}")

    print("\nThe sample mean concentrates around E[X] = 0.5")

compute_expectations()
law_of_large_numbers_demo()
```

Some distributions have undefined or infinite expectations. The Cauchy distribution (fat tails) has no mean—its integral diverges. In practice, check that your modeling assumptions imply finite expectations, especially for heavy-tailed phenomena like financial returns or claim sizes.
Often we need the expected value not of $X$ itself, but of some function $g(X)$. The Law of the Unconscious Statistician (LOTUS) tells us how to compute this without first finding the distribution of $g(X)$.
LOTUS (Law of the Unconscious Statistician):
For discrete $X$: $$E[g(X)] = \sum_{x \in \mathcal{X}} g(x) \cdot p(x)$$
For continuous $X$: $$E[g(X)] = \int_{-\infty}^{\infty} g(x) \cdot f(x) \, dx$$
We simply weight $g(x)$ by the probability of $x$—no need to derive the distribution of $g(X)$ first.
Why 'Unconscious Statistician'?
The name is mildly pejorative—early statisticians would apply this formula 'unconsciously' without proving it. But it is correct! The proof follows from the definition of expectation for the random variable $Y = g(X)$.
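For the discrete case, the argument is one line: group the outcomes of $Y = g(X)$ by their preimages,
$$E[g(X)] = \sum_{y} y \, P(g(X) = y) = \sum_{y} y \sum_{x:\, g(x) = y} p(x) = \sum_{x \in \mathcal{X}} g(x) \, p(x).$$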
Common Applications:
| Function $g(X)$ | Meaning | ML Use |
|---|---|---|
| $X^2$ | Second moment | Variance computation |
| $(X - \mu)^2$ | Squared deviation | Variance, MSE |
| $\lvert X - y\rvert^2$ | Squared error | Expected loss |
| $\log p(X)$ | Log-probability | Entropy, KL divergence |
| $\mathbf{1}_{X \in A}$ | Indicator | Probability computation |
```python
import numpy as np
from scipy import stats, integrate

def lotus_demonstrations():
    """
    Demonstrate LOTUS for computing E[g(X)].
    """
    print("Law of the Unconscious Statistician (LOTUS)")
    print("=" * 60)

    # Example 1: E[X²] for a fair die
    print("\nExample 1: E[X²] for fair die")
    support = np.arange(1, 7)
    pmf = np.ones(6) / 6

    # LOTUS: E[X²] = Σ x² · p(x)
    e_x_squared = np.sum(support**2 * pmf)
    print(f"  E[X²] = Σ x²·p(x) = (1+4+9+16+25+36)/6 = {e_x_squared:.4f}")

    # Compare to (E[X])² to see they differ
    e_x = np.sum(support * pmf)
    print(f"  E[X] = {e_x:.4f}")
    print(f"  (E[X])² = {e_x**2:.4f}")
    print(f"  Note: E[X²] ≠ (E[X])², difference = {e_x_squared - e_x**2:.4f}")

    # Example 2: E[e^X] for Normal (moment generating function idea)
    print("\nExample 2: E[exp(X)] for X ~ N(0, 1)")
    normal = stats.norm(0, 1)

    # LOTUS via integration
    e_exp_x, _ = integrate.quad(lambda x: np.exp(x) * normal.pdf(x), -20, 20)

    # Analytical: For N(μ,σ²), E[exp(X)] = exp(μ + σ²/2)
    analytical = np.exp(0 + 1/2)
    print(f"  Via LOTUS (numerical): {e_exp_x:.6f}")
    print(f"  Analytical formula: {analytical:.6f}")

    # Example 3: E[max(X, 0)] - the ReLU-like function
    print("\nExample 3: E[max(X, 0)] for X ~ N(0, 1)")

    # LOTUS: E[max(X, 0)] = ∫₀^∞ x · f(x) dx
    relu_mean, _ = integrate.quad(lambda x: max(x, 0) * normal.pdf(x), -10, 10)

    # Analytical: For standard normal, E[max(X,0)] = 1/√(2π) ≈ 0.3989
    analytical_relu = 1 / np.sqrt(2 * np.pi)
    print(f"  Via LOTUS (numerical): {relu_mean:.6f}")
    print(f"  Analytical: {analytical_relu:.6f}")
    print("  This is relevant for ReLU activation analysis!")

def indicator_function_expectations():
    """
    Show how E[1{X ∈ A}] = P(X ∈ A).
    """
    print("\n\nExpectation of Indicator Functions")
    print("=" * 60)
    print("\nKey result: E[1{X ∈ A}] = P(X ∈ A)")
    print("Because: 1{X ∈ A} is 1 with prob P(X ∈ A), 0 otherwise")
    print("  E[1{X ∈ A}] = 1·P(A) + 0·P(Aᶜ) = P(A)")

    # Demonstration
    normal = stats.norm(0, 1)
    A = (-1, 1)  # Event: X ∈ [-1, 1]

    # Using CDF
    prob_A_cdf = normal.cdf(1) - normal.cdf(-1)

    # Using LOTUS with indicator
    indicator = lambda x: 1.0 if A[0] <= x <= A[1] else 0.0
    prob_A_lotus, _ = integrate.quad(lambda x: indicator(x) * normal.pdf(x), -10, 10)

    print(f"\n  P(X ∈ [-1, 1]) for N(0,1):")
    print(f"    Via CDF: {prob_A_cdf:.6f}")
    print(f"    Via E[indicator]: {prob_A_lotus:.6f}")
    print("\n  Both methods give the same answer!")

def ml_loss_functions_as_expectations():
    """
    Cast ML loss functions as expectations.
    """
    print("\n\nML Loss Functions as Expectations")
    print("=" * 60)
    print("\nMSE, MAE, Cross-Entropy are all expectations!")

    # Setup: predict μ̂ for Y ~ N(μ, σ²)
    true_mu = 3.0
    sigma = 1.0
    mu_hat = 2.5  # Our prediction

    print(f"\nScenario: True Y ~ N({true_mu}, {sigma}²), predict μ̂ = {mu_hat}")

    # Mean Squared Error as expectation
    # MSE = E[(Y - μ̂)²]
    normal = stats.norm(true_mu, sigma)
    mse_analytical = sigma**2 + (true_mu - mu_hat)**2  # Variance + Bias²
    mse_numerical, _ = integrate.quad(
        lambda y: (y - mu_hat)**2 * normal.pdf(y), -10, 10
    )
    print(f"\n  MSE = E[(Y - μ̂)²]")
    print(f"      = Var(Y) + (E[Y] - μ̂)²")
    print(f"      = {sigma**2} + ({true_mu} - {mu_hat})²")
    print(f"      = {mse_analytical:.4f}")
    print(f"  (Numerical: {mse_numerical:.4f})")

    # Mean Absolute Error
    mae_numerical, _ = integrate.quad(
        lambda y: np.abs(y - mu_hat) * normal.pdf(y), -10, 10
    )
    print(f"\n  MAE = E[|Y - μ̂|] = {mae_numerical:.4f}")

    print("\n  Key insight: Loss functions are expectations over data distribution")
    print("  Minimizing expected loss = Minimizing E[L(Y, μ̂)]")

lotus_demonstrations()
indicator_function_expectations()
ml_loss_functions_as_expectations()
```

Expectation has powerful algebraic properties that make it tractable for analysis. These properties are used constantly in ML theory.
Fundamental Properties:
1. Linearity (The most important property): $$E[aX + bY] = aE[X] + bE[Y]$$
This holds for any random variables $X$ and $Y$, regardless of their dependence structure. Linearity extends to finite sums: $$E\left[\sum_{i=1}^n a_i X_i\right] = \sum_{i=1}^n a_i E[X_i]$$
2. Monotonicity: If $X \leq Y$ with probability 1, then $E[X] \leq E[Y]$.
3. Constant Expectation: $E[c] = c$ for any constant $c$.
4. Non-negative Expectation: If $X \geq 0$ with probability 1, then $E[X] \geq 0$.
Linearity of expectation is simpler and more powerful than it first appears. It holds even when X and Y are dependent—their covariance doesn't affect the sum of expectations. Many seemingly difficult counting/probability problems become trivial once you express them as E[Σ indicators] = Σ P(event_i).
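Why dependence is irrelevant is visible in a one-line derivation (discrete case, joint PMF $p(x, y)$): the joint distribution enters only through its marginals,
$$E[aX + bY] = \sum_{x, y} (a x + b y)\, p(x, y) = a \sum_{x} x \sum_{y} p(x, y) + b \sum_{y} y \sum_{x} p(x, y) = a\,E[X] + b\,E[Y].$$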
```python
import numpy as np
from scipy import stats

def linearity_demonstration():
    """
    Demonstrate linearity of expectation.
    """
    print("Linearity of Expectation")
    print("=" * 60)

    # Linearity for independent RVs
    print("\nProperty: E[aX + bY] = a·E[X] + b·E[Y]")
    print("(This works even when X and Y are DEPENDENT!)")

    np.random.seed(42)
    n = 100000

    # Independent example
    X = np.random.normal(2, 1, n)    # E[X] = 2
    Y = np.random.exponential(3, n)  # E[Y] = 3
    a, b = 2, -1
    Z = a * X + b * Y

    print(f"\nIndependent X ~ N(2,1), Y ~ Exp(1/3)")
    print(f"  a = {a}, b = {b}")
    print(f"  E[{a}X + ({b})Y] = {a}·E[X] + ({b})·E[Y] = {a}·2 + ({b})·3 = {a*2 + b*3}")
    print(f"  Sample mean of Z: {np.mean(Z):.4f}")

    # Dependent example
    print("\nDependent example: Y = X² (clearly dependent on X)")
    Y_dep = X ** 2
    Z_dep = a * X + b * Y_dep
    e_x = 2          # E[X]
    e_y_dep = 4 + 1  # E[X²] = Var(X) + E[X]² = 1 + 4 = 5
    print(f"  E[X] = {e_x}, E[X²] = E[Y] = {e_y_dep}")
    print(f"  E[{a}X + ({b})X²] = {a}·{e_x} + ({b})·{e_y_dep} = {a*e_x + b*e_y_dep}")
    print(f"  Sample mean of Z: {np.mean(Z_dep):.4f}")
    print("  Linearity still works despite dependence!")

def counting_with_indicators():
    """
    Show the power of linearity with indicator functions.
    """
    print("\n\nCounting via Linearity + Indicators")
    print("=" * 60)
    print("\nProblem: Expected number of fixed points in random permutation")
    print("(Fixed point: element that stays in its original position)")
    print("\nNaive approach: Sum over all n! permutations... intractable")
    print("Smart approach: X = Σᵢ 1{element i is fixed}")
    print("  E[X] = Σᵢ E[1{i fixed}] = Σᵢ P(i fixed)")
    print("  P(i fixed) = 1/n for each i")
    print("  E[X] = n × (1/n) = 1")

    # Verify empirically
    def count_fixed_points(n, trials=10000):
        count = 0
        for _ in range(trials):
            perm = np.random.permutation(n)
            fixed = np.sum(perm == np.arange(n))
            count += fixed
        return count / trials

    print("\nEmpirical verification:")
    for n in [5, 10, 50, 100]:
        empirical = count_fixed_points(n)
        print(f"  n = {n:3d}: E[fixed points] ≈ {empirical:.4f} (theory: 1.00)")

def expectation_bounds():
    """
    Demonstrate bounds on expectations.
    """
    print("\n\nExpectation Bounds")
    print("=" * 60)
    print("\n1. Jensen's Inequality:")
    print("   For convex g: E[g(X)] ≥ g(E[X])")
    print("   For concave g: E[g(X)] ≤ g(E[X])")

    # Example: g(x) = x² is convex
    # So E[X²] ≥ (E[X])²
    np.random.seed(42)
    X = np.random.exponential(2, 10000)  # E[X] = 2
    e_x = np.mean(X)
    e_x_sq = np.mean(X**2)
    print(f"\n  Example: X ~ Exp(1/2), g(x) = x² (convex)")
    print(f"    E[X²] = {e_x_sq:.4f} ≥ (E[X])² = {e_x**2:.4f}?")
    print(f"    Yes: {e_x_sq:.4f} ≥ {e_x**2:.4f} ✓")

    # Example: g(x) = log(x) is concave
    print(f"\n  Example: g(x) = log(x) (concave)")
    e_log_x = np.mean(np.log(X))
    log_e_x = np.log(e_x)
    print(f"    E[log(X)] = {e_log_x:.4f} ≤ log(E[X]) = {log_e_x:.4f}?")
    print(f"    Yes: {e_log_x:.4f} ≤ {log_e_x:.4f} ✓")

    print("\n  Jensen is crucial for understanding:")
    print("   - Why geometric mean ≤ arithmetic mean")
    print("   - Why KL divergence ≥ 0")
    print("   - Variational bounds in ML")

linearity_demonstration()
counting_with_indicators()
expectation_bounds()
```

While expectation captures the center of a distribution, variance captures its spread—how much values deviate from the mean.
Definition (Variance):
$$\text{Var}(X) = E[(X - E[X])^2] = E[(X - \mu)^2]$$
Variance is the expected squared deviation from the mean. The standard deviation is $\sigma = \sqrt{\text{Var}(X)}$, which has the same units as $X$.
Computational Formula:
Using linearity of expectation: $$\text{Var}(X) = E[X^2] - (E[X])^2$$
This formula is often more convenient for calculation.
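The identity follows directly from linearity by expanding the square inside the expectation (writing $\mu = E[X]$):
$$\text{Var}(X) = E[X^2 - 2\mu X + \mu^2] = E[X^2] - 2\mu E[X] + \mu^2 = E[X^2] - \mu^2.$$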
Why not E[|X - μ|] (mean absolute deviation)? Squaring has several advantages: (1) it's differentiable everywhere, (2) it connects to inner products and geometry, (3) it leads to closed-form solutions in optimization, (4) it connects naturally to Gaussian distributions. The absolute deviation gives the median, not mean, as the optimal point—a different kind of summary.
```python
import numpy as np
from scipy import stats

def compute_variances():
    """
    Compute variances for various distributions.
    """
    print("Computing Variances")
    print("=" * 60)

    # Fair die
    print("\nDiscrete: Fair 6-sided die")
    support = np.arange(1, 7)
    pmf = np.ones(6) / 6
    mu = np.sum(support * pmf)
    e_x_sq = np.sum(support**2 * pmf)
    var = e_x_sq - mu**2
    print(f"  E[X] = {mu:.4f}")
    print(f"  E[X²] = {e_x_sq:.4f}")
    print(f"  Var(X) = E[X²] - (E[X])² = {e_x_sq:.4f} - {mu**2:.4f} = {var:.4f}")
    print(f"  Std(X) = √Var = {np.sqrt(var):.4f}")

    # Binomial
    print("\nDiscrete: Binomial(n=10, p=0.3)")
    n, p = 10, 0.3
    binomial = stats.binom(n, p)
    print(f"  Var(X) = n·p·(1-p) = {n}·{p}·{1-p} = {n*p*(1-p):.4f}")
    print(f"  From scipy: {binomial.var():.4f}")

    # Normal
    print("\nContinuous: Normal(μ=0, σ²=4)")
    normal = stats.norm(0, 2)  # σ = 2
    print(f"  Var(X) = σ² = 4")
    print(f"  From scipy: {normal.var():.4f}")

    # Exponential
    print("\nContinuous: Exponential(λ=2)")
    lam = 2
    exponential = stats.expon(scale=1/lam)
    print(f"  Var(X) = 1/λ² = {1/lam**2:.4f}")
    print(f"  From scipy: {exponential.var():.4f}")

def variance_interpretation():
    """
    Interpret variance geometrically and statistically.
    """
    print("\n\nInterpreting Variance")
    print("=" * 60)

    print("\n1. Chebyshev's Inequality:")
    print("   P(|X - μ| ≥ k·σ) ≤ 1/k²")
    print("\n   This gives universal bounds on tail probabilities!")
    print("\n   For any distribution:")
    print(f"   {'k':>4} {'P(|X-μ| ≥ kσ) ≤':>20}")
    for k in [1, 2, 3, 4, 5]:
        print(f"   {k:>4} {1/k**2:>20.4f}")

    print("\n2. Comparison: Chebyshev vs Normal")
    print("   The Normal is 'better behaved' than Chebyshev's worst case")
    normal = stats.norm(0, 1)
    print(f"\n   {'k':>4} {'Chebyshev bound':>18} {'Normal actual':>15}")
    for k in [1, 2, 3]:
        cheb_bound = 1 / k**2
        normal_actual = 2 * (1 - normal.cdf(k))
        print(f"   {k:>4} {cheb_bound:>18.4f} {normal_actual:>15.6f}")
    print("\n   Normal tails decay much faster than Chebyshev allows!")

def variance_vs_mean():
    """
    Illustrate that same mean but different variance
    gives very different distributions.
    """
    print("\n\nSame Mean, Different Variance")
    print("=" * 60)
    print("\nAll have mean 0, but very different spreads:")

    distributions = [
        ("N(0, 0.1)", stats.norm(0, np.sqrt(0.1))),
        ("N(0, 1.0)", stats.norm(0, 1)),
        ("N(0, 10.)", stats.norm(0, np.sqrt(10))),
    ]

    print(f"\n{'Distribution':<15} {'Var':>8} {'P(|X|>1)':>12} {'P(|X|>3)':>12}")
    print("-" * 50)
    for name, dist in distributions:
        var = dist.var()
        p_outside_1 = 2 * (1 - dist.cdf(1))
        p_outside_3 = 2 * (1 - dist.cdf(3))
        print(f"{name:<15} {var:>8.2f} {p_outside_1:>12.4f} {p_outside_3:>12.4f}")

    print("\n  Higher variance → more probability in tails")
    print("  This affects: outliers, risk, uncertainty quantification")

compute_variances()
variance_interpretation()
variance_vs_mean()
```

Unlike expectation, variance is not linear. Its properties are more subtle and depend on the dependence structure between random variables.
Key Properties:
1. Scaling: $$\text{Var}(aX) = a^2 \text{Var}(X)$$
Constants square when pulled out of variance.
2. Shift Invariance: $$\text{Var}(X + b) = \text{Var}(X)$$
Adding a constant doesn't change spread.
3. Sum of Independent RVs: $$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) \quad \text{(if } X \perp Y \text{)}$$
4. General Sum (with Covariance): $$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y)$$
Covariance:
The covariance measures how two random variables vary together:
$$\text{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - E[X]E[Y]$$
Properties:
1. Symmetry: $\text{Cov}(X, Y) = \text{Cov}(Y, X)$.
2. Relation to variance: $\text{Cov}(X, X) = \text{Var}(X)$.
3. Bilinearity: $\text{Cov}(aX + b, cY + d) = ac\,\text{Cov}(X, Y)$.
4. Independence: If $X \perp Y$, then $\text{Cov}(X, Y) = 0$; the converse does not hold in general.
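As a minimal numerical sanity check of the identity $\text{Cov}(X, Y) = E[XY] - E[X]E[Y]$, the sketch below (assuming NumPy; the correlated pair $Y = 2X + \varepsilon$ is an arbitrary construction for illustration) compares the identity to `np.cov` and to the theoretical value $2\,\text{Var}(X)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# A dependent pair: Y = 2X + noise (arbitrary illustrative construction)
X = rng.normal(loc=1.0, scale=2.0, size=n)          # Var(X) = 4
Y = 2 * X + rng.normal(loc=0.0, scale=1.0, size=n)  # Cov(X, Y) = 2·Var(X) = 8

# Covariance via the identity E[XY] - E[X]E[Y]
cov_identity = np.mean(X * Y) - np.mean(X) * np.mean(Y)
cov_numpy = np.cov(X, Y)[0, 1]

print(f"E[XY] - E[X]E[Y] : {cov_identity:.4f}")
print(f"np.cov           : {cov_numpy:.4f}")
print(f"Theory 2·Var(X)  : {2 * 4:.4f}")
```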
```python
import numpy as np
from scipy import stats

def variance_scaling_and_shift():
    """
    Demonstrate Var(aX + b) = a² Var(X).
    """
    print("Variance Properties: Scaling and Shifting")
    print("=" * 60)

    np.random.seed(42)
    X = np.random.normal(3, 2, 100000)  # E[X]=3, Var[X]=4

    print(f"\nOriginal: X ~ N(3, 4)")
    print(f"  Sample Var(X) = {np.var(X):.4f}")

    # Scaling
    a = 3
    Y = a * X
    print(f"\nScaling: Y = {a}X")
    print(f"  Var({a}X) = {a}² · Var(X) = {a**2} × 4 = {a**2 * 4}")
    print(f"  Sample Var(Y) = {np.var(Y):.4f}")

    # Shifting
    b = 100
    Z = X + b
    print(f"\nShifting: Z = X + {b}")
    print(f"  Var(X + {b}) = Var(X) = 4")
    print(f"  Sample Var(Z) = {np.var(Z):.4f}")

    # Both
    W = a * X + b
    print(f"\nBoth: W = {a}X + {b}")
    print(f"  Var({a}X + {b}) = {a}² · Var(X) = {a**2 * 4}")
    print(f"  Sample Var(W) = {np.var(W):.4f}")

def variance_of_sums():
    """
    Demonstrate Var(X+Y) with and without independence.
    """
    print("\n\nVariance of Sums: Independence Matters")
    print("=" * 60)

    np.random.seed(42)
    n = 100000

    # Independent case
    print("\n1. Independent X, Y")
    X = np.random.normal(0, 2, n)  # Var = 4
    Y = np.random.normal(0, 3, n)  # Var = 9
    S_indep = X + Y
    print(f"   Var(X) = 4, Var(Y) = 9")
    print(f"   Theory: Var(X+Y) = Var(X) + Var(Y) = 4 + 9 = 13")
    print(f"   Sample Var(X+Y) = {np.var(S_indep):.4f}")

    # Positive correlation
    print("\n2. Positively correlated: Y = X + ε")
    epsilon = np.random.normal(0, 1, n)  # Var = 1
    Y_pos = X + epsilon  # Var(Y) = Var(X) + Var(ε) = 4 + 1 = 5
    cov_pos = np.cov(X, Y_pos)[0, 1]
    S_pos = X + Y_pos
    print(f"   Var(X) = 4, Var(Y) = {np.var(Y_pos):.4f}")
    print(f"   Cov(X, Y) = {cov_pos:.4f}")
    print(f"   Theory: Var(X+Y) = Var(X) + Var(Y) + 2Cov(X,Y)")
    print(f"           = 4 + 5 + 2×{cov_pos:.2f} ≈ {4 + 5 + 2*cov_pos:.4f}")
    print(f"   Sample Var(X+Y) = {np.var(S_pos):.4f}")

    # Negative correlation
    print("\n3. Negatively correlated: Y = -X + ε")
    Y_neg = -X + epsilon  # Cov(X, Y) = -Var(X) = -4
    cov_neg = np.cov(X, Y_neg)[0, 1]
    S_neg = X + Y_neg
    print(f"   Var(X) = 4, Var(Y) = {np.var(Y_neg):.4f}")
    print(f"   Cov(X, Y) = {cov_neg:.4f}")
    print(f"   Theory: Var(X+Y) ≈ 4 + 5 + 2×({cov_neg:.2f}) ≈ {4 + 5 + 2*cov_neg:.4f}")
    print(f"   Sample Var(X+Y) = {np.var(S_neg):.4f}")

    print("\n   Key insight: Negative correlation reduces variance!")
    print("   This is the basis for hedging and portfolio diversification.")

def sample_mean_variance():
    """
    Derive the variance of the sample mean.
    """
    print("\n\nVariance of the Sample Mean")
    print("=" * 60)
    print("\nFor i.i.d. samples X₁, ..., Xₙ with Var(Xᵢ) = σ²:")
    print("  X̄ = (1/n) Σ Xᵢ")
    print("  Var(X̄) = Var((1/n) Σ Xᵢ)")
    print("         = (1/n²) Var(Σ Xᵢ)")
    print("         = (1/n²) × n × σ²   [independence]")
    print("         = σ²/n")
    print("\n⟹ Std(X̄) = σ/√n (standard error)")

    # Demonstrate
    np.random.seed(42)
    true_var = 9  # σ² = 9, σ = 3
    print(f"\nDemonstration: Xᵢ ~ N(0, {true_var})")
    print(f"{'n':>8} {'Theory Var(X̄)':>15} {'Sample Var(X̄)':>16}")
    print("-" * 42)

    for n in [10, 100, 1000]:
        # Compute many sample means
        sample_means = []
        for _ in range(10000):
            samples = np.random.normal(0, 3, n)
            sample_means.append(np.mean(samples))
        theory_var = true_var / n
        sample_var = np.var(sample_means)
        print(f"{n:>8} {theory_var:>15.6f} {sample_var:>16.6f}")

    print("\n  This is why averaging reduces uncertainty!")

variance_scaling_and_shift()
variance_of_sums()
sample_mean_variance()
```

Expectation and variance are not abstract concepts in ML—they are the mathematical fabric of learning algorithms.
```python
import numpy as np
from scipy import stats

def bias_variance_decomposition():
    """
    Demonstrate the bias-variance decomposition.
    """
    print("Bias-Variance Decomposition")
    print("=" * 60)
    print("\nFor squared error loss:")
    print("  E[(Ŷ - Y)²] = (Bias)² + Variance + σ²")
    print("\nwhere:")
    print("  Bias = E[Ŷ] - f(x) (systematic error)")
    print("  Variance = Var(Ŷ) (sensitivity to training data)")
    print("  σ² = Var(Y|X) (irreducible noise)")

    # Simulation: True function f(x) = sin(2x), noisy observations
    np.random.seed(42)

    def true_function(x):
        return np.sin(2 * x)

    noise_var = 0.1  # σ² = 0.1
    n_train = 20
    n_fits = 200     # Generate many training sets and fit polynomial models

    x_test = 0.5     # Test at this point
    y_true = true_function(x_test)

    print(f"\nSimulation: f(x) = sin(2x), x_test = {x_test}")
    print(f"  noise σ² = {noise_var}, n_train = {n_train}")

    for degree in [1, 3, 10]:
        predictions = []
        for _ in range(n_fits):
            # Generate training data
            x_train = np.random.uniform(0, 1, n_train)
            y_train = true_function(x_train) + np.random.normal(0, np.sqrt(noise_var), n_train)

            # Fit polynomial
            coeffs = np.polyfit(x_train, y_train, degree)
            y_pred = np.polyval(coeffs, x_test)
            predictions.append(y_pred)

        predictions = np.array(predictions)

        # Compute components
        bias = np.mean(predictions) - y_true
        variance = np.var(predictions)
        mse = np.mean((predictions - y_true) ** 2)

        print(f"\n  Degree {degree:2d} polynomial:")
        print(f"    Bias² = ({bias:.4f})² = {bias**2:.4f}")
        print(f"    Variance = {variance:.4f}")
        print(f"    Decomposition: {bias**2:.4f} + {variance:.4f} = {bias**2 + variance:.4f}")
        print(f"    Actual MSE = {mse:.4f}")

    print("\n  Low degree → high bias, low variance (underfitting)")
    print("  High degree → low bias, high variance (overfitting)")

def monte_carlo_expectation():
    """
    Demonstrate Monte Carlo estimation of expectations.
    """
    print("\n\nMonte Carlo Estimation of Expectations")
    print("=" * 60)
    print("\nAny expectation E[g(X)] can be approximated by sampling:")
    print("  E[g(X)] ≈ (1/n) Σᵢ g(Xᵢ) where Xᵢ ~ distribution of X")

    # Example: Compute E[exp(X)] where X ~ N(0, 1)
    # Analytical: E[exp(X)] = exp(μ + σ²/2) = exp(0.5) ≈ 1.6487
    print("\nExample: E[exp(X)] for X ~ N(0, 1)")
    analytical = np.exp(0.5)
    print(f"  Analytical result: {analytical:.6f}")

    np.random.seed(42)
    print(f"\n  {'n':>8} {'Monte Carlo':>15} {'|Error|':>12} {'Std Error':>12}")
    print("  " + "-" * 50)
    for n in [100, 1000, 10000, 100000]:
        samples = np.random.normal(0, 1, n)
        mc_estimate = np.mean(np.exp(samples))
        error = abs(mc_estimate - analytical)
        # Standard error of MC estimate
        std_error = np.std(np.exp(samples)) / np.sqrt(n)
        print(f"  {n:>8} {mc_estimate:>15.6f} {error:>12.6f} {std_error:>12.6f}")

    print("\n  Error decreases as O(1/√n) - the Monte Carlo rate")

def expectation_in_loss_functions():
    """
    Show how loss functions are expectations and why this matters.
    """
    print("\n\nLoss Functions as Expectations")
    print("=" * 60)
    print("\nRisk = E[L(Y, Ŷ)] = Expected Loss")
    print("We minimize risk, not individual losses.")

    # Illustration: Optimal prediction for squared vs absolute loss
    print("\nOptimal prediction under different losses:")

    # Simulate Y from a skewed distribution (log-normal)
    np.random.seed(42)
    n = 100000
    Y = np.random.lognormal(0, 1, n)

    print(f"\nY ~ LogNormal(0, 1)")
    print(f"  E[Y] (mean) = {np.mean(Y):.4f}")
    print(f"  Median = {np.median(Y):.4f}")

    # Squared loss: optimal is mean
    # Absolute loss: optimal is median
    predictions = np.linspace(0.5, 3.0, 100)
    min_mse = np.inf
    min_mae = np.inf
    opt_mse_pred = None
    opt_mae_pred = None
    for pred in predictions:
        mse = np.mean((Y - pred) ** 2)
        mae = np.mean(np.abs(Y - pred))
        if mse < min_mse:
            min_mse = mse
            opt_mse_pred = pred
        if mae < min_mae:
            min_mae = mae
            opt_mae_pred = pred

    print(f"\n  Optimal for MSE: Ŷ = {opt_mse_pred:.4f} (≈ mean)")
    print(f"  Optimal for MAE: Ŷ = {opt_mae_pred:.4f} (≈ median)")

    print("\n  Key insight: The loss function determines the optimal summary!")
    print("   - Squared loss → predict the mean")
    print("   - Absolute loss → predict the median")
    print("   - 0-1 loss (classification) → predict the mode")

bias_variance_decomposition()
monte_carlo_expectation()
expectation_in_loss_functions()
```

Expectation and variance are the primary tools for summarizing distributions and analyzing learning algorithms.
What's Next:
Expectation and variance are the first two moments of a distribution. The next page explores higher moments—skewness, kurtosis, and the full moment hierarchy—which capture asymmetry and tail behavior and, via moment generating functions, encode the distribution completely.
You now command the two most important summary statistics in probability and ML. Expectation and variance are not merely descriptive—they are the mathematical foundations of risk, optimization, and the bias-variance tradeoff that governs all of machine learning.