On the previous page, we mastered discrete random variables—quantities that take countably many values like class labels, word tokens, or click counts. Now we confront a profound extension: continuous random variables, which can take any value in an uncountable set such as an interval $[a, b]$ or the entire real line $\mathbb{R}$.
This isn't merely a generalization—it requires a fundamental conceptual shift. For discrete random variables, we could ask 'What is the probability that $X$ equals exactly 3?' and get a meaningful nonzero answer. For continuous random variables, this question becomes problematic. The probability that a truly continuous quantity equals exactly any specific value is zero.
Yet continuous random variables are everywhere in machine learning: regression targets and residuals, neural network weights and activations, latent variables in generative models, predicted probabilities and proportions, and waiting times between events all live in continuous spaces.
By the end of this page, you will understand why continuous random variables require probability density functions (PDFs) instead of PMFs, master the mathematical properties of PDFs, and see how these concepts enable regression models, generative learning, and probabilistic inference in continuous spaces.
Before defining continuous random variables formally, we must confront a seeming paradox that illuminates the conceptual leap required.
The Problem:
Consider a random variable $X$ uniformly distributed on $[0, 1]$—every value in the interval is 'equally likely.' There are uncountably many values in $[0, 1]$ (as many as the entire real line). If we tried to assign equal positive probability to each, what would happen?
Suppose each point has probability $p > 0$. The sum over all points would be: $$\sum_{x \in [0,1]} p = \text{uncountable sum of positive values} = \infty$$
This violates the probability axiom that total probability equals 1. We're forced to conclude that for any individual point $x$: $$P(X = x) = 0$$
Yet the event 'X takes some value in [0, 1]' has probability 1. How can a union of probability-zero events have probability 1?
The resolution lies in recognizing that we cannot meaningfully sum uncountably many numbers—even zeros. Countable additivity (a probability axiom) says P(∪ Aᵢ) = Σ P(Aᵢ) for countable unions, but [0,1] is uncountable. We need a different approach: instead of point probabilities, we work with probability densities and recover probabilities through integration over intervals.
The Conceptual Shift:
| Discrete World | Continuous World |
|---|---|
| Individual values have probability | Individual values have zero probability |
| PMF: $p(x) = P(X = x)$ | PDF: $f(x) \neq P(X = x)$ |
| Sum to get probabilities | Integrate to get probabilities |
| $P(X \in A) = \sum_{x \in A} p(x)$ | $P(X \in A) = \int_A f(x) \, dx$ |
| PMF values are probabilities | PDF values are densities (can exceed 1) |
Thinking in terms of probability density rather than point probability is the key insight. Density represents 'probability per unit length'—a rate at which probability accumulates as you sweep across the real line.
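To make "probability per unit length" concrete, here is a minimal sketch (using the standard normal as an example) showing that $P(x \leq X \leq x + \Delta)/\Delta$ converges to the density $f(x)$ as $\Delta$ shrinks:

```python
from scipy import stats

# Density as "probability per unit length": for small Δ,
# P(x ≤ X ≤ x + Δ) ≈ f(x) · Δ, so the ratio approaches f(x).
normal = stats.norm(0, 1)
x = 0.5
for delta in [1.0, 0.1, 0.01, 0.001]:
    ratio = (normal.cdf(x + delta) - normal.cdf(x)) / delta
    print(f"Δ = {delta:6.3f}: P(x ≤ X ≤ x+Δ)/Δ = {ratio:.6f}")
print(f"PDF value f({x}) = {normal.pdf(x):.6f}")
```

The ratio is exactly the average density over $[x, x + \Delta]$, which tends to the density at $x$ as the interval shrinks.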
With the conceptual groundwork laid, we can now define continuous random variables precisely.
Definition (Continuous Random Variable):
A random variable $X$ is called continuous if there exists a non-negative function $f_X: \mathbb{R} \rightarrow [0, \infty)$ such that for all intervals $[a, b]$:
$$P(a \leq X \leq b) = \int_a^b f_X(x) \, dx$$
The function $f_X$ is called the probability density function (PDF) of $X$.
Key Properties:

- Non-negativity: $f_X(x) \geq 0$ for all $x \in \mathbb{R}$
- Normalization: $\int_{-\infty}^{\infty} f_X(x) \, dx = 1$ (total probability is 1)
- Zero point probabilities: $P(X = x) = \int_x^x f_X(t) \, dt = 0$ for every single value $x$
Critical Distinction: PDF vs PMF
A common source of confusion: the PDF value $f_X(x)$ is not a probability. It's a density—probability per unit interval. This has immediate consequences: $f_X(x)$ can exceed 1 (only the integral over the support must equal 1), and for small $\Delta x$, the product $f_X(x) \, \Delta x$ approximates the probability $P(x \leq X \leq x + \Delta x)$.
Intuition via Histogram:
Imagine sampling $X$ many times and plotting a histogram with very narrow bins of width $\Delta x$. The height of each bar, divided by $\Delta x$, approaches $f(x)$ as sample size increases and bin width shrinks. The PDF is the limiting shape of a normalized histogram.
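A quick sketch of this limiting-histogram claim, assuming standard-normal samples:

```python
import numpy as np
from scipy import stats

# A histogram of many samples, normalized by (sample size × bin width),
# approximates the PDF: height/Δx → f(x).
rng = np.random.default_rng(0)
samples = rng.standard_normal(200_000)

counts, edges = np.histogram(samples, bins=200, range=(-4, 4))
bin_width = edges[1] - edges[0]
density = counts / (len(samples) * bin_width)   # normalized height
centers = (edges[:-1] + edges[1:]) / 2

# Compare the empirical density against the true N(0, 1) PDF
max_err = np.max(np.abs(density - stats.norm.pdf(centers)))
print(f"Max |histogram density - true PDF| ≈ {max_err:.4f}")
```

With more samples and narrower bins, the maximum deviation shrinks toward zero.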
```python
from scipy import stats
from scipy.integrate import quad

def demonstrate_pdf_properties():
    """Demonstrate key properties of probability density functions."""
    print("Probability Density Function Properties")
    print("=" * 55)

    # Example 1: Uniform distribution on [0, 0.5]
    # PDF = 2.0 on [0, 0.5], zero elsewhere
    a, b = 0, 0.5
    uniform_pdf = 1 / (b - a)  # = 2.0
    print("Example 1: Uniform[0, 0.5]")
    print(f"  PDF value: f(x) = {uniform_pdf}")
    print("  Note: f(x) = 2.0 > 1, but this is NOT a probability!")

    # Verify normalization
    integral, _ = quad(lambda x: uniform_pdf, a, b)
    print(f"  ∫ f(x)dx from 0 to 0.5 = {integral:.4f} (normalized correctly)")

    # Example 2: Standard Normal
    print("Example 2: Standard Normal N(0, 1)")
    normal = stats.norm(0, 1)
    points = [0, 1, 2, -1]
    print("  PDF values (not probabilities!):")
    for x in points:
        print(f"    f({x:2d}) = {normal.pdf(x):.6f}")
    # At x=0, f(0) ≈ 0.399, the peak for the standard normal.
    # That value still has meaning: probability mass is concentrated near 0.

    # Computing actual probabilities via the CDF
    print("  Actual probabilities (via integration):")
    print(f"    P(-1 ≤ X ≤ 1) = {normal.cdf(1) - normal.cdf(-1):.6f}")
    print(f"    P(-2 ≤ X ≤ 2) = {normal.cdf(2) - normal.cdf(-2):.6f}")
    print("    P(X = 0)      = 0.000000 (point probability is zero)")

    # Example 3: Exponential distribution
    print("Example 3: Exponential(λ=3)")
    exp_dist = stats.expon(scale=1/3)  # scipy uses scale = 1/λ
    pdf_at_0 = exp_dist.pdf(0)
    print(f"  f(0) = λ = {pdf_at_0:.4f}")
    print("  Again, f(0) = 3 > 1, perfectly valid for a density!")

    # Key insight: integrate to get probabilities
    print(f"  P(X ≤ 1) = {exp_dist.cdf(1):.6f}")
    print(f"  P(X ≤ 2) = {exp_dist.cdf(2):.6f}")

demonstrate_pdf_properties()
```

Never interpret $f(x)$ as $P(X = x)$.
If you see f(0.3) = 2.5 for some distribution and think 'the probability of 0.3 is 2.5,' you've made the classic error. PDF values are densities; they must be integrated over an interval to yield probabilities. P(X = 0.3) = 0 always for continuous X.
Just as with discrete random variables, the support of a continuous random variable is the set of values it can actually take—where the PDF is positive.
Definition (Support of Continuous RV):
$$\text{supp}(X) = \overline{\{x \in \mathbb{R} : f_X(x) > 0\}}$$
(The overline denotes closure—including limit points. For practical purposes, it's where $f(x) > 0$.)
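As a small illustration, scipy's continuous distributions return a density of zero outside their support and also expose the support directly:

```python
from scipy import stats

# The PDF is zero outside the support; evaluating it there is valid
# but contributes no probability.
beta = stats.beta(2, 5)        # support [0, 1]
expon = stats.expon(scale=1)   # support [0, ∞)

print(beta.pdf(-0.5), beta.pdf(1.5))   # 0.0 outside [0, 1]
print(expon.pdf(-1.0))                 # 0.0 for negative x
print(beta.pdf(0.25) > 0)              # positive inside the support

# Frozen scipy distributions report their support as a pair:
print(beta.support())
print(expon.support())
```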
Common Support Types:
| Distribution | Parameters | Support | ML Use Case |
|---|---|---|---|
| Uniform | a, b | [a, b] | Random initialization, priors |
| Normal (Gaussian) | μ, σ² | ℝ = (-∞, +∞) | Regression outputs, latent variables |
| Exponential | λ > 0 | [0, +∞) | Time between events, positive quantities |
| Beta | α, β > 0 | [0, 1] | Probabilities, proportions |
| Gamma | α, β > 0 | [0, +∞) | Positive continuous quantities |
| Log-Normal | μ, σ² | (0, +∞) | Multiplicative processes, sizes |
| Truncated Normal | μ, σ, a, b | [a, b] | Bounded continuous quantities |
Why Support Matters in ML:
1. Model-Data Compatibility: If your target variable is strictly positive (e.g., prices, durations), using a Normal distribution (support = ℝ) allows your model to predict impossible negative values. Better choices: Exponential, Gamma, or Log-Normal, whose supports exclude negative values.
2. Numerical Stability: Evaluating log f(x) when x is outside the support gives log(0) = -∞. This breaks gradient descent. Either transform the data into the support (e.g., model log-prices instead of prices), clip observations to the support with a small epsilon, or choose a distribution whose support matches the data.
3. Generative Models: When generating samples, the support constrains what values are possible. A VAE with Gaussian latent space can produce any real vector; one with Beta-distributed components produces values in [0, 1]ⁿ.
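A small illustration of the log(0) issue from point 2, with clipping shown as one quick workaround (an assumption-laden fix; matching the distribution's support to the data is usually better):

```python
import numpy as np
from scipy import stats

# Evaluating logpdf outside the support yields -inf, which poisons
# any loss or gradient computed from it.
gamma = stats.gamma(2, scale=1.0)    # support [0, ∞)
data = np.array([0.5, 2.0, -0.1])    # -0.1 lies outside the support

log_liks = gamma.logpdf(data)
print(log_liks)           # last entry is -inf
print(np.sum(log_liks))   # total log-likelihood becomes -inf

# Quick fix: clip data into the support with a small epsilon
eps = 1e-9
safe = np.clip(data, eps, None)
print(np.isfinite(gamma.logpdf(safe)).all())
```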
```python
import numpy as np
from scipy import stats

def demonstrate_support_issues():
    """Demonstrate problems that arise from support mismatch."""
    print("Support Mismatch Issues in ML")
    print("=" * 55)

    # Problem: modeling strictly positive data (e.g., house prices)
    # with a Gaussian distribution.
    # Simulated house prices (log-normally distributed, hence positive)
    np.random.seed(42)
    true_prices = np.random.lognormal(mean=12, sigma=0.5, size=1000)
    print("Data: House prices (strictly positive)")
    print(f"  Min:  ${true_prices.min():,.0f}")
    print(f"  Max:  ${true_prices.max():,.0f}")
    print(f"  Mean: ${true_prices.mean():,.0f}")

    # BAD: fit a Gaussian (support = ℝ)
    gaussian_fit = stats.norm.fit(true_prices)
    gaussian_dist = stats.norm(*gaussian_fit)

    # What's the probability of negative prices?
    p_negative = gaussian_dist.cdf(0)
    print("BAD: Gaussian model")
    print(f"  Fitted μ = ${gaussian_fit[0]:,.0f}, σ = ${gaussian_fit[1]:,.0f}")
    print(f"  P(price < 0) = {p_negative:.6f}")
    print(f"  This model assigns {p_negative*100:.4f}% probability to impossible negative prices!")

    # If we sample from this model:
    gaussian_samples = gaussian_dist.rvs(10000)
    n_negative = np.sum(gaussian_samples < 0)
    print(f"  In 10,000 samples: {n_negative} are negative (invalid)")

    # GOOD: fit a Log-Normal (support = (0, ∞))
    # If Y ~ Normal(μ, σ²), then X = exp(Y) ~ LogNormal(μ, σ²)
    log_prices = np.log(true_prices)
    lognormal_params = stats.norm.fit(log_prices)  # fit normal to log-data
    print("GOOD: Log-Normal model")
    print("  Support: (0, ∞) — matches data constraint")
    print("  P(price < 0) = 0 (impossible by construction)")

    # Sample from log-normal: all samples are positive
    lognormal_samples = np.exp(stats.norm(*lognormal_params).rvs(10000))
    n_negative_ln = np.sum(lognormal_samples < 0)
    print(f"  In 10,000 samples: {n_negative_ln} are negative (as expected: 0)")

    # Log-likelihood comparison
    ll_gaussian = np.mean(gaussian_dist.logpdf(true_prices))
    ll_lognormal = np.mean(
        stats.lognorm.logpdf(true_prices,
                             s=lognormal_params[1],
                             scale=np.exp(lognormal_params[0])))
    print("Log-likelihood comparison:")
    print(f"  Gaussian:   {ll_gaussian:.4f}")
    print(f"  Log-Normal: {ll_lognormal:.4f}")
    print("  Log-Normal fits better (higher is better)")

demonstrate_support_issues()
```

The fundamental operation with continuous random variables is computing probabilities of events, which requires integration.
Probability of an Interval: $$P(a \leq X \leq b) = \int_a^b f_X(x) \, dx$$
Key Observations:
Endpoint Flexibility: For continuous RVs, $P(a < X < b) = P(a \leq X < b) = P(a < X \leq b) = P(a \leq X \leq b)$ because $P(X = a) = 0$. Boundaries don't matter.
Complement Rule: $P(X > a) = 1 - P(X \leq a) = 1 - F_X(a)$ where $F_X$ is the CDF.
Union of Disjoint Intervals: For disjoint intervals $A$ and $B$: $$P(X \in A \cup B) = \int_A f(x) dx + \int_B f(x) dx$$
```python
from scipy import stats, integrate

class ContinuousPDF:
    """
    Wrapper class demonstrating probability computations
    with continuous distributions.
    """

    def __init__(self, distribution):
        """
        Args:
            distribution: A scipy.stats frozen distribution object
        """
        self.dist = distribution

    def prob_interval(self, a: float, b: float) -> float:
        """
        Compute P(a ≤ X ≤ b) via CDF difference.

        This is more accurate than numerical integration
        when an analytical CDF is available.
        """
        return self.dist.cdf(b) - self.dist.cdf(a)

    def prob_interval_integrate(self, a: float, b: float) -> float:
        """
        Compute P(a ≤ X ≤ b) via numerical integration.

        Useful when only the PDF is available (no closed-form CDF).
        """
        result, _ = integrate.quad(self.dist.pdf, a, b)
        return result

    def prob_greater(self, a: float) -> float:
        """Compute P(X > a) = 1 - P(X ≤ a)."""
        return 1 - self.dist.cdf(a)

    def prob_less(self, a: float) -> float:
        """Compute P(X < a) = P(X ≤ a) (for continuous RV)."""
        return self.dist.cdf(a)

    def prob_outside(self, a: float, b: float) -> float:
        """Compute P(X < a or X > b)."""
        return self.dist.cdf(a) + (1 - self.dist.cdf(b))

    def quantile(self, p: float) -> float:
        """Compute x such that P(X ≤ x) = p."""
        return self.dist.ppf(p)

# Demonstration with Standard Normal
print("Probability Computations with N(0, 1)")
print("=" * 55)
normal = ContinuousPDF(stats.norm(0, 1))

print("Basic Probabilities:")
print(f"  P(-1 ≤ X ≤ 1) = {normal.prob_interval(-1, 1):.6f}")
print(f"  P(-2 ≤ X ≤ 2) = {normal.prob_interval(-2, 2):.6f}")
print(f"  P(-3 ≤ X ≤ 3) = {normal.prob_interval(-3, 3):.6f}")

print("The 68-95-99.7 Rule:")
print(f"  Within 1σ: {normal.prob_interval(-1, 1)*100:.1f}%")
print(f"  Within 2σ: {normal.prob_interval(-2, 2)*100:.1f}%")
print(f"  Within 3σ: {normal.prob_interval(-3, 3)*100:.1f}%")

print("Tail Probabilities:")
print(f"  P(X > 1.645) = {normal.prob_greater(1.645):.6f} (≈ 5%)")
print(f"  P(X > 1.96)  = {normal.prob_greater(1.96):.6f} (≈ 2.5%)")
print(f"  P(X > 2.576) = {normal.prob_greater(2.576):.6f} (≈ 0.5%)")

print("Quantiles (inverse CDF):")
print(f"  95th percentile: x = {normal.quantile(0.95):.4f}")
print(f"  99th percentile: x = {normal.quantile(0.99):.4f}")
print(f"  Median (50th):   x = {normal.quantile(0.50):.4f}")

print("Verification via Numerical Integration:")
p_via_cdf = normal.prob_interval(-1.5, 0.5)
p_via_int = normal.prob_interval_integrate(-1.5, 0.5)
print(f"  P(-1.5 ≤ X ≤ 0.5) via CDF: {p_via_cdf:.10f}")
print(f"  P(-1.5 ≤ X ≤ 0.5) via ∫:   {p_via_int:.10f}")
print(f"  Difference: {abs(p_via_cdf - p_via_int):.2e}")
```

If your distribution has a closed-form CDF, use $P(a \leq X \leq b) = F(b) - F(a)$ instead of numerically integrating the PDF. It's faster, more accurate, and avoids numerical integration pitfalls like handling infinite domains or near-singular functions.
When we apply a function to a continuous random variable, we get a new random variable with a different distribution. Unlike the discrete case (where we just tracked which values mapped together), the continuous case requires careful calculus.
The Change of Variables Formula:
Let $X$ be a continuous RV with PDF $f_X$, and let $Y = g(X)$ where $g$ is a strictly monotonic (strictly increasing or decreasing) differentiable function. Then $Y$ has PDF:
$$f_Y(y) = f_X(g^{-1}(y)) \cdot \left| \frac{d}{dy} g^{-1}(y) \right|$$
The term $|\frac{d}{dy} g^{-1}(y)|$ is called the Jacobian factor—it accounts for how the transformation stretches or compresses probability density.
Intuition:
Consider $Y = 2X$ where $X \sim \text{Uniform}(0, 1)$. Then $Y$ ranges over $[0, 2]$. But what's the PDF of $Y$?
Using the formula: $g(x) = 2x$, so $g^{-1}(y) = y/2$ and $\frac{d}{dy} g^{-1}(y) = \frac{1}{2}$. Therefore
$$f_Y(y) = f_X(y/2) \cdot \frac{1}{2} = \frac{1}{2} \quad \text{for } y \in [0, 2].$$
The interval doubled in length, so the density halved to keep total probability equal to 1.
The Jacobian compensates for the stretching.
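A quick numerical check of this $Y = 2X$ case, comparing a histogram of transformed samples against the predicted density of $1/2$:

```python
import numpy as np

# Y = 2X with X ~ Uniform(0, 1): change of variables gives
# f_Y(y) = f_X(y/2) · |d(y/2)/dy| = 1 · 1/2 = 1/2 on [0, 2].
rng = np.random.default_rng(0)
y = 2 * rng.uniform(0, 1, size=100_000)

counts, edges = np.histogram(y, bins=20, range=(0, 2))
density = counts / (len(y) * (edges[1] - edges[0]))
print(f"Empirical density over [0, 2]: min={density.min():.3f}, max={density.max():.3f}")
# Every bin should sit near 0.5 — the Jacobian halves the uniform density.
```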
```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def transform_continuous_rv():
    """Demonstrate transformations of continuous random variables."""
    print("Continuous RV Transformations")
    print("=" * 55)

    # Example 1: Linear transformation Y = aX + b
    # If X ~ N(μ, σ²), then Y = aX + b ~ N(aμ + b, a²σ²)
    print("Example 1: Linear Transform Y = 2X + 3")
    print("  X ~ N(0, 1)")
    X = stats.norm(0, 1)

    # Analytical result
    a, b = 2, 3
    mu_Y = a * 0 + b      # = 3
    sigma_Y = abs(a) * 1  # = 2

    # Verify via sampling
    samples_X = X.rvs(100_000)
    samples_Y = a * samples_X + b
    print(f"  Analytical: Y ~ N({mu_Y}, {sigma_Y**2})")
    print(f"  Empirical:  mean = {samples_Y.mean():.4f}, std = {samples_Y.std():.4f}")

    # Example 2: Squaring Y = X²
    # If X ~ N(0, 1), then X² ~ χ²(1) (chi-squared with 1 df)
    print("Example 2: Y = X² where X ~ N(0, 1)")
    print("  Result: Y ~ χ²(1)")
    samples_X2 = samples_X ** 2
    chi2_1 = stats.chi2(df=1)
    print(f"  Empirical mean: {samples_X2.mean():.4f}")
    print(f"  χ²(1) mean:     {chi2_1.mean():.4f}")
    print(f"  Empirical var:  {samples_X2.var():.4f}")
    print(f"  χ²(1) variance: {chi2_1.var():.4f}")

    # Example 3: Exponential transformation Y = exp(X) (Log-Normal)
    # If X ~ N(μ, σ²), then Y = exp(X) ~ LogNormal(μ, σ²)
    print("Example 3: Y = exp(X) where X ~ N(0, 1)")
    print("  Result: Y ~ LogNormal(0, 1)")
    samples_exp = np.exp(samples_X)
    lognorm_01 = stats.lognorm(s=1, scale=1)  # s = σ, scale = exp(μ)
    print(f"  Empirical mean: {samples_exp.mean():.4f}")
    print(f"  LogNormal mean: {lognorm_01.mean():.4f}")

    # Example 4: The log transform (inverse of exp)
    # If Y ~ LogNormal(μ, σ²), then log(Y) ~ N(μ, σ²)
    print("Example 4: Z = log(Y) where Y ~ LogNormal")
    print("  This recovers the original Normal distribution")
    samples_log = np.log(samples_exp)
    print(f"  Original X ~ N(0,1), Empirical: mean={samples_log.mean():.4f}, std={samples_log.std():.4f}")

def jacobian_example():
    """Demonstrate the Jacobian correction in detail."""
    print("=" * 55)
    print("The Jacobian Factor: Why It Matters")
    print("=" * 55)

    # X ~ Uniform(0, 1), f_X(x) = 1
    # Y = X² ∈ [0, 1]
    # g(x) = x², g⁻¹(y) = √y, dg⁻¹/dy = 1/(2√y)
    # Correct PDF: f_Y(y) = f_X(√y) · |1/(2√y)| = 1/(2√y)
    print("X ~ Uniform(0, 1), Y = X²")
    print("Without Jacobian: f_Y(y) = f_X(√y) = 1 (WRONG)")
    print("With Jacobian:    f_Y(y) = 1/(2√y)  (CORRECT)")

    # Note: both candidate densities happen to integrate to 1 on [0, 1],
    # so a normalization check alone cannot expose the error.
    # Instead, check a probability: P(Y ≤ 0.25) = P(X ≤ 0.5) = 0.5
    wrong_pdf = lambda y: 1.0
    correct_pdf = lambda y: 1 / (2 * np.sqrt(y))
    p_wrong, _ = quad(wrong_pdf, 0, 0.25)
    p_correct, _ = quad(correct_pdf, 1e-12, 0.25)
    print("Check P(Y ≤ 0.25), which must equal P(X ≤ 0.5) = 0.5:")
    print(f"  Without Jacobian: {p_wrong:.4f} (wrong!)")
    print(f"  With Jacobian:    {p_correct:.4f} (correct)")

transform_continuous_rv()
jacobian_example()
```

The Jacobian factor appears in normalizing flows, a class of generative models that transform simple distributions (like Gaussians) into complex ones through a chain of invertible functions. The log-likelihood requires computing |det J| at each step—a computational bottleneck that drives architectural choices in flow-based models.
Certain continuous distributions appear repeatedly in machine learning. Mastering their properties is essential for both understanding existing models and designing new ones.
```python
import numpy as np
from scipy import stats

def compare_key_distributions():
    """Compare properties of key continuous distributions."""
    print("Key Continuous Distributions for ML")
    print("=" * 60)

    distributions = [
        ("Uniform(0, 1)", stats.uniform(0, 1),      "[0, 1]"),
        ("Normal(0, 1)",  stats.norm(0, 1),         "ℝ"),
        ("Exp(λ=1)",      stats.expon(scale=1),     "[0, ∞)"),
        ("Beta(2, 5)",    stats.beta(2, 5),         "[0, 1]"),
        ("Gamma(2, 1)",   stats.gamma(2, scale=1),  "[0, ∞)"),
    ]

    print(f"{'Distribution':<20} {'Support':<12} {'Mean':>8} {'Var':>8} {'Median':>8}")
    print("-" * 60)
    for name, dist, support in distributions:
        print(f"{name:<20} {support:<12} {dist.mean():>8.4f} "
              f"{dist.var():>8.4f} {dist.median():>8.4f}")

    # ML Application: Regression residuals
    print("\nML Application: Modeling Regression Residuals")
    print("=" * 60)

    # Generate synthetic residuals (should be normal if homoscedastic)
    np.random.seed(42)
    true_residuals = np.random.normal(0, 2, 500)

    # Fit normal distribution
    mu, sigma = stats.norm.fit(true_residuals)
    print(f"Fitted Normal: μ = {mu:.4f}, σ = {sigma:.4f}")

    # Check fit quality with Shapiro-Wilk test
    stat, p_value = stats.shapiro(true_residuals[:50])  # use subset for speed
    print(f"Shapiro-Wilk test: p = {p_value:.4f}")
    print(f"Normality assumption: {'Supported' if p_value > 0.05 else 'Rejected'}")

    # ML Application: Bayesian priors
    print("\nML Application: Beta Distribution as Prior")
    print("=" * 60)

    priors = [
        ("Uninformative",        1, 1),   # Uniform
        ("Weak belief p≈0.5",    2, 2),   # Centered, low confidence
        ("Strong belief p≈0.5", 10, 10),  # Centered, high confidence
        ("Belief p≈0.2",         2, 8),   # Skewed toward small p
    ]

    print(f"{'Prior Type':<25} {'α':>4} {'β':>4} {'Mean':>8} {'Mode':>8} {'Std':>8}")
    print("-" * 60)
    for name, a, b in priors:
        dist = stats.beta(a, b)
        mode = (a - 1) / (a + b - 2) if a > 1 and b > 1 else 'N/A'
        mode_str = f"{mode:.4f}" if isinstance(mode, float) else mode
        print(f"{name:<25} {a:>4} {b:>4} {dist.mean():>8.4f} {mode_str:>8} {dist.std():>8.4f}")

compare_key_distributions()
```

Not all random variables are purely discrete or purely continuous. Real ML systems often involve mixed distributions that have both continuous components and discrete point masses.
Definition (Mixed Distribution):
A random variable $X$ has a mixed distribution if its CDF can be written as: $$F_X(x) = p \cdot F_d(x) + (1-p) \cdot F_c(x)$$
where $F_d$ is a discrete CDF, $F_c$ is a continuous CDF, and $p \in (0, 1)$.
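As a concrete sketch (an illustrative example, not from the definition above): $X = \max(0, Z)$ with $Z \sim N(0, 1)$, i.e. a ReLU applied to Gaussian noise, is mixed with $p = 1/2$, a point mass $F_d$ concentrated at zero, and a continuous part $F_c$ on $(0, \infty)$:

```python
import numpy as np
from scipy import stats

# X = max(0, Z), Z ~ N(0, 1): a mixed distribution with
# P(X = 0) = P(Z ≤ 0) = 0.5 (discrete point mass at zero)
# plus a continuous Gaussian tail on (0, ∞).
rng = np.random.default_rng(0)
z = rng.standard_normal(100_000)
x = np.maximum(0, z)

print(f"Empirical P(X = 0) = {np.mean(x == 0):.4f} (theory: 0.5)")
print(f"Empirical P(X > 1) = {np.mean(x > 1):.4f} (theory: {1 - stats.norm.cdf(1):.4f})")
```

A nonzero point mass like this is exactly what a pure PDF cannot represent, which is why the CDF decomposition above is needed.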
```python
import numpy as np

def zero_inflated_poisson_demo():
    """
    Demonstrate a Zero-Inflated Poisson (ZIP) model,
    a mixed distribution common in ML applications.
    """
    print("Zero-Inflated Poisson Distribution")
    print("=" * 55)
    print("Model: With probability π, X=0 (structural zero)")
    print("       With probability 1-π, X ~ Poisson(λ)")

    # Parameters
    pi = 0.3       # Probability of structural zero
    lambda_ = 3.0  # Poisson parameter
    print(f"Parameters: π = {pi}, λ = {lambda_}")

    # Generate samples
    np.random.seed(42)
    n_samples = 10_000

    # Two-step generation
    structural_zero = np.random.binomial(1, pi, n_samples)  # 1 if structural zero
    poisson_samples = np.random.poisson(lambda_, n_samples)
    zip_samples = np.where(structural_zero == 1, 0, poisson_samples)

    # Compare to regular Poisson
    poisson_only = np.random.poisson(lambda_, n_samples)

    print("Empirical comparison:")
    print(f"  ZIP P(X=0)     = {np.mean(zip_samples == 0):.4f}")
    print(f"  Poisson P(X=0) = {np.mean(poisson_only == 0):.4f}")
    print(f"  Theoretical ZIP P(X=0)     = π + (1-π)exp(-λ) = {pi + (1-pi)*np.exp(-lambda_):.4f}")
    print(f"  Theoretical Poisson P(X=0) = exp(-λ) = {np.exp(-lambda_):.4f}")
    print(f"  ZIP mean     = {np.mean(zip_samples):.4f}")
    print(f"  Poisson mean = {np.mean(poisson_only):.4f}")
    print(f"  Theoretical ZIP mean = (1-π)λ = {(1-pi)*lambda_:.4f}")

    # This matters for ML because naive Poisson regression on
    # zero-inflated data will give biased parameter estimates
    print("⚠️ Using standard Poisson on ZIP data biases all estimates!")
    print("   Always check for excess zeros in count data.")

def censored_data_demo():
    """Demonstrate censored data, another mixed distribution."""
    print("\nCensored Data (Right Censoring)")
    print("=" * 55)
    print("Model: True X ~ Exponential(λ), but we only observe min(X, C)")

    lambda_ = 0.5  # True rate
    C = 3.0        # Censoring threshold
    n_samples = 10_000

    np.random.seed(42)
    true_values = np.random.exponential(1 / lambda_, n_samples)
    observed = np.minimum(true_values, C)
    is_censored = true_values >= C

    print(f"True λ = {lambda_}, Censoring at C = {C}")
    print(f"Censoring rate: {np.mean(is_censored)*100:.1f}%")

    # Naive mean estimate (biased)
    naive_mean = np.mean(observed)
    true_mean = 1 / lambda_
    print("Mean estimation:")
    print(f"  True mean (1/λ): {true_mean:.4f}")
    print(f"  Naive mean:      {naive_mean:.4f} (biased downward!)")
    print(f"  Bias:            {(naive_mean - true_mean):.4f}")
    print("⚠️ Ignoring censoring leads to biased estimates!")
    print("   Use survival analysis methods (Kaplan-Meier, Cox PH)")

zero_inflated_poisson_demo()
censored_data_demo()
```

When data comes from a mixed distribution but you fit a pure continuous or pure discrete model, your parameter estimates will be biased, often severely. Always examine your data for excess point masses (like zeros) or censoring before choosing a model family.
We've established the mathematical framework for continuous random variables—the foundation for modeling quantities that vary smoothly across uncountable domains.
What's Next:
The next page unifies our treatment of discrete and continuous random variables through the Cumulative Distribution Function (CDF) and provides a detailed treatment of PMFs and PDFs together. We'll see how the CDF provides a universal interface to probability distributions, regardless of whether they're discrete, continuous, or mixed.
You now understand continuous random variables and probability density functions. The conceptual shift from 'probability of a point' to 'density at a point' is fundamental—it unlocks the mathematics of regression, generative modeling, and all continuous inference in machine learning.