Having established sample spaces and events as the vocabulary of probability, we now face a fundamental question: How do we assign numbers to events in a consistent, meaningful way?
Intuitively, we might say 'probability is the frequency with which an event occurs if we repeat the experiment many times.' This frequentist interpretation is useful but immediately runs into problems: how many repetitions count as 'many'? And what probability should we assign to events that cannot be repeated at all, like tomorrow's weather?
Alternatively, we might interpret probability as 'degree of belief'—a Bayesian view. But then whose beliefs count, and what keeps those degrees of belief internally consistent?
The solution, developed by Andrey Kolmogorov in 1933, was to axiomatize probability—to specify minimal rules that any valid probability assignment must satisfy. These axioms don't tell us what probability means; they tell us how probability must behave.
By the end of this page, you will understand Kolmogorov's axioms of probability, derive fundamental probability rules from these axioms, and recognize how these abstract principles underpin concrete ML computations like loss functions, model calibration, and probabilistic predictions.
Let (Ω, ℱ) be a measurable space where Ω is the sample space and ℱ is a σ-algebra of events. A probability measure P is a function P: ℱ → [0, 1] that assigns a number to each event, satisfying three axioms:
For any event A ∈ ℱ:
P(A) ≥ 0
Probabilities cannot be negative. This seems obvious but is essential for interpretation—a negative probability has no meaning.
P(Ω) = 1
The probability that something happens is 1. The sample space, containing all possible outcomes, has probability 1 because exactly one outcome must occur.
For any countable sequence of mutually exclusive events A₁, A₂, A₃, ... ∈ ℱ (i.e., Aᵢ ∩ Aⱼ = ∅ for i ≠ j):
P(∪ᵢ₌₁^∞ Aᵢ) = Σᵢ₌₁^∞ P(Aᵢ)
If events cannot overlap, the probability of 'at least one occurring' equals the sum of their individual probabilities. This extends to infinitely many events, which is crucial for continuous probability.
The triple (Ω, ℱ, P) is called a probability space.
These three axioms appear deceptively simple, yet they are sufficient to derive ALL of probability theory. Every theorem about expectations, variances, conditional probabilities, limit theorems, and more follows from these three statements. This is the power of axiomatic mathematics: from minimal assumptions, vast structures emerge.
| Axiom | Statement | Intuition | Why Essential |
|---|---|---|---|
| Non-Negativity | P(A) ≥ 0 for all events A | Probability measures 'how likely'—can't be negative | Enables interpretation as proportion/frequency |
| Normalization | P(Ω) = 1 | Something must happen | Provides reference scale for all probabilities |
| Countable Additivity | P(∪ Aᵢ) = Σ P(Aᵢ) for disjoint Aᵢ | Non-overlapping events add up | Allows infinite sums; essential for continuous distributions |
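The three axioms in the table can be checked mechanically on a tiny example. Below is a minimal sketch, using exact `Fraction` arithmetic, that verifies them for the uniform measure on a fair die; the helper `P` is illustrative, not a library function:

```python
from fractions import Fraction

# Uniform measure on a fair die: P(A) = |A| / |Ω| (illustrative helper)
omega = {1, 2, 3, 4, 5, 6}

def P(event: set) -> Fraction:
    return Fraction(len(event & omega), len(omega))

# Axiom 1 (non-negativity): every event has P ≥ 0
assert all(P({w}) >= 0 for w in omega)
# Axiom 2 (normalization): P(Ω) = 1
assert P(omega) == 1
# Axiom 3 (additivity, finite case): disjoint events add
evens, odds = {2, 4, 6}, {1, 3, 5}
assert evens & odds == set()
assert P(evens | odds) == P(evens) + P(odds)
```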
From Kolmogorov's three axioms, we can derive many fundamental properties of probability. These derivations demonstrate the power of axiomatic reasoning—each property is a logical consequence, not an additional assumption.
Theorem: P(∅) = 0
Proof: Write Ω = Ω ∪ ∅ ∪ ∅ ∪ ..., a countable union of pairwise disjoint events. By countable additivity, P(Ω) = P(Ω) + Σᵢ P(∅), so Σᵢ P(∅) = 0. Since P(∅) ≥ 0 by Axiom 1, we must have P(∅) = 0. ∎
Interpretation: An impossible event has zero probability—nothing can occur outside all possibilities.
Theorem: P(A^c) = 1 - P(A)
Proof: A and A^c are disjoint, and A ∪ A^c = Ω. By additivity, P(A) + P(A^c) = P(Ω) = 1 (Axiom 2), so P(A^c) = 1 − P(A). ∎
Interpretation: The probability of 'not A' equals one minus the probability of A. If there's a 70% chance of rain, there's a 30% chance of no rain.
ML Application: If a classifier predicts class 1 with probability 0.8, it implicitly predicts class 0 (in binary classification) with probability 0.2. This is the foundation of probabilistic predictions.
Theorem: For any event A, 0 ≤ P(A) ≤ 1
Proof: P(A) ≥ 0 is exactly Axiom 1. For the upper bound, the complement rule gives P(A) = 1 − P(A^c), and P(A^c) ≥ 0 by Axiom 1, so P(A) ≤ 1. ∎
Interpretation: Probabilities are always between 0 and 1. This seems obvious but is a derived property, not an axiom.
ML Application: This is why neural network outputs for classification use softmax or sigmoid—to constrain outputs to [0, 1] and be interpretable as probabilities.
Theorem: If A ⊆ B, then P(A) ≤ P(B)
Proof: Write B = A ∪ (B ∩ A^c), a disjoint union. By additivity, P(B) = P(A) + P(B ∩ A^c) ≥ P(A), since P(B ∩ A^c) ≥ 0 by Axiom 1. ∎
Interpretation: If event A is contained in event B (A implies B), then A is at most as likely as B. 'Drawing an ace' is less likely than 'drawing a face card or ace.'
ML Application: If your model's correct predictions are a subset of another model's correct predictions, the second model's accuracy is at least as high.
Notice how each proof uses only the three axioms and previously proven properties. This chain of logical deduction is the essence of mathematics. Every probability theorem you'll ever use—from Bayes' theorem to the central limit theorem—ultimately traces back to these three axioms.
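To see these derived rules in action, here is a small sketch checking the complement rule, monotonicity, and the bounds numerically for the fair-die measure (the helper `P` is again illustrative, not from any library):

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def P(event: set) -> Fraction:
    # Classical measure on a fair die (illustrative helper)
    return Fraction(len(event & omega), len(omega))

low = {1, 2, 3}
aces = {1}

# Complement rule: P(A^c) = 1 - P(A)
assert P(omega - low) == 1 - P(low)
# Monotonicity: A ⊆ B implies P(A) ≤ P(B)
assert aces <= low and P(aces) <= P(low)
# Bounds: 0 ≤ P(A) ≤ 1
assert 0 <= P(low) <= 1
```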
Axiom 3 tells us how to compute P(A ∪ B) when A and B are disjoint. But what if A and B overlap? We need the General Addition Rule (also called the Inclusion-Exclusion Principle for two sets).
Theorem: For any events A and B:
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
Proof: Decompose A ∪ B into the disjoint pieces A and B ∩ A^c. By additivity:
P(A ∪ B) = P(A) + P(B ∩ A^c)
Also, B = (A ∩ B) ∪ (B ∩ A^c) is a disjoint union, so:
P(B ∩ A^c) = P(B) − P(A ∩ B)
Substituting yields P(A ∪ B) = P(A) + P(B) − P(A ∩ B). ∎
When we add P(A) + P(B), outcomes in the intersection A ∩ B get counted twice—once when counting A, once when counting B. Subtracting P(A ∩ B) corrects this double-counting. This is the intuition behind inclusion-exclusion.
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) - P(A ∩ B) - P(A ∩ C) - P(B ∩ C) + P(A ∩ B ∩ C)
The pattern: add singles, subtract pairs, add triples. This extends to n events:
P(∪ᵢ₌₁ⁿ Aᵢ) = Σᵢ P(Aᵢ) - Σᵢ<ⱼ P(Aᵢ ∩ Aⱼ) + Σᵢ<ⱼ<ₖ P(Aᵢ ∩ Aⱼ ∩ Aₖ) - ... + (-1)ⁿ⁺¹ P(∩ᵢ₌₁ⁿ Aᵢ)
```python
from typing import List, Set, Any
from itertools import combinations


def probability_of_union(
    events: List[Set[Any]],
    sample_space: Set[Any],
) -> float:
    """
    Compute P(A₁ ∪ A₂ ∪ ... ∪ Aₙ) using inclusion-exclusion.

    For finite sample spaces, this computes exact probabilities
    by working with the sets directly.

    Parameters:
        events: List of event sets
        sample_space: The complete sample space Ω

    Returns:
        Probability of at least one event occurring
    """
    n = len(events)
    if n == 0:
        return 0.0

    omega_size = len(sample_space)
    total_prob = 0.0

    for k in range(1, n + 1):
        # Sum over all k-subsets of events
        sign = (-1) ** (k + 1)  # Alternating signs
        for event_subset in combinations(events, k):
            # Compute intersection of all events in subset
            intersection = event_subset[0]
            for event in event_subset[1:]:
                intersection = intersection & event
            # Add (or subtract) probability of this intersection
            total_prob += sign * len(intersection) / omega_size

    return total_prob


def probability_of_at_least_one_simple(
    events: List[Set[Any]],
    sample_space: Set[Any],
) -> float:
    """
    Alternative: compute P(at least one) by finding the actual union.

    Simpler but less illustrative of inclusion-exclusion.
    """
    union = set()
    for event in events:
        union = union | event
    return len(union) / len(sample_space)


# Example: Two dice, multiple events
if __name__ == "__main__":
    # Sample space for two dice
    Omega = {(i, j) for i in range(1, 7) for j in range(1, 7)}

    # Define events
    A = {(i, j) for (i, j) in Omega if i + j == 7}  # Sum is 7
    B = {(i, j) for (i, j) in Omega if i == j}      # Doubles
    C = {(i, j) for (i, j) in Omega if i >= 5}      # First die ≥ 5

    print("Event A (sum = 7):", len(A), "outcomes")
    print("Event B (doubles):", len(B), "outcomes")
    print("Event C (first ≥ 5):", len(C), "outcomes")

    # Two-event inclusion-exclusion
    p_a = len(A) / len(Omega)
    p_b = len(B) / len(Omega)
    p_a_and_b = len(A & B) / len(Omega)

    print(f"\nP(A) = {p_a:.4f}")
    print(f"P(B) = {p_b:.4f}")
    print(f"P(A ∩ B) = {p_a_and_b:.4f}")
    print(f"P(A ∪ B) via inclusion-exclusion = {p_a + p_b - p_a_and_b:.4f}")
    print(f"P(A ∪ B) via direct computation = {len(A | B) / len(Omega):.4f}")

    # Three-event inclusion-exclusion
    print(f"\nP(A ∪ B ∪ C) via function = {probability_of_union([A, B, C], Omega):.4f}")
    print(f"P(A ∪ B ∪ C) via direct = {probability_of_at_least_one_simple([A, B, C], Omega):.4f}")
```

A powerful consequence of countable additivity (Axiom 3) is that probability behaves 'continuously' with respect to limits of events. This is crucial for working with infinite sequences and continuous distributions.
Theorem (Continuity from Below): If A₁ ⊆ A₂ ⊆ A₃ ⊆ ... is an increasing sequence of events, and A = ∪ᵢ₌₁^∞ Aᵢ, then:
P(A) = lim_{n→∞} P(Aₙ)
Theorem (Continuity from Above): If A₁ ⊇ A₂ ⊇ A₃ ⊇ ... is a decreasing sequence of events, and A = ∩ᵢ₌₁^∞ Aᵢ, then:
P(A) = lim_{n→∞} P(Aₙ)
If events get progressively larger (A₁ ⊆ A₂ ⊆ ...) and converge to some limit event A, the probabilities also converge: P(Aₙ) → P(A). There are no 'jumps' at infinity. This mirrors how continuous functions behave at limits and is essential for defining probability on continuous spaces.
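A discrete sketch of continuity from below, under the assumed mass function p(k) = 2⁻ᵏ on Ω = {1, 2, 3, ...}: the increasing events Aₙ = {1, ..., n} have P(Aₙ) = 1 − 2⁻ⁿ, which climbs monotonically toward P(Ω) = 1 with no jump at infinity.

```python
# Probability mass p(k) = 2^{-k} on Ω = {1, 2, 3, ...}; total mass is 1.
# The increasing events A_n = {1, ..., n} converge upward to Ω.
def P_An(n: int) -> float:
    """P(A_n) = p(1) + ... + p(n), which equals 1 - 2^{-n}."""
    return sum(2.0 ** -k for k in range(1, n + 1))

probs = [P_An(n) for n in (1, 5, 10, 30)]
# P(A_n) increases monotonically toward P(Ω) = 1
assert all(a < b for a, b in zip(probs, probs[1:]))
assert abs(P_An(30) - 1.0) < 1e-8
```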
The distinction between finite additivity (works for any finite collection of disjoint events) and countable additivity (works for countable collections) is subtle but crucial.
Finite additivity alone would not guarantee the continuity properties above, nor would it pin down probabilities for limits of events—for instance, it could not tell us that P(∪ᵢ₌₁^∞ Aᵢ) equals the limit of the partial sums for a disjoint sequence. Both are essential for continuous distributions and limit theorems.
Kolmogorov's choice of countable additivity as the third axiom was deliberate—it provides just enough structure to build all of probability theory without being so restrictive as to exclude useful probability spaces.
The axioms tell us what properties a probability measure must have. But how do we actually construct one? Different constructions lead to different probability spaces.
For finite sample spaces with equally likely outcomes:
P(A) = |A| / |Ω|
This satisfies all three axioms: |A|/|Ω| ≥ 0 because cardinalities are non-negative; P(Ω) = |Ω|/|Ω| = 1; and for disjoint A and B, |A ∪ B| = |A| + |B|, so P(A ∪ B) = P(A) + P(B).
| Experiment | Sample Space | Example Event | Probability |
|---|---|---|---|
| Fair coin flip | Ω = {H, T} | A = {H} | P(A) = 1/2 |
| Fair die roll | Ω = {1,2,3,4,5,6} | A = {even} = {2,4,6} | P(A) = 3/6 = 1/2 |
| Drawing a card | Ω = 52 cards | A = {aces} = 4 cards | P(A) = 4/52 = 1/13 |
| Two fair dice | Ω = 36 pairs | A = {sum = 7} = 6 pairs | P(A) = 6/36 = 1/6 |
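The table entries can be reproduced with a few lines of counting; `classical` below is a hypothetical one-line helper for |A|/|Ω|, used here only to check the arithmetic:

```python
from fractions import Fraction

# Classical probability: P(A) = |A| / |Ω| (hypothetical helper)
def classical(event_size: int, omega_size: int) -> Fraction:
    return Fraction(event_size, omega_size)

assert classical(1, 2) == Fraction(1, 2)    # fair coin, A = {H}
assert classical(3, 6) == Fraction(1, 2)    # fair die, A = {2, 4, 6}
assert classical(4, 52) == Fraction(1, 13)  # drawing a card, A = {aces}
# Two dice: count the pairs summing to 7 directly
pairs = [(i, j) for i in range(1, 7) for j in range(1, 7) if i + j == 7]
assert classical(len(pairs), 36) == Fraction(1, 6)
```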
For discrete sample spaces, we can assign probabilities directly to each outcome:
p(ω) = probability of outcome ω
Constraints (from axioms): p(ω) ≥ 0 for every ω ∈ Ω, and Σ_{ω ∈ Ω} p(ω) = 1.
Then for any event A:
P(A) = Σ_{ω ∈ A} p(ω)
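As a quick sketch, here is a made-up loaded die expressed as a mass function p(ω), with the axiom constraints checked and P(A) computed by summing over the outcomes in A:

```python
# A (fictional) loaded die as a discrete probability mass function p(ω)
p = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}

# Constraints from the axioms: non-negative, summing to 1
assert all(v >= 0 for v in p.values())
assert abs(sum(p.values()) - 1.0) < 1e-12

# P(A) = Σ_{ω ∈ A} p(ω), e.g. A = {even outcome}
A = {2, 4, 6}
P_A = sum(p[w] for w in A)
assert abs(P_A - 0.7) < 1e-12
```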
For continuous sample spaces like Ω = ℝ, we cannot assign positive probability to individual points (otherwise we'd exceed 1 by summing uncountably many). Instead, we use a probability density function (PDF) f(x):
P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx
Constraints: f(x) ≥ 0 for all x, and ∫_{−∞}^{∞} f(x) dx = 1.
The PDF gives probability per unit length, not probability of points.
Example: Uniform distribution on [0, 1]
f(x) = 1 for x ∈ [0, 1], and f(x) = 0 otherwise
P(0.2 ≤ X ≤ 0.5) = ∫_{0.2}^{0.5} 1 dx = 0.3
The value f(x) is NOT a probability—it's a density. For continuous random variables:
• P(X = 0.5) = 0 (probability of any single point is zero)
• f(0.5) can be any non-negative number
• f(x) can exceed 1 (e.g., uniform on [0, 0.1] has f(x) = 10)
Only integrals of f(x) yield probabilities, and those must be ≤ 1.
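A minimal numeric sketch of these points, using the uniform density on [0, 0.1] mentioned above; the `integrate` helper is a crude midpoint rule written for illustration only:

```python
def f(x: float) -> float:
    """Uniform density on [0, 0.1]: f(x) = 10 inside, 0 outside."""
    return 10.0 if 0.0 <= x <= 0.1 else 0.0

def integrate(a: float, b: float, n: int = 100_000) -> float:
    """Crude midpoint-rule approximation of the integral of f over [a, b]."""
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) * dx for i in range(n))

assert f(0.05) == 10.0                         # a density value above 1
assert abs(integrate(0.0, 0.1) - 1.0) < 1e-6   # total probability is 1
assert abs(integrate(0.0, 0.05) - 0.5) < 1e-6  # P(0 ≤ X ≤ 0.05) = 0.5
```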
Machine learning models that output 'probabilities' must satisfy Kolmogorov's axioms to be valid probability distributions. Let's examine how this manifests in practice.
For multi-class classification with K classes, a neural network outputs logits z₁, z₂, ..., zₖ. To convert these to valid probabilities, we apply softmax:
P(class i) = exp(zᵢ) / Σⱼ exp(zⱼ)
Let's verify the axioms: every exp(zᵢ) > 0, so each P(class i) > 0 (non-negativity); the probabilities sum to Σᵢ exp(zᵢ) / Σⱼ exp(zⱼ) = 1 (normalization); and additivity holds because the probability of any set of classes is the sum of the corresponding terms.
For binary classification, sigmoid maps a single logit z to a probability:
P(class 1) = σ(z) = 1 / (1 + exp(-z))
P(class 0) = 1 - σ(z) = σ(-z)
Verification: σ(z) ∈ (0, 1) for every real z, and σ(z) + σ(−z) = 1, so the two class probabilities are non-negative and sum to 1.
Poorly designed models can produce invalid 'probabilities': raw logits may be negative or exceed 1, unnormalized scores need not sum to 1, and independently thresholded per-class sigmoid outputs rarely sum to 1 across classes.
Conversely, even outputs that technically satisfy the axioms (they're in [0, 1] and sum to 1) may fail to estimate the true conditional probabilities accurately—a problem known as miscalibration.
A classifier is well-calibrated if, among all test examples where it predicts P(class 1) = 0.7, approximately 70% actually belong to class 1. The axioms ensure outputs are valid probabilities; calibration ensures they're accurate probabilities. Both matter for trustworthy ML systems.
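One common diagnostic is to bin predictions by confidence and compare each bin's mean predicted probability with its observed positive rate (a reliability diagram). The sketch below uses synthetic data that is calibrated by construction; `calibration_by_bins` is a hypothetical helper, not a library API:

```python
import numpy as np

def calibration_by_bins(probs: np.ndarray, labels: np.ndarray, n_bins: int = 10):
    """For each confidence bin, pair the mean predicted P(class 1)
    with the observed fraction of class-1 examples."""
    edges = np.linspace(0, 1, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs >= lo) & (probs < hi)
        if mask.any():
            rows.append((probs[mask].mean(), labels[mask].mean(), int(mask.sum())))
    return rows

# Synthetic data, calibrated by construction:
# each label is 1 with probability exactly equal to the prediction.
rng = np.random.default_rng(0)
p = rng.uniform(0, 1, 50_000)
y = (rng.uniform(0, 1, 50_000) < p).astype(int)

for mean_pred, frac_pos, count in calibration_by_bins(p, y):
    assert abs(mean_pred - frac_pos) < 0.05  # bins agree on calibrated data
```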
Let's consolidate the probability rules derived from the axioms into a practical reference.
| Rule | Formula | When to Use |
|---|---|---|
| Complement | P(A^c) = 1 - P(A) | When 'not A' is easier to compute |
| Addition (disjoint) | P(A ∪ B) = P(A) + P(B) | Events cannot co-occur |
| Addition (general) | P(A ∪ B) = P(A) + P(B) - P(A ∩ B) | Events may overlap |
| Bounds | 0 ≤ P(A) ≤ 1 | Sanity check |
| Monotonicity | If A ⊆ B, then P(A) ≤ P(B) | A implies B |
| Union bound | P(∪ᵢ Aᵢ) ≤ Σᵢ P(Aᵢ) | Upper bound for any union |
A tremendously useful inequality:
P(A₁ ∪ A₂ ∪ ... ∪ Aₙ) ≤ P(A₁) + P(A₂) + ... + P(Aₙ)
For two events this follows from the general addition rule by dropping the subtracted intersection term; the general case, known as Boole's inequality, follows by induction.
Why it matters in ML:
Suppose you want multiple statistical tests to simultaneously succeed with high probability. If each test fails with probability at most δ and you run n tests, the union bound gives P(at least one failure) ≤ nδ. Choosing δ = α/n therefore ensures all n tests succeed with probability at least 1 − α—the Bonferroni correction.
This union bound argument appears constantly in generalization bounds (controlling error uniformly over a finite hypothesis class), PAC learning, multiple hypothesis testing, and high-probability guarantees for randomized algorithms.
```python
import numpy as np

# =============================================================
# Probability Rules Implementation
# =============================================================

def complement_probability(p_event: float) -> float:
    """P(A^c) = 1 - P(A)"""
    return 1 - p_event


def union_probability_disjoint(*probs: float) -> float:
    """
    P(A₁ ∪ A₂ ∪ ... ∪ Aₙ) for mutually exclusive events.
    Simply sums the probabilities.
    """
    return sum(probs)


def union_probability_two(p_a: float, p_b: float, p_intersection: float) -> float:
    """
    P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
    General addition rule for two events.
    """
    return p_a + p_b - p_intersection


def union_bound(*probs: float) -> float:
    """
    P(∪ᵢ Aᵢ) ≤ Σᵢ P(Aᵢ)
    Returns the union bound (upper bound on union probability).
    Actual probability may be lower due to overlaps.
    """
    bound = sum(probs)
    # Cap at 1 since probability cannot exceed 1
    return min(bound, 1.0)


def bonferroni_correction(desired_overall_prob: float, num_tests: int) -> float:
    """
    Compute per-test significance level to achieve overall level.
    If we want P(any Type I error) ≤ α with n tests,
    each test should use significance level α/n.
    """
    return desired_overall_prob / num_tests


# =============================================================
# Softmax: Converting logits to valid probabilities
# =============================================================

def softmax(logits: np.ndarray) -> np.ndarray:
    """
    Convert raw logits to a probability distribution satisfying the axioms.

    Uses a numerically stable implementation (subtracts the max).

    Parameters:
        logits: Array of shape (K,) or (batch_size, K)

    Returns:
        Probabilities satisfying:
        - All values in [0, 1] (non-negativity + bounds)
        - Sum to 1 along the last axis (normalization)
    """
    # Numerical stability: subtract max
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp_logits = np.exp(shifted)
    return exp_logits / np.sum(exp_logits, axis=-1, keepdims=True)


def sigmoid(z: np.ndarray) -> np.ndarray:
    """
    Binary classification probability: P(class=1) = sigmoid(z)
    Satisfies P(class=0) + P(class=1) = 1
    """
    return 1 / (1 + np.exp(-z))


def verify_probability_axioms(probs: np.ndarray, tol: float = 1e-6) -> dict:
    """
    Check if an array represents a valid probability distribution.
    Returns a dict with verification results for each axiom.
    """
    results = {
        "non_negative": np.all(probs >= -tol),
        "upper_bounded": np.all(probs <= 1 + tol),
        "normalized": np.abs(np.sum(probs) - 1) < tol,
    }
    results["valid"] = all(results.values())
    return results


if __name__ == "__main__":
    # Example 1: Complement rule
    p_rain = 0.7
    p_no_rain = complement_probability(p_rain)
    print(f"P(rain) = {p_rain}, P(no rain) = {p_no_rain}")

    # Example 2: Union of overlapping events
    p_a = 0.3
    p_b = 0.4
    p_a_and_b = 0.1
    p_a_or_b = union_probability_two(p_a, p_b, p_a_and_b)
    print(f"\nP(A ∪ B) = {p_a_or_b}")

    # Example 3: Union bound
    failure_probs = [0.05, 0.03, 0.02, 0.01]
    bound = union_bound(*failure_probs)
    print(f"\nUnion bound on P(any failure): {bound}")

    # Example 4: Softmax
    logits = np.array([2.0, 1.0, 0.5, 0.1])
    probs = softmax(logits)
    print(f"\nLogits: {logits}")
    print(f"Softmax probabilities: {probs}")
    print(f"Sum: {probs.sum()}")
    print(f"Axiom verification: {verify_probability_axioms(probs)}")

    # Example 5: Bonferroni correction
    n_tests = 20
    overall_alpha = 0.05
    per_test_alpha = bonferroni_correction(overall_alpha, n_tests)
    print(f"\nFor {n_tests} tests at α={overall_alpha}:")
    print(f"Per-test threshold: {per_test_alpha}")
```

We've established the rigorous mathematical foundation that underlies all probabilistic reasoning in machine learning.
What's Next:
With probability measures established, we can now study conditional probability—how to update our beliefs when we learn new information. This is the key to Bayesian reasoning, chain rules, and ultimately Bayes' theorem, which powers everything from spam filters to medical diagnosis to large language models.
You now understand probability at its deepest level—not as vague intuition about 'chances,' but as a precisely defined mathematical structure. Every formula you use in ML, from cross-entropy loss to Bayesian posteriors, traces back to these three axioms. This foundation makes you a more rigorous and confident practitioner.