Every machine learning model operates in a world riddled with uncertainty. When a spam classifier examines an email, it cannot definitively know the sender's intent—it can only assess the likelihood that the message is spam. When a recommendation system suggests a movie, it's making an educated guess about what you might enjoy. When a medical diagnosis system evaluates symptoms, it's reasoning about probable conditions.
Probability theory provides the mathematical framework for reasoning precisely about uncertain outcomes. Without it, machine learning would be reduced to rigid, brittle rules incapable of handling the inherent ambiguity of real-world data.
Before we can understand sophisticated probabilistic concepts like Bayesian inference, maximum likelihood estimation, or probabilistic graphical models, we must first master the foundational building blocks: sample spaces and events.
By the end of this page, you will understand how to formally define a random experiment, construct its sample space, identify and manipulate events using set operations, and recognize how these abstract concepts map directly to practical machine learning scenarios.
Probability theory begins with the concept of a random experiment (also called a stochastic experiment or chance experiment). A random experiment is any procedure that satisfies three essential criteria:

1. All possible outcomes are known in advance, even though we cannot say which one will occur.
2. The outcome of any single trial cannot be predicted with certainty.
3. The procedure can, at least in principle, be repeated under identical conditions.
These criteria may seem abstract, but they precisely capture the nature of uncertainty we encounter throughout machine learning.
A deterministic process always produces the same output for the same input (e.g., f(x) = 2x + 3). A stochastic process involves randomness, so identical conditions can yield different results. Most real-world phenomena—and virtually all interesting ML problems—are stochastic because they involve unmeasured variables, noise, or inherently random mechanisms.
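To make the contrast concrete, here is a minimal sketch (the function names are ours, invented for illustration): a deterministic function returns the same output on every call with the same input, while a stochastic one can return different values under identical conditions.

```python
import random

def deterministic(x: float) -> float:
    """Deterministic: the same input always produces the same output."""
    return 2 * x + 3

def stochastic(x: float) -> float:
    """Stochastic: the same input can yield different outputs,
    here because of additive Gaussian noise."""
    return 2 * x + 3 + random.gauss(0, 1)

print(deterministic(5.0), deterministic(5.0))  # identical values
print(stochastic(5.0), stochastic(5.0))        # (almost surely) different values
```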
The formalization of random experiments is not merely academic rigor. It provides a precise, shared vocabulary for describing uncertainty, unambiguous rules for calculating with it, and a foundation on which guarantees about system behavior can be stated and proven.
Without this formalization, discussions of probability devolve into hand-waving and intuition—fine for casual conversation, dangerous for building systems that must work correctly.
The sample space, denoted by Ω (the capital Greek letter omega; some textbooks write S), is the set of all possible outcomes of a random experiment. Every outcome appears exactly once in the sample space, and outcomes are mutually exclusive—exactly one of them occurs when the experiment is performed.
Formal Definition:
The sample space Ω of a random experiment is the set containing all possible outcomes ω of that experiment. When the experiment is performed, exactly one outcome ω ∈ Ω occurs.
The sample space is the starting point for all probabilistic analysis. Once we define Ω, we can then ask questions about events (subsets of Ω) and their probabilities.
| Experiment | Sample Space Ω | Type | Cardinality |
|---|---|---|---|
| Single coin flip | Ω = {H, T} | Finite, Discrete | \|Ω\| = 2 |
| Single die roll | Ω = {1, 2, 3, 4, 5, 6} | Finite, Discrete | \|Ω\| = 6 |
| Two coin flips | Ω = {HH, HT, TH, TT} | Finite, Discrete | \|Ω\| = 4 |
| Number of emails received in an hour | Ω = {0, 1, 2, 3, ...} = ℕ₀ | Countably Infinite, Discrete | \|Ω\| = ℵ₀ |
| Time until server failure | Ω = [0, ∞) | Uncountably Infinite, Continuous | \|Ω\| = 𝔠 |
| Location of a random point in a unit square | Ω = [0,1] × [0,1] | Uncountably Infinite, Continuous | \|Ω\| = 𝔠 |
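The discrete rows of this table translate directly into code. A small sketch (our own illustration) represents finite sample spaces as Python sets and infinite ones by membership tests:

```python
# Finite, discrete sample spaces can be written down explicitly as sets.
coin_flip = {'H', 'T'}
die_roll = {1, 2, 3, 4, 5, 6}
two_flips = {('H', 'H'), ('H', 'T'), ('T', 'H'), ('T', 'T')}

print(len(coin_flip), len(die_roll), len(two_flips))  # 2 6 4

# Infinite spaces such as ℕ₀ or [0, ∞) cannot be stored as explicit sets;
# we represent them by membership tests instead.
def in_email_count_space(n: int) -> bool:
    """ω ∈ Ω for the 'emails per hour' experiment: any non-negative integer."""
    return isinstance(n, int) and n >= 0

print(in_email_count_space(5), in_email_count_space(-1))  # True False
```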
Sample spaces fall into two fundamental categories:

Discrete Sample Spaces: contain a finite or countably infinite number of outcomes (coin flips, die rolls, email counts). The outcomes can be listed, even if the list never ends.

Continuous Sample Spaces: contain uncountably many outcomes, typically intervals of real numbers (waiting times, temperatures, spatial coordinates). The outcomes cannot be enumerated.
This distinction profoundly affects how we compute probabilities. For discrete spaces, we can assign probabilities to individual outcomes. For continuous spaces, individual outcomes typically have zero probability—we must instead work with intervals or regions.
In a continuous sample space, asking 'What is the probability that the temperature is exactly 23.4571...°C?' makes little sense—the answer is zero. Instead, we ask 'What is the probability that the temperature lies between 23°C and 24°C?' This is why probability density functions (PDFs) measure probability density per unit length, not probability of specific values.
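A quick simulation makes this concrete. The sketch below (ours; the temperature range and sample size are arbitrary assumptions) draws uniform "temperatures" and compares hitting an exact value against landing in an interval:

```python
import numpy as np

rng = np.random.default_rng(0)
temps = rng.uniform(20.0, 30.0, size=1_000_000)  # continuous outcomes in [20, 30)

exact_hits = np.sum(temps == 23.4571)                       # exact value: probability 0
interval_freq = np.mean((temps >= 23.0) & (temps < 24.0))   # interval: probability 1/10

print(f"Exact-value hits: {exact_hits}")           # 0 (almost surely)
print(f"Interval frequency: {interval_freq:.4f}")  # ≈ 0.1
```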
Constructing the correct sample space is a critical skill. An improperly defined sample space leads to flawed probability calculations and incorrect conclusions. The process requires careful thought about what exactly constitutes a distinct outcome.
Mutual Exclusivity: Outcomes must be distinct—if one outcome occurs, no other outcome can simultaneously occur
Exhaustiveness: The sample space must include all possible outcomes—nothing that can happen is missing
Appropriate Granularity: The level of detail must match the questions we want to answer
When in doubt, use the most detailed sample space possible—the one that distinguishes outcomes whenever they could reasonably be distinguished. You can always aggregate later, but you cannot disaggregate a coarse sample space. Starting with ordered pairs for dice lets us answer questions about sums, but starting with sums alone prevents us from answering questions about specific die values.
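A brief sketch of this point (our own illustration): from the ordered-pair space we can always derive the coarser sum counts, but the coarse space cannot recover the information it discarded.

```python
from collections import Counter

# Fine-grained sample space: ordered pairs of dice values
pairs = {(i, j) for i in range(1, 7) for j in range(1, 7)}

# Aggregating: each sum's count follows directly from the ordered pairs
sum_counts = Counter(i + j for (i, j) in pairs)
print(sum_counts[7])  # 6 ordered pairs sum to 7

# The coarse space {2, ..., 12} alone cannot tell us, for example, how
# often the first die shows 6 — that detail was discarded by aggregation.
first_die_six = {(i, j) for (i, j) in pairs if i == 6}
print(len(first_die_six))  # 6 — recoverable only from the fine-grained space
```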
When an experiment consists of multiple sub-experiments performed together or in sequence, we construct the overall sample space as the Cartesian product of the individual sample spaces.
If experiment A has sample space Ω_A and experiment B has sample space Ω_B, then the compound experiment has sample space:

Ω = Ω_A × Ω_B = {(a, b) : a ∈ Ω_A, b ∈ Ω_B}
This generalizes to any number of sub-experiments:
Ω = Ω₁ × Ω₂ × ... × Ωₙ
with |Ω| = |Ω₁| × |Ω₂| × ... × |Ωₙ| for finite sample spaces.
```python
import itertools
from typing import Set, Tuple, Any


def construct_cartesian_product(*sample_spaces: Set[Any]) -> Set[Tuple]:
    """
    Construct the Cartesian product of multiple sample spaces.

    This is the standard way to build sample spaces for compound experiments.

    Parameters:
        *sample_spaces: Variable number of sample space sets

    Returns:
        Set of tuples representing all combinations of outcomes

    Example:
        >>> coin = {'H', 'T'}
        >>> die = {1, 2, 3, 4, 5, 6}
        >>> compound_space = construct_cartesian_product(coin, die)
        >>> len(compound_space)  # 2 × 6 = 12
        12
    """
    return set(itertools.product(*sample_spaces))


def construct_dice_sample_space(num_dice: int, faces: int = 6) -> Set[Tuple[int, ...]]:
    """
    Construct the sample space for rolling multiple dice.

    Parameters:
        num_dice: Number of dice to roll
        faces: Number of faces per die (default: 6)

    Returns:
        Set of tuples, each representing one possible outcome

    Example:
        >>> space = construct_dice_sample_space(2)
        >>> len(space)  # 6^2 = 36
        36
        >>> (1, 6) in space
        True
    """
    single_die = set(range(1, faces + 1))
    return set(itertools.product(*[single_die] * num_dice))


def construct_binary_sequence_space(length: int) -> Set[Tuple[int, ...]]:
    """
    Construct sample space for binary sequences of given length.

    Useful for: coin flip sequences, binary classification outcomes,
    bit strings, presence/absence of features.

    Parameters:
        length: Length of the binary sequence

    Returns:
        Set of all possible binary tuples

    Example:
        >>> space = construct_binary_sequence_space(3)
        >>> len(space)  # 2^3 = 8
        8
        >>> (0, 1, 0) in space
        True
    """
    return set(itertools.product([0, 1], repeat=length))


# Demonstration
if __name__ == "__main__":
    # Example 1: Coin and Die
    coin = {'H', 'T'}
    die = {1, 2, 3, 4, 5, 6}
    space = construct_cartesian_product(coin, die)
    print(f"Coin + Die sample space size: {len(space)}")
    print(f"Sample outcomes: {list(space)[:5]}...")

    # Example 2: Three coin flips
    triple_coin = construct_binary_sequence_space(3)
    print(f"\nThree coin flips sample space size: {len(triple_coin)}")
    print(f"All outcomes: {sorted(triple_coin)}")

    # Example 3: Two dice
    two_dice = construct_dice_sample_space(2)
    print(f"\nTwo dice sample space size: {len(two_dice)}")

    # Count outcomes that sum to 7
    sum_seven = [outcome for outcome in two_dice if sum(outcome) == 7]
    print(f"Outcomes summing to 7: {sum_seven}")
    print(f"Count: {len(sum_seven)} out of {len(two_dice)}")
```

While the sample space contains all possible outcomes, we are rarely interested in a single specific outcome. Instead, we ask questions like:
- 'Did the roll show an even number?'
- 'Did the two dice sum to 7?'
- 'Did the model predict the correct class?'

These questions correspond to events—collections of outcomes that share some property of interest.
Formal Definition:
An event E is a subset of the sample space: E ⊆ Ω. An event E is said to occur if the outcome ω of the experiment is an element of E (i.e., ω ∈ E).
Every event is a set, and we use set notation and operations to work with events.
| Event Description | Event Set E | Event Occurs When... |
|---|---|---|
| Roll an even number | E = {2, 4, 6} | ω ∈ {2, 4, 6} |
| Roll less than 4 | E = {1, 2, 3} | ω ∈ {1, 2, 3} |
| Roll a prime number | E = {2, 3, 5} | ω ∈ {2, 3, 5} |
| Roll a 6 | E = {6} | ω = 6 |
| Roll anything | E = {1, 2, 3, 4, 5, 6} = Ω | Always (certain event) |
| Roll a 7 | E = ∅ (empty set) | Never (impossible event) |
Certain Event (Ω): The sample space itself is an event that always occurs. If Ω is the set of all possible outcomes, and exactly one outcome must occur, then the event Ω occurs with absolute certainty.
Impossible Event (∅): The empty set ∅ is an event that never occurs. Since ∅ contains no outcomes, and some outcome must occur, ∅ cannot occur.
Elementary Events (Singletons): An event containing exactly one outcome is called an elementary event or simple event. For a die roll, {3} is an elementary event. These are the atomic units from which all other events are built.
In ML, events often represent model behaviors or data properties:
- E = 'Model predicts class 1' — the set of all inputs x where f(x) = 1
- E = 'Training converges' — the set of all random initializations leading to convergence
- E = 'User clicks' — the set of all (user, context, item) tuples resulting in clicks
Thinking of ML behaviors as events allows us to apply the full power of probability theory.
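As a toy illustration (the model f and the input set are invented for this sketch, not a standard API), the event 'model predicts class 1' is literally a subset of the input space:

```python
# Toy input space: integer feature values 0..9 (a stand-in for real inputs)
inputs = set(range(10))

def f(x: int) -> int:
    """A toy 'model': predicts class 1 when the feature exceeds a threshold."""
    return 1 if x >= 7 else 0

# The event 'model predicts class 1' is the subset of inputs where f(x) == 1
E_predict_1 = {x for x in inputs if f(x) == 1}
print(E_predict_1)  # {8, 9, 7}

# Its complement is the event 'model predicts class 0'
E_predict_0 = inputs - E_predict_1
print(E_predict_0)  # {0, 1, 2, 3, 4, 5, 6}
```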
Since events are sets, we can combine them using standard set operations. These operations allow us to express complex questions about outcomes in terms of simpler events.
| Operation | Notation | Meaning | Event Occurs When... |
|---|---|---|---|
| Complement | E^c or Ē or E' | NOT E | E does not occur (ω ∉ E) |
| Union | E₁ ∪ E₂ | E₁ OR E₂ (or both) | At least one of E₁, E₂ occurs |
| Intersection | E₁ ∩ E₂ | E₁ AND E₂ | Both E₁ and E₂ occur |
| Difference | E₁ \ E₂ or E₁ - E₂ | E₁ but NOT E₂ | E₁ occurs but E₂ does not |
| Symmetric Difference | E₁ △ E₂ | Exactly one of E₁, E₂ | E₁ or E₂ but not both |
Two events A and B are mutually exclusive (or disjoint) if they cannot both occur simultaneously:
A ∩ B = ∅
If A occurs, B definitely does not, and vice versa.
Examples:

- A single die roll showing 3 and showing 4 (one roll yields exactly one face)
- An email being spam and not being spam
- A hard classifier assigning the same input to class 1 and to class 2
A collection of events {E₁, E₂, ..., Eₙ} is exhaustive if their union covers the entire sample space:
E₁ ∪ E₂ ∪ ... ∪ Eₙ = Ω
At least one of the events must occur.
If events are both mutually exclusive AND exhaustive, they form a partition of Ω. Every outcome belongs to exactly one event. This is a crucial concept for decomposing probabilities.
```python
from typing import Set, Tuple

# Define sample space for two dice
DICE_SAMPLE_SPACE: Set[Tuple[int, int]] = {
    (i, j) for i in range(1, 7) for j in range(1, 7)
}


def complement(event: Set, sample_space: Set) -> Set:
    """Return the complement of an event: E^c = Ω \\ E"""
    return sample_space - event


def union(event_a: Set, event_b: Set) -> Set:
    """Return union of two events: A ∪ B"""
    return event_a | event_b


def intersection(event_a: Set, event_b: Set) -> Set:
    """Return intersection of two events: A ∩ B"""
    return event_a & event_b


def difference(event_a: Set, event_b: Set) -> Set:
    """Return set difference: A \\ B (A but not B)"""
    return event_a - event_b


def symmetric_difference(event_a: Set, event_b: Set) -> Set:
    """Return symmetric difference: A △ B (exactly one of A or B)"""
    return event_a ^ event_b


def are_mutually_exclusive(event_a: Set, event_b: Set) -> bool:
    """Check if two events are mutually exclusive (disjoint): A ∩ B = ∅"""
    return len(intersection(event_a, event_b)) == 0


def are_exhaustive(events: list, sample_space: Set) -> bool:
    """Check if events cover the entire sample space: ∪ Eᵢ = Ω"""
    union_of_events = set()
    for event in events:
        union_of_events = union_of_events | event
    return union_of_events == sample_space


def is_partition(events: list, sample_space: Set) -> bool:
    """Check if events form a partition (mutually exclusive and exhaustive)"""
    # Check exhaustive
    if not are_exhaustive(events, sample_space):
        return False
    # Check pairwise mutually exclusive
    for i, event_i in enumerate(events):
        for j, event_j in enumerate(events):
            if i < j and not are_mutually_exclusive(event_i, event_j):
                return False
    return True


# Define some interesting events for two dice
def first_die_even(sample_space: Set[Tuple[int, int]]) -> Set[Tuple[int, int]]:
    """Event: First die shows an even number"""
    return {(i, j) for (i, j) in sample_space if i % 2 == 0}


def sum_equals(sample_space: Set[Tuple[int, int]], target: int) -> Set[Tuple[int, int]]:
    """Event: Sum of dice equals target"""
    return {(i, j) for (i, j) in sample_space if i + j == target}


def doubles(sample_space: Set[Tuple[int, int]]) -> Set[Tuple[int, int]]:
    """Event: Both dice show the same value"""
    return {(i, j) for (i, j) in sample_space if i == j}


if __name__ == "__main__":
    Ω = DICE_SAMPLE_SPACE

    # Define events
    A = first_die_even(Ω)
    B = sum_equals(Ω, 7)
    C = doubles(Ω)

    print(f"Sample space size: |Ω| = {len(Ω)}")
    print(f"|A| (first die even): {len(A)}")
    print(f"|B| (sum equals 7): {len(B)}")
    print(f"|C| (doubles): {len(C)}")

    print(f"\n|A ∩ B|: {len(intersection(A, B))}")
    print(f"A ∩ B = {intersection(A, B)}")

    print(f"\n|A ∪ C|: {len(union(A, C))}")

    print(f"\nAre B and C mutually exclusive? {are_mutually_exclusive(B, C)}")
    print(f"B ∩ C = {intersection(B, C)}")

    # Check partition properties
    E1 = first_die_even(Ω)
    E2 = complement(E1, Ω)  # First die odd
    print(f"\nIs {{E1, E2}} a partition? {is_partition([E1, E2], Ω)}")
```

De Morgan's Laws are among the most important identities in set theory and probability. They describe the relationship between unions, intersections, and complements.
De Morgan's Laws:
(A ∪ B)^c = A^c ∩ B^c
The complement of a union is the intersection of the complements.
(A ∩ B)^c = A^c ∪ B^c
The complement of an intersection is the union of the complements.
These laws generalize to any finite or countable collection of events:
(∪ᵢ Aᵢ)^c = ∩ᵢ Aᵢ^c
(∩ᵢ Aᵢ)^c = ∪ᵢ Aᵢ^c
First Law: 'NOT (A or B)' means 'NOT A and NOT B'—if neither A nor B happened, then both failed to happen.
Second Law: 'NOT (A and B)' means 'NOT A or NOT B'—if it's false that both happened, then at least one didn't happen.
Think of it as: when you take a complement, unions and intersections swap roles.
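Because events on a finite sample space are just sets, both laws can be checked mechanically. A minimal sketch (our own, reusing the die-roll events from earlier):

```python
omega = set(range(1, 7))  # die-roll sample space
A = {2, 4, 6}             # roll an even number
B = {1, 2, 3}             # roll less than 4

def comp(E, sample_space=omega):
    """Complement relative to the sample space: E^c = Ω \\ E."""
    return sample_space - E

# First law: (A ∪ B)^c == A^c ∩ B^c
assert comp(A | B) == comp(A) & comp(B)

# Second law: (A ∩ B)^c == A^c ∪ B^c
assert comp(A & B) == comp(A) | comp(B)

print("De Morgan's laws hold on this sample space.")
```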
De Morgan's Laws allow us to transform complex probability calculations:
P(A ∪ B ∪ C) — The probability that at least one event occurs
Can be computed as:
P(A ∪ B ∪ C) = 1 - P((A ∪ B ∪ C)^c) = 1 - P(A^c ∩ B^c ∩ C^c)
If A, B, C are independent, this becomes much easier to calculate:
= 1 - P(A^c)P(B^c)P(C^c)
This technique—computing the probability of 'at least one' by subtracting the probability of 'none'—is ubiquitous in probability and statistics.
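A small numeric sketch of the technique (our own example; the three probabilities are made up for illustration): for independent events, 'at least one' falls out of the complements, and exhaustive enumeration confirms it.

```python
from itertools import product

# Independent events with assumed (illustrative) probabilities
p_a, p_b, p_c = 0.5, 0.3, 0.2

# P(at least one) = 1 − P(none) = 1 − P(A^c)P(B^c)P(C^c)
p_at_least_one = 1 - (1 - p_a) * (1 - p_b) * (1 - p_c)
print(f"{p_at_least_one:.3f}")  # 0.720

# Sanity check: sum the probabilities of all joint outcomes with ≥ 1 occurrence
total = 0.0
for a, b, c in product([0, 1], repeat=3):
    prob = ((p_a if a else 1 - p_a)
            * (p_b if b else 1 - p_b)
            * (p_c if c else 1 - p_c))
    if a or b or c:
        total += prob
assert abs(total - p_at_least_one) < 1e-12
```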
For finite sample spaces, we typically consider all possible subsets as valid events. But for infinite sample spaces—especially continuous ones—this approach creates mathematical paradoxes. The solution is to carefully specify which subsets qualify as events.
Definition: σ-algebra (Sigma-algebra)
A σ-algebra (or σ-field) on Ω is a collection ℱ of subsets of Ω satisfying:

1. Ω ∈ ℱ (the sample space itself is an event)
2. If E ∈ ℱ, then E^c ∈ ℱ (closed under complementation)
3. If E₁, E₂, ... ∈ ℱ, then ∪ᵢ Eᵢ ∈ ℱ (closed under countable unions)
The pair (Ω, ℱ) is called a measurable space, and elements of ℱ are called measurable sets or events.
For continuous sample spaces like [0,1], the power set (all subsets) is 'too big'—it contains pathological sets for which probability cannot be consistently defined. The Vitali set, for example, leads to paradoxes if we try to assign it a probability. σ-algebras exclude these problematic sets while keeping all 'reasonable' events.
Trivial σ-algebra: ℱ = {∅, Ω}
Power set: ℱ = 2^Ω (all subsets)
Borel σ-algebra on ℝ: ℬ(ℝ), the σ-algebra generated by the open intervals; it is large enough to contain every subset of ℝ that arises in practice while excluding pathological sets.
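For a finite Ω, σ-algebras can even be enumerated explicitly. The sketch below (our own illustration, not from the text) builds the σ-algebra generated by a partition; for a partition, the generated σ-algebra is exactly the collection of all unions of its blocks.

```python
from itertools import combinations

def sigma_algebra_from_partition(blocks):
    """For a finite partition, the generated σ-algebra consists of
    all unions of blocks (including ∅ and Ω)."""
    events = set()
    for r in range(len(blocks) + 1):
        for combo in combinations(blocks, r):
            # union() with no arguments yields ∅; otherwise the union of blocks
            events.add(frozenset().union(*combo))
    return events

# Partition of a die-roll sample space into 'low', 'middle', 'high'
blocks = [frozenset({1, 2}), frozenset({3, 4}), frozenset({5, 6})]
F = sigma_algebra_from_partition(blocks)
print(len(F))  # 2^3 = 8 events: every union of blocks
```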
A complete probability model is specified by the probability triple (Ω, ℱ, P):

- Ω: the sample space of all possible outcomes
- ℱ: a σ-algebra of subsets of Ω, the events to which probabilities are assigned
- P: a probability measure assigning each event E ∈ ℱ a number P(E) ∈ [0, 1], with P(Ω) = 1
This rigorous foundation, developed by Andrey Kolmogorov in 1933, resolved centuries of ambiguity about probability and enables the advanced probabilistic machinery used in modern machine learning.
While σ-algebras are essential for measure-theoretic probability, most ML practitioners can work effectively without deep measure theory knowledge. The key takeaway is that not every subset of a continuous sample space is a valid event, and mathematicians have developed rigorous frameworks (σ-algebras, Lebesgue measure) to handle this subtlety. These frameworks justify the probability rules we use daily.
We've established the foundational concepts of probability theory. Let's consolidate the key ideas and connect them explicitly to machine learning.
| ML Application | Sample Space | Key Events | Why It Matters |
|---|---|---|---|
| Binary Classification | Ω = {0, 1} per sample | E = 'Positive class' | Defines precision, recall, accuracy |
| Multiclass Classification | Ω = {1, 2, ..., K} | Eₖ = 'Predict class k' | Partition structure enables multi-class metrics |
| Regression | Ω = ℝ (continuous) | E = 'Prediction within ε of true value' | Continuous Ω requires density functions |
| Sequence Modeling | Ω = Σⁿ (all sequences) | E = 'Sequence starts with pattern' | Exponentially large sample spaces |
| Reinforcement Learning | Ω = All possible trajectories | E = 'Reach goal state' | Stochastic policies sample from Ω |
| Generative Models (VAE, GAN) | Ω = All possible outputs | E = 'Generated sample is realistic' | Learning to sample from complex Ω |
What's Next:
With sample spaces and events established, we're ready to assign probabilities to events. The next page covers the Axioms of Probability—the minimal set of rules that any valid probability measure must satisfy. These axioms, formulated by Kolmogorov, provide the logical foundation for all probabilistic reasoning.
You now understand how to define and work with sample spaces and events—the vocabulary of probability theory. You can construct sample spaces for various experiments, define events as subsets, combine events using set operations, and recognize the mathematical structures (σ-algebras) that make continuous probability rigorous. Next, we'll see how to assign numbers—probabilities—to these events.