The power of probabilistic reasoning lies not in static probability assignments, but in our ability to update beliefs when new information arrives. This is precisely what machine learning models do: they take in data (evidence) and adjust their predictions accordingly.
Consider a few familiar scenarios: a medical test comes back positive, a user clicks on a recommended product, a language model has produced the first few words of a sentence. Each scenario involves the same fundamental question: given that event B has occurred, what is the probability of event A? This is conditional probability—the mathematical language for reasoning under partial information.
By the end of this page, you will master the definition and intuition of conditional probability, apply the multiplication rule to compute joint probabilities, understand the chain rule for sequences of events, and recognize how conditional probability enables ML models to make informed predictions.
Let A and B be events in a probability space, with P(B) > 0. The conditional probability of A given B, denoted P(A|B) (read 'probability of A given B'), is defined as:

P(A|B) = P(A ∩ B) / P(B)
This formula captures a simple intuition: once we know B has occurred, we restrict our attention to outcomes within B. The conditional probability P(A|B) asks: of all outcomes in B, what fraction also lie in A?
Imagine the sample space Ω as the entire area of a rectangle. Event B is some region within this rectangle. When we condition on B, we're effectively saying: 'The outcome is somewhere in B. Given that, what's the chance it's also in A?'
The answer is the fraction of B's area that overlaps with A:
P(A|B) = Area(A ∩ B) / Area(B)
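To make the area picture concrete, here is a minimal numeric check using a fair six-sided die (the events A and B below are my own illustration):

```python
from fractions import Fraction

# Sample space for a fair six-sided die
omega = {1, 2, 3, 4, 5, 6}
A = {4, 5, 6}   # "greater than 3"
B = {2, 4, 6}   # "even"

def prob(event, space):
    """Uniform probability: |event| / |space|."""
    return Fraction(len(event), len(space))

# P(A|B) = P(A ∩ B) / P(B) -- the fraction of B that overlaps A
p_a_given_b = prob(A & B, omega) / prob(B, omega)
print(p_a_given_b)  # 2/3: of the evens {2, 4, 6}, two of them ({4, 6}) exceed 3
```

Using `Fraction` keeps the arithmetic exact, which makes the "fraction of B's area" intuition visible in the result.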
Conditional probability P(A|B) is only defined when P(B) > 0. Conditioning on an impossible event (P(B) = 0) makes no sense—we cannot learn anything from observing something that cannot happen. In practice, this means we should never condition on events we haven't actually observed or that have measure zero.
Conditional probability P(·|B) is itself a valid probability measure on the reduced sample space B. It satisfies all of Kolmogorov's axioms:
1. Non-negativity: P(A|B) ≥ 0 for all events A
2. Normalization: P(B|B) = P(B ∩ B) / P(B) = P(B) / P(B) = 1
3. Countable Additivity: For disjoint events A₁, A₂, ...: P(∪ᵢ Aᵢ | B) = Σᵢ P(Aᵢ | B)
Because P(·|B) is a valid probability measure, all theorems derived from the axioms also hold for conditional probabilities.
When we condition on B, we're effectively working in a new probability space where B is the new sample space and probabilities are rescaled by 1/P(B). All probability rules apply, just within this restricted universe. This perspective is powerful: it means we can apply everything we know about probability to conditional settings.
Case 1: A and B are disjoint (A ∩ B = ∅)
P(A|B) = P(∅) / P(B) = 0 / P(B) = 0
If A and B cannot co-occur, then given B happened, A definitely did not.
Case 2: B ⊆ A (B implies A)
P(A|B) = P(B) / P(B) = 1
If every outcome in B is also in A, then knowing B tells us A occurred with certainty.
Case 3: A ⊆ B (A implies B)
P(A|B) = P(A) / P(B) ≥ P(A)
Since P(B) ≤ 1, dividing by P(B) increases (or maintains) the probability.
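These three special cases can be verified directly; a small sketch, again using a fair die with illustrative events of my own choosing:

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}  # fair die

def p(event):
    return Fraction(len(event), len(omega))

def p_given(a, b):
    """P(A|B) = P(A ∩ B) / P(B)."""
    return p(a & b) / p(b)

# Case 1: disjoint events -- knowing B occurred rules A out entirely
assert p_given({1, 2}, {3, 4}) == 0

# Case 2: B ⊆ A -- B guarantees A
assert p_given({1, 2, 3, 4}, {1, 2}) == 1

# Case 3: A ⊆ B -- conditioning on B can only raise (or keep) A's probability
A, B = {2}, {2, 4, 6}
assert p_given(A, B) == Fraction(1, 3)
assert p_given(A, B) >= p(A)  # 1/3 ≥ 1/6
```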
Rearranging the definition of conditional probability yields the multiplication rule—a formula for computing joint probabilities:

P(A ∩ B) = P(B) · P(A|B) = P(A) · P(B|A)
Both forms are equivalent and immensely useful. The multiplication rule says: the probability that both A and B occur equals the probability that B occurs, times the probability that A occurs given B.
Think of two sequential selections: first, does B occur (probability P(B))? Then, given that B occurred, does A occur (probability P(A|B))?
The probability of both happening is the product of these sequential probabilities.
Consider a recommendation system predicting whether a user will view a product (V), add it to their cart (C), and purchase it (P). The probability of a complete purchase journey:
P(V ∩ C ∩ P) = P(V) · P(C|V) · P(P|V ∩ C)
This factorization mirrors how the user actually behaves: they first view (probability P(V)), then given they viewed, they add to cart (probability P(C|V)), then given they viewed and added to cart, they purchase (probability P(P|V ∩ C)).
The model can estimate each conditional probability separately and multiply them for the full journey probability.
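A minimal sketch of this factorization, with made-up funnel probabilities (the numbers below are illustrative, not from real data):

```python
# Hypothetical funnel probabilities (illustrative values only)
p_view = 0.30                  # P(V): user views the product
p_cart_given_view = 0.20       # P(C|V): adds to cart, given a view
p_buy_given_view_cart = 0.50   # P(P|V ∩ C): purchases, given view and cart

# Multiplication rule applied twice:
# P(V ∩ C ∩ P) = P(V) · P(C|V) · P(P|V ∩ C)
p_full_journey = p_view * p_cart_given_view * p_buy_given_view_cart
print(f"P(view ∩ cart ∩ purchase) = {p_full_journey:.3f}")  # 0.030
```

Each factor can be estimated from a different slice of the data (all users, viewers only, cart-adders only), which is exactly why the factorized form is convenient.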
The multiplication rule extends to any number of events through the chain rule (also called the general product rule):

P(A₁ ∩ A₂ ∩ ... ∩ Aₙ) = P(A₁) · P(A₂|A₁) · P(A₃|A₁ ∩ A₂) · ... · P(Aₙ|A₁ ∩ ... ∩ Aₙ₋₁)
Each term conditions on all preceding events. This rule is fundamental to sequence modeling in machine learning.
Modern language models like GPT use the chain rule to model text probability:
P(w₁, w₂, ..., wₙ) = P(w₁) · P(w₂|w₁) · P(w₃|w₁,w₂) · ... · P(wₙ|w₁,...,wₙ₋₁)
Each word's probability is conditioned on all previous words. The model learns these conditional distributions from massive text corpora. This is exactly the chain rule applied to sequences.
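In practice, multiplying many small conditionals underflows floating point, which is why implementations sum log-probabilities instead. A small sketch with hypothetical per-token conditionals:

```python
import math

# Hypothetical per-token conditionals P(wᵢ | w₁...wᵢ₋₁) for a short sentence
token_conditionals = [0.1, 0.05, 0.2, 0.08]

# Direct product (fine here, but underflows for long sequences)
p_direct = math.prod(token_conditionals)

# Log-space chain rule: log P(w₁...wₙ) = Σᵢ log P(wᵢ | w₁...wᵢ₋₁)
log_p = sum(math.log(p) for p in token_conditionals)

# Both routes agree; the log-space sum is numerically stable
assert math.isclose(p_direct, math.exp(log_p))
print(f"P(sentence) = {p_direct:.2e}, log P = {log_p:.3f}")
```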
By induction:
Base case (n=2): P(A₁ ∩ A₂) = P(A₁) · P(A₂|A₁) ✓ (the multiplication rule)
Inductive step: Assume the chain rule holds for n-1 events. Then:
P(A₁ ∩ ... ∩ Aₙ)
= P((A₁ ∩ ... ∩ Aₙ₋₁) ∩ Aₙ)
= P(A₁ ∩ ... ∩ Aₙ₋₁) · P(Aₙ | A₁ ∩ ... ∩ Aₙ₋₁)   [multiplication rule]
= [P(A₁) · P(A₂|A₁) · ... · P(Aₙ₋₁|A₁ ∩ ... ∩ Aₙ₋₂)] · P(Aₙ | A₁ ∩ ... ∩ Aₙ₋₁)   [inductive hypothesis]
This matches the chain rule for n events. ∎
```python
import numpy as np
from typing import List, Tuple, Dict

# =============================================================
# Conditional Probability Fundamentals
# =============================================================

def conditional_probability(
    p_intersection: float,
    p_condition: float
) -> float:
    """
    Compute P(A|B) = P(A ∩ B) / P(B)

    Parameters:
        p_intersection: P(A ∩ B), probability of both events
        p_condition: P(B), probability of conditioning event

    Returns:
        P(A|B), conditional probability

    Raises:
        ValueError: If P(B) = 0
    """
    if p_condition == 0:
        raise ValueError("Cannot condition on event with probability 0")
    return p_intersection / p_condition


def joint_from_conditional(
    p_condition: float,
    p_given_condition: float
) -> float:
    """
    Multiplication rule: P(A ∩ B) = P(B) · P(A|B)

    Parameters:
        p_condition: P(B)
        p_given_condition: P(A|B)

    Returns:
        P(A ∩ B)
    """
    return p_condition * p_given_condition


def chain_rule(conditional_probs: List[float]) -> float:
    """
    Apply chain rule to compute joint probability.

    P(A₁ ∩ A₂ ∩ ... ∩ Aₙ) = P(A₁) · P(A₂|A₁) · P(A₃|A₁,A₂) · ...

    Parameters:
        conditional_probs: List [P(A₁), P(A₂|A₁), P(A₃|A₁,A₂), ...]

    Returns:
        Joint probability of all events
    """
    result = 1.0
    for p in conditional_probs:
        result *= p
    return result


# =============================================================
# Conditional Probability from Samples (Empirical)
# =============================================================

def empirical_conditional(
    joint_count: int,
    condition_count: int
) -> float:
    """
    Compute empirical conditional probability from counts.

    P̂(A|B) = Count(A ∩ B) / Count(B)

    Parameters:
        joint_count: Number of samples where both A and B occurred
        condition_count: Number of samples where B occurred

    Returns:
        Estimated conditional probability
    """
    if condition_count == 0:
        raise ValueError("No samples satisfy the condition")
    return joint_count / condition_count


def compute_conditional_from_data(
    data: List[Dict],
    target_key: str,
    target_value,
    condition_key: str,
    condition_value
) -> Tuple[float, int, int]:
    """
    Compute conditional probability from a dataset.

    Parameters:
        data: List of dictionaries (samples)
        target_key, target_value: Define event A
        condition_key, condition_value: Define event B

    Returns:
        (P(A|B), count of A∩B, count of B)
    """
    condition_samples = [d for d in data if d.get(condition_key) == condition_value]
    joint_samples = [d for d in condition_samples if d.get(target_key) == target_value]
    n_condition = len(condition_samples)
    n_joint = len(joint_samples)
    if n_condition == 0:
        return 0.0, 0, 0
    return n_joint / n_condition, n_joint, n_condition


# =============================================================
# Language Model Chain Rule Example
# =============================================================

class SimpleLanguageModel:
    """
    Simple n-gram language model demonstrating the chain rule.

    Models P(w₁, w₂, ..., wₙ) = ∏ᵢ P(wᵢ | w₁, ..., wᵢ₋₁)

    For simplicity, uses bigrams: P(wᵢ | wᵢ₋₁) instead of full history.
    """

    def __init__(self):
        self.unigram_counts: Dict[str, int] = {}
        self.bigram_counts: Dict[Tuple[str, str], int] = {}
        self.total_words = 0

    def train(self, sentences: List[List[str]]):
        """Train on tokenized sentences."""
        for sentence in sentences:
            # Add start token
            tokens = ["<START>"] + sentence
            for i, word in enumerate(tokens):
                # Unigram counts
                self.unigram_counts[word] = self.unigram_counts.get(word, 0) + 1
                self.total_words += 1
                # Bigram counts
                if i > 0:
                    bigram = (tokens[i-1], word)
                    self.bigram_counts[bigram] = self.bigram_counts.get(bigram, 0) + 1

    def p_word_given_previous(self, word: str, previous: str) -> float:
        """
        Compute P(word | previous) using MLE with Laplace smoothing.
        """
        bigram_count = self.bigram_counts.get((previous, word), 0)
        previous_count = self.unigram_counts.get(previous, 0)
        vocab_size = len(self.unigram_counts)
        # Laplace smoothing to avoid zero probabilities
        return (bigram_count + 1) / (previous_count + vocab_size)

    def sentence_probability(self, sentence: List[str]) -> float:
        """
        Compute P(w₁, ..., wₙ) using the chain rule.

        P(sentence) = P(w₁|<START>) · P(w₂|w₁) · ... · P(wₙ|wₙ₋₁)
        """
        tokens = ["<START>"] + sentence
        log_prob = 0.0  # Use log to avoid underflow
        for i in range(1, len(tokens)):
            p = self.p_word_given_previous(tokens[i], tokens[i-1])
            log_prob += np.log(p)
        return np.exp(log_prob)


if __name__ == "__main__":
    # Example 1: Die roll conditional probability
    print("=" * 60)
    print("Example 1: Die Roll")
    print("=" * 60)
    # P(greater than 3 | even)
    p_intersection = 2/6  # P({4,6})
    p_condition = 3/6     # P({2,4,6})
    p_conditional = conditional_probability(p_intersection, p_condition)
    print(f"P(>3 | even) = {p_conditional:.4f}")

    # Example 2: Chain rule - card drawing
    print("\n" + "=" * 60)
    print("Example 2: Drawing Two Aces (Chain Rule)")
    print("=" * 60)
    probs = [4/52, 3/51]  # P(A₁), P(A₂|A₁)
    p_two_aces = chain_rule(probs)
    print(f"P(two aces) = {p_two_aces:.6f} = 1/{round(1/p_two_aces)}")

    # Example 3: Empirical conditional from data
    print("\n" + "=" * 60)
    print("Example 3: Empirical Conditional (Click Data)")
    print("=" * 60)
    # Simulated click data
    data = [
        {"page_view": True, "clicked": True},
        {"page_view": True, "clicked": False},
        {"page_view": True, "clicked": True},
        {"page_view": True, "clicked": False},
        {"page_view": True, "clicked": False},
        {"page_view": False, "clicked": False},
    ]
    p_click_given_view, n_joint, n_view = compute_conditional_from_data(
        data, "clicked", True, "page_view", True
    )
    print(f"P(clicked | page_view) = {n_joint}/{n_view} = {p_click_given_view:.2f}")

    # Example 4: Language model
    print("\n" + "=" * 60)
    print("Example 4: Language Model (Chain Rule)")
    print("=" * 60)
    lm = SimpleLanguageModel()
    training_data = [
        ["the", "cat", "sat"],
        ["the", "dog", "ran"],
        ["the", "cat", "ran"],
        ["a", "cat", "sat"],
    ]
    lm.train(training_data)
    test_sentence = ["the", "cat", "sat"]
    prob = lm.sentence_probability(test_sentence)
    print(f"P('{' '.join(test_sentence)}') = {prob:.6f}")
    # Show chain rule decomposition
    tokens = ["<START>"] + test_sentence
    print("Chain rule decomposition:")
    for i in range(1, len(tokens)):
        p = lm.p_word_given_previous(tokens[i], tokens[i-1])
        print(f"  P({tokens[i]} | {tokens[i-1]}) = {p:.4f}")
```

Conditional probability is deceptively subtle, and even experienced practitioners make errors. Let's examine common pitfalls.
P(A|B) and P(B|A) are generally not equal! This error, called the confusion of the inverse, leads to serious reasoning failures.
When computing P(A|B), we must account for the base rate—the unconditional probability P(A).
Example: A medical test is 99% accurate (detects disease 99% of the time if present, correctly clears 99% if absent). A patient tests positive. What's P(disease | positive)?
If the disease affects only 1 in 10,000 people, P(disease | positive) ≈ 1% despite the 'highly accurate' test! Most positives are false positives because the disease is so rare.
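Working the numbers from this example through the law of total probability and Bayes' theorem confirms the surprising result:

```python
# Base-rate computation for the medical test example
p_disease = 1 / 10_000   # prior: 1 in 10,000 people have the disease
sensitivity = 0.99       # P(positive | disease)
specificity = 0.99       # P(negative | no disease)

# Law of total probability: P(positive) over the disease / no-disease partition
p_positive = (sensitivity * p_disease
              + (1 - specificity) * (1 - p_disease))

# Bayes' theorem: P(disease | positive)
p_disease_given_positive = sensitivity * p_disease / p_positive
print(f"P(disease | positive) = {p_disease_given_positive:.4f}")  # ≈ 0.0098
```

Despite the 99% accurate test, the posterior is about 1%: the rare true positives are swamped by the 1% false-positive rate applied to the vast healthy majority.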
After observing multiple pieces of evidence, we must update sequentially and correctly:
Wrong: P(A | B ∩ C) = P(A|B) × P(A|C) ❌
Correct: Use Bayes' theorem and proper conditioning.
P(A|B) = P(A) only if A and B are independent. In most real scenarios, conditioning changes probabilities. Never assume independence without justification.
In ML classification, accuracy on rare classes can be misleading. A model that predicts 'not fraud' for everything achieves 99.9% accuracy if only 0.1% of transactions are fraudulent—but it's useless for detecting fraud. Conditional thinking (P(fraud | prediction)) matters more than overall accuracy.
The Law of Total Probability allows us to compute P(A) by breaking it down across a partition of the sample space.
Since the Bᵢ partition Ω, event A can be decomposed:
A = (A ∩ B₁) ∪ (A ∩ B₂) ∪ ... ∪ (A ∩ Bₙ)
These pieces are disjoint, so:
P(A) = Σᵢ P(A ∩ Bᵢ) = Σᵢ P(A|Bᵢ) · P(Bᵢ)
The law of total probability is the foundation of mixture models:
P(x) = Σₖ P(x | component k) · P(component k) = Σₖ πₖ · p(x | θₖ)
where πₖ is the mixing weight for component k and p(x | θₖ) is the component distribution.
Examples include Gaussian mixture models, where each component p(x | θₖ) is a Gaussian density, and mixture-of-experts architectures, where a gating network supplies the mixing weights.
When facing a complex probability calculation, ask: 'Can I partition the space into simpler cases?' If P(A) is hard to compute directly but P(A|Bᵢ) is easy for each piece of a partition, use total probability. This divide-and-conquer strategy is ubiquitous in probability.
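A small sketch of this divide-and-conquer strategy, using a hypothetical factory whose three machines partition production (the rates below are made up):

```python
# Hypothetical example: overall defect rate across three machines.
# The machines partition production: every item comes from exactly one machine.
p_machine = [0.50, 0.30, 0.20]       # P(Bᵢ): each machine's share of output
p_defect_given = [0.01, 0.02, 0.05]  # P(A|Bᵢ): defect rate per machine

# Law of total probability: P(A) = Σᵢ P(A|Bᵢ) · P(Bᵢ)
p_defect = sum(pa * pb for pa, pb in zip(p_defect_given, p_machine))
print(f"P(defect) = {p_defect:.3f}")  # 0.021
```

P(defect) is hard to state directly but trivial once split by machine: each easy conditional is weighted by how often its case occurs.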
Conditional probability isn't just a theoretical concept—it's the operational definition of what ML models compute.
A classifier learns P(Y | X), the conditional distribution of labels given features.
Every prediction is fundamentally a conditional probability.
Discriminative models learn P(Y | X) directly. Generative models learn P(X | Y) and P(Y), then use Bayes' theorem:
P(Y | X) = P(X | Y) · P(Y) / P(X)
Cross-entropy loss for classification:
L = -log P(Y_true | X)
Minimizing cross-entropy is equivalent to maximizing the conditional log-likelihood.
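A minimal sketch of this equivalence, using an illustrative predicted distribution over three classes:

```python
import math

# A classifier's predicted conditional distribution P(Y | X) over 3 classes
# (illustrative values, e.g. the output of a softmax layer)
p_y_given_x = [0.7, 0.2, 0.1]
true_class = 0

# Cross-entropy loss is the negative conditional log-likelihood of the label
loss = -math.log(p_y_given_x[true_class])
print(f"L = -log P(Y_true | X) = {loss:.4f}")  # -log 0.7 ≈ 0.3567
```

Driving the loss toward 0 forces P(Y_true | X) toward 1, so minimizing cross-entropy and maximizing the conditional log-likelihood are the same optimization.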
| Model | What It Learns | Conditional Form |
|---|---|---|
| Logistic Regression | Binary classification | P(Y=1 \| X) = σ(w^T X) |
| Softmax Classifier | Multi-class classification | P(Y=k \| X) = softmaxₖ(Wx) |
| Language Model (GPT) | Next token prediction | P(wₜ \| w₁...wₜ₋₁) |
| Conditional VAE | Conditional generation | P(X \| Z, Y) |
| Conditional GAN | Conditional generation | P(X \| Z, Y) |
| Sequence-to-Sequence | Sequence transduction | P(Y₁...Yₘ \| X₁...Xₙ) |
Nearly every ML prediction is a conditional probability: P(output | input). Whether classifying images, generating text, or predicting stock prices, we're asking 'What is the probability of this output, given this input?' Understanding conditional probability deeply means understanding what ML models actually compute.
Conditional probability transforms probability from a static description of uncertainty into a dynamic framework for learning from evidence.
What's Next:
Conditional probability has one critical limitation: it tells us P(A|B) if we know P(A ∩ B) and P(B). But what if we only know P(B|A)? The answer is Bayes' Theorem—the cornerstone of Bayesian reasoning, which inverts conditional probabilities and enables learning from data. This is the subject of our next page.
You now command conditional probability—the mechanism by which evidence updates beliefs. You can compute conditional probabilities, avoid common fallacies, apply the chain rule to sequential events, and recognize that ML models fundamentally output conditional probabilities. With the law of total probability, you can marginalize over unknown variables. Next: Bayes' theorem will complete your toolkit for probabilistic reasoning.