We've seen that P(A|B) and P(B|A) are generally different quantities. But what if we know P(B|A) and need P(A|B)? This is one of the most common situations in practice: a medical test tells us P(positive | disease), but the patient cares about P(disease | positive).
Bayes' theorem provides the bridge from one conditional direction to the other. It is arguably the most important theorem in probability, and the foundation of Bayesian statistics and machine learning.
Named after Reverend Thomas Bayes (1702-1761), who first formulated a special case, the theorem in its general form is a simple algebraic consequence of the definition of conditional probability—yet its implications are profound and far-reaching.
By the end of this page, you will derive Bayes' theorem from first principles, understand its components (prior, likelihood, evidence, posterior), apply it to practical classification problems, and recognize its central role throughout probabilistic machine learning.
Bayes' theorem follows directly from the definition of conditional probability and the multiplication rule.
From the definition P(A|B) = P(A ∩ B) / P(B), we get:
P(A ∩ B) = P(A|B) · P(B)
But we can also write:
P(A ∩ B) = P(B|A) · P(A)
Since both expressions equal P(A ∩ B):
P(A|B) · P(B) = P(B|A) · P(A)
Solving for P(A|B):

P(A|B) = P(B|A) · P(A) / P(B)

This is Bayes' Theorem in its basic form.
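The two-line derivation can be checked numerically. Here is a minimal sketch; the joint and marginal probabilities are invented for illustration:

```python
# Hypothetical probabilities for two events A and B (illustrative numbers)
p_a_and_b = 0.12   # P(A ∩ B)
p_a = 0.3          # P(A)
p_b = 0.4          # P(B)

# Both conditionals come from the same joint probability
p_a_given_b = p_a_and_b / p_b   # P(A|B)
p_b_given_a = p_a_and_b / p_a   # P(B|A)

# Bayes' theorem recovers P(A|B) from P(B|A) and the marginals
bayes = p_b_given_a * p_a / p_b
print(bayes, p_a_given_b)   # the two values agree
```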
The proof is just two lines of algebra! Yet this simple result enables sophisticated reasoning about uncertain hypotheses, parameter estimation, medical diagnosis, spam filtering, and much of modern machine learning. Sometimes the most profound ideas have the simplest derivations.
We can expand P(B) using the law of total probability:
P(B) = P(B|A) · P(A) + P(B|A^c) · P(A^c)
Substituting:

P(A|B) = P(B|A) · P(A) / [P(B|A) · P(A) + P(B|A^c) · P(A^c)]

This form is useful when P(B) isn't given directly but can be computed from the components.
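A quick numeric check of the expanded form; the prevalence, sensitivity, and false-positive rate below are assumed for illustration:

```python
p_a = 0.01              # P(A): prior, e.g. disease prevalence (illustrative)
p_b_given_a = 0.99      # P(B|A): sensitivity
p_b_given_not_a = 0.05  # P(B|A^c): false-positive rate

# Law of total probability gives the denominator P(B)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

posterior = p_b_given_a * p_a / p_b
print(f"P(B) = {p_b:.4f}, P(A|B) = {posterior:.3f}")  # posterior ≈ 0.167
```

Despite a 99% sensitive test, the posterior is only about 1/6, because false positives from the large healthy population dominate.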
If we have mutually exclusive, exhaustive hypotheses H₁, H₂, ..., Hₙ:
P(Hᵢ | E) = P(E | Hᵢ) · P(Hᵢ) / Σⱼ P(E | Hⱼ) · P(Hⱼ)
This is the form used in classification: given evidence E (features), which hypothesis Hᵢ (class) is most probable?
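A sketch with three made-up hypotheses, showing how the shared denominator normalizes the posteriors so they sum to 1:

```python
import numpy as np

# Illustrative priors and likelihoods for hypotheses H1, H2, H3
priors = np.array([0.5, 0.3, 0.2])       # P(H_i), must sum to 1
likelihoods = np.array([0.1, 0.4, 0.8])  # P(E | H_i)

numerators = likelihoods * priors
posteriors = numerators / numerators.sum()  # divide by P(E) = Σⱼ P(E|Hⱼ)P(Hⱼ)

print(posteriors)        # H3 is most probable despite the smallest prior
print(posteriors.sum())  # 1.0
```

Note that H3, with the smallest prior, wins because its likelihood is high enough to overcome the prior.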
Each component of Bayes' theorem has a name and a specific interpretation:
Posterior = Likelihood × Prior / Evidence
Let's break down each term:
| Term | Name | Interpretation | Where It Comes From |
|---|---|---|---|
| P(H) | Prior | Probability of hypothesis BEFORE seeing evidence | Background knowledge, historical rates, initial beliefs |
| P(E|H) | Likelihood | Probability of evidence IF hypothesis is true | Model of how evidence is generated |
| P(E) | Evidence (Marginal Likelihood) | Total probability of observing this evidence | Normalization: ensures posterior sums to 1 |
| P(H|E) | Posterior | Probability of hypothesis AFTER seeing evidence | What we're trying to compute! |
Prior P(H): What we believed before seeing any data. If 1% of patients have a disease, P(disease) = 0.01.
Likelihood P(E|H): How well the evidence fits the hypothesis. If the disease causes a positive test 99% of the time, P(positive | disease) = 0.99.
Evidence P(E): How common is this evidence overall? This accounts for both true positives (disease + positive) and false positives (no disease + positive).
Posterior P(H|E): Our updated belief after seeing evidence. This is what we want for making decisions.
Bayes' theorem describes a belief update: start with a prior P(H), observe evidence E, and compute the posterior P(H|E). The posterior then serves as the prior for the next piece of evidence.
This iterative process is the essence of Bayesian learning.
P(H|E) ≠ P(E|H) × P(H) in general. The evidence P(E) in the denominator is crucial—it normalizes the product into a valid probability. Without it, the result might exceed 1 or not sum to 1 across all hypotheses.
Ignoring the prior (base rate) leads to wildly incorrect conclusions. This error is common in:
- Medical diagnosis (overconfidence in test results)
- Criminal justice (DNA matching with large suspect pools)
- Fraud detection (rare fraud + many transactions = many false alarms)
- ML classification (imbalanced classes)
Always consider the prior!
Bayes' theorem appears throughout machine learning in multiple forms and at multiple levels.
Classifiers compute P(Y | X), which can be written using Bayes:
P(Y = k | X) = P(X | Y = k) × P(Y = k) / P(X)
Generative models (like Naive Bayes, Gaussian Discriminant Analysis) learn the likelihood P(X | Y) and prior P(Y), then apply Bayes.
Discriminative models (like logistic regression, neural nets) learn P(Y | X) directly, implicitly combining the Bayesian pieces.
Bayesian learning treats model parameters θ as random variables:
P(θ | Data) = P(Data | θ) × P(θ) / P(Data)
This is fundamentally different from frequentist approaches where θ is a fixed (unknown) value.
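To make the parameter-level formula concrete, here is a hedged grid-approximation sketch for estimating a coin's bias θ. The flat prior and the 7-heads-in-10-flips data are invented for illustration:

```python
import numpy as np

# Discretize θ on a fine grid (grid approximation to the posterior)
theta = np.linspace(0.001, 0.999, 999)

prior = np.ones_like(theta)   # flat prior P(θ) (assumed)
prior /= prior.sum()

heads, flips = 7, 10          # illustrative data D
likelihood = theta**heads * (1 - theta)**(flips - heads)  # P(D | θ)

unnormalized = likelihood * prior
posterior = unnormalized / unnormalized.sum()   # divide by P(D)

map_estimate = theta[np.argmax(posterior)]      # with a flat prior, MAP = MLE
posterior_mean = (theta * posterior).sum()
print(f"MAP ≈ {map_estimate:.3f}, posterior mean ≈ {posterior_mean:.3f}")
```

The posterior is a full distribution over θ, not a single number; the MAP estimate and posterior mean are just two summaries of it.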
| Level | What's the 'Hypothesis'? | What's the 'Evidence'? | Application |
|---|---|---|---|
| Classification | Class label Y | Features X | P(Y|X) for prediction |
| Parameter Learning | Model parameters θ | Training data D | P(θ|D) for Bayesian learning |
| Model Selection | Model M | Data D | P(M|D) for model comparison |
| Causal Inference | Causal structure G | Observational data | P(G|data) for structure learning |
Modern Bayesian deep learning applies Bayes at the parameter level: predictions average over the posterior, P(y | x, Data) = ∫ P(y | x, θ) P(θ | Data) dθ.
The integral (marginalizing over posterior) is intractable for neural networks, so approximations (variational inference, Monte Carlo dropout, ensembles) are used.
L2 regularization (weight decay) is equivalent to a Gaussian prior on weights:
P(W) = N(0, σ²I)
L1 regularization corresponds to a Laplace prior. The regularization strength λ relates to the prior variance. Bayesian thinking reveals that 'regularization' is really 'prior belief that weights should be small.'
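One way to see the correspondence is to take the negative log of the parameter-level Bayes formula under the Gaussian prior (a sketch of the standard argument; σ² is the prior variance):

```latex
-\log P(W \mid \text{Data})
  = -\log P(\text{Data} \mid W) \;-\; \log P(W) \;+\; \text{const}
  = \underbrace{-\log P(\text{Data} \mid W)}_{\text{training loss}}
    \;+\; \frac{1}{2\sigma^2}\lVert W \rVert^2 \;+\; \text{const}
```

Minimizing this negative log-posterior is exactly minimizing the training loss plus λ‖W‖² with λ = 1/(2σ²): the MAP estimate coincides with the L2-regularized solution, and a smaller prior variance means stronger regularization.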
Bayes' theorem has an elegant form when expressed in terms of odds rather than probabilities.
The odds of event A is:
O(A) = P(A) / P(A^c) = P(A) / (1 - P(A))
For example, if P(A) = 0.75, then O(A) = 0.75 / 0.25 = 3 (or '3 to 1').
Posterior Odds = Prior Odds × Likelihood Ratio
O(H|E) = O(H) × P(E|H) / P(E|H^c)
The term P(E|H) / P(E|H^c) is called the likelihood ratio or Bayes factor.
Multiplicative updates: Each piece of evidence multiplies the odds by its likelihood ratio. Clean, composable.
Evidence aggregation: With independent evidence E₁, E₂, ...: O(H | E₁, E₂, ...) = O(H) × LR₁ × LR₂ × ...
Natural in log-space: Taking logs gives additive updates: log O(H|E) = log O(H) + log LR
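These properties can be sketched in a few lines; the prior and the likelihood ratios below are invented for illustration:

```python
import math

def logit(p: float) -> float:
    """Log-odds: log(p / (1-p))."""
    return math.log(p / (1 - p))

def sigmoid(x: float) -> float:
    """Inverse logit: maps log-odds back to a probability."""
    return 1 / (1 + math.exp(-x))

prior = 0.01                  # illustrative prior probability
log_odds = logit(prior)

# Two independent pieces of evidence with assumed likelihood ratios
for lr in [10.0, 5.0]:
    log_odds += math.log(lr)  # each piece of evidence adds its log-LR

posterior = sigmoid(log_odds)
print(f"posterior = {posterior:.3f}")
```

Multiplying the odds by 10 and then by 5 is the same as adding log 10 + log 5 in log-odds space; converting back with the sigmoid gives the posterior probability.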
The log-odds (or logit) is:
logit(P) = log[P / (1-P)]
This maps [0, 1] to (-∞, +∞) and is the basis of logistic regression.
Logistic regression models the log-odds as a linear function of features:
log[P(Y=1|X) / P(Y=0|X)] = w^T X + b
This is the log-odds form of Bayes with a linear log-likelihood ratio!
```python
import numpy as np
from typing import Tuple, Dict, List

# =============================================================
# Bayes' Theorem Implementations
# =============================================================

def bayes_theorem(
    prior: float,
    likelihood: float,
    false_positive_rate: float
) -> float:
    """
    Compute posterior probability using Bayes' theorem.

    P(H|E) = P(E|H) * P(H) / P(E)
    where P(E) = P(E|H)P(H) + P(E|¬H)P(¬H)

    Parameters:
        prior: P(H) - prior probability of hypothesis
        likelihood: P(E|H) - probability of evidence if H is true
        false_positive_rate: P(E|¬H) - probability of evidence if H is false

    Returns:
        P(H|E) - posterior probability
    """
    # Compute P(E) using total probability
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    if evidence == 0:
        return 0.0
    posterior = (likelihood * prior) / evidence
    return posterior


def bayes_multiple_hypotheses(
    priors: np.ndarray,
    likelihoods: np.ndarray
) -> np.ndarray:
    """
    Bayes' theorem for multiple mutually exclusive hypotheses.

    P(Hᵢ|E) = P(E|Hᵢ) * P(Hᵢ) / Σⱼ P(E|Hⱼ) * P(Hⱼ)

    Parameters:
        priors: Array of prior probabilities [P(H₁), P(H₂), ...]
        likelihoods: Array of likelihoods [P(E|H₁), P(E|H₂), ...]

    Returns:
        Array of posterior probabilities
    """
    assert len(priors) == len(likelihoods)
    assert np.isclose(priors.sum(), 1.0), "Priors must sum to 1"
    numerators = likelihoods * priors
    evidence = numerators.sum()
    posteriors = numerators / evidence
    return posteriors


def odds(probability: float) -> float:
    """Convert probability to odds: O(A) = P(A) / (1 - P(A))"""
    if probability >= 1:
        return float('inf')
    return probability / (1 - probability)


def probability_from_odds(o: float) -> float:
    """Convert odds to probability: P = O / (1 + O)"""
    if np.isinf(o):
        return 1.0
    return o / (1 + o)


def logit(p: float) -> float:
    """Log-odds (logit function): log(p / (1-p))"""
    if p <= 0:
        return float('-inf')
    if p >= 1:
        return float('inf')
    return np.log(p / (1 - p))


def sigmoid(x: float) -> float:
    """Inverse logit (sigmoid): 1 / (1 + exp(-x))"""
    return 1 / (1 + np.exp(-x))


def bayes_odds_form(
    prior_odds: float,
    likelihood_ratio: float
) -> Tuple[float, float]:
    """
    Bayes' theorem in odds form.

    Posterior Odds = Prior Odds × Likelihood Ratio

    Returns:
        (posterior_odds, posterior_probability)
    """
    posterior_odds = prior_odds * likelihood_ratio
    posterior_prob = probability_from_odds(posterior_odds)
    return posterior_odds, posterior_prob


def sequential_bayes(
    prior: float,
    observations: List[Tuple[float, float]]
) -> List[float]:
    """
    Apply Bayes' theorem sequentially for multiple observations.

    Parameters:
        prior: Initial prior P(H)
        observations: List of (likelihood_if_H, likelihood_if_not_H) tuples

    Returns:
        List of posterior probabilities after each observation
    """
    posteriors = []
    current = prior
    for likelihood_h, likelihood_not_h in observations:
        current = bayes_theorem(current, likelihood_h, likelihood_not_h)
        posteriors.append(current)
    return posteriors


# =============================================================
# Naive Bayes Classifier
# =============================================================

class NaiveBayesClassifier:
    """
    Simple Naive Bayes classifier demonstrating Bayes' theorem.

    Assumes binary features and binary classification.
    """

    def __init__(self, alpha: float = 1.0):
        """
        Parameters:
            alpha: Laplace smoothing parameter
        """
        self.alpha = alpha
        self.class_prior: Dict[int, float] = {}
        self.feature_likelihood: Dict[Tuple[int, int, int], float] = {}

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Train the classifier on binary feature matrix X and labels y."""
        n_samples, n_features = X.shape
        classes = np.unique(y)

        # Compute class priors
        for c in classes:
            self.class_prior[c] = (np.sum(y == c) + self.alpha) / (
                n_samples + self.alpha * len(classes)
            )

        # Compute feature likelihoods P(Xⱼ=v | Y=c)
        for c in classes:
            X_c = X[y == c]
            n_c = len(X_c)
            for j in range(n_features):
                for v in [0, 1]:
                    count = np.sum(X_c[:, j] == v)
                    self.feature_likelihood[(j, v, c)] = (count + self.alpha) / (
                        n_c + 2 * self.alpha
                    )

    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """Predict class probabilities using Bayes' theorem."""
        n_samples, n_features = X.shape
        classes = list(self.class_prior.keys())
        log_probs = np.zeros((n_samples, len(classes)))

        for i, c in enumerate(classes):
            # Start with log prior
            log_probs[:, i] = np.log(self.class_prior[c])
            # Add log likelihoods for each feature
            for j in range(n_features):
                for sample_idx in range(n_samples):
                    v = int(X[sample_idx, j])
                    log_probs[sample_idx, i] += np.log(self.feature_likelihood[(j, v, c)])

        # Normalize to get proper probabilities (softmax)
        log_probs -= np.max(log_probs, axis=1, keepdims=True)  # Numerical stability
        probs = np.exp(log_probs)
        probs /= probs.sum(axis=1, keepdims=True)
        return probs

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict most likely class."""
        probs = self.predict_proba(X)
        classes = list(self.class_prior.keys())
        return np.array([classes[i] for i in probs.argmax(axis=1)])


if __name__ == "__main__":
    # Example 1: Medical diagnosis
    print("=" * 60)
    print("Example 1: Medical Diagnosis")
    print("=" * 60)

    prior = 0.001          # Disease prevalence
    sensitivity = 0.99     # P(positive | disease)
    false_positive = 0.01  # P(positive | no disease)

    posterior = bayes_theorem(prior, sensitivity, false_positive)
    print(f"Prior P(disease): {prior}")
    print(f"P(positive | disease): {sensitivity}")
    print(f"P(positive | no disease): {false_positive}")
    print(f"Posterior P(disease | positive): {posterior:.4f} = {posterior*100:.1f}%")

    # Example 2: Odds form
    print("\n" + "=" * 60)
    print("Example 2: Odds Form")
    print("=" * 60)

    prior_odds = odds(prior)
    LR = sensitivity / false_positive
    post_odds, post_prob = bayes_odds_form(prior_odds, LR)
    print(f"Prior odds: {prior_odds:.4f}")
    print(f"Likelihood ratio: {LR:.1f}")
    print(f"Posterior odds: {post_odds:.4f}")
    print(f"Posterior probability: {post_prob:.4f}")

    # Example 3: Sequential updates
    print("\n" + "=" * 60)
    print("Example 3: Sequential Bayes Updates")
    print("=" * 60)

    # Multiple positive tests (assuming independence)
    observations = [(0.99, 0.01)] * 3  # Three positive tests
    posteriors = sequential_bayes(0.001, observations)
    print(f"Prior: {0.001}")
    for i, p in enumerate(posteriors, 1):
        print(f"After test {i}: P(disease) = {p:.4f}")

    # Example 4: Naive Bayes on synthetic data
    print("\n" + "=" * 60)
    print("Example 4: Naive Bayes Classifier")
    print("=" * 60)

    # Create simple dataset
    np.random.seed(42)
    n_samples = 200
    X = np.random.binomial(1, 0.5, (n_samples, 4))
    # Class 1 if features 0 and 1 are both 1
    y = ((X[:, 0] == 1) & (X[:, 1] == 1)).astype(int)

    clf = NaiveBayesClassifier()
    clf.fit(X, y)

    # Test predictions
    test_X = np.array([
        [1, 1, 0, 0],  # Should predict 1
        [0, 0, 1, 1],  # Should predict 0
        [1, 0, 1, 0],  # Uncertain
    ])
    probs = clf.predict_proba(test_X)
    preds = clf.predict(test_X)
    print("Test predictions:")
    for x, prob, pred in zip(test_X, probs, preds):
        print(f"  {x} -> P(class=1) = {prob[1]:.3f}, predicted: {pred}")
```

Bayes' theorem is simple to state but subtle to apply correctly. Here are common errors and their corrections.
Error: Treating the likelihood P(E|H) as if it were the posterior P(H|E).
Example: 'The test is 99% accurate, so if you test positive, there's a 99% chance you have the disease.'
Reality: This ignores the disease's base rate. With a rare disease, most positives are false positives despite the 'accurate' test.
Error: Using a prior that doesn't match the relevant population.
Example: Using the general population's disease rate when the patient already has symptoms that pre-select a higher-risk group.
Solution: The prior should reflect everything known BEFORE the current evidence, including any pre-screening.
Error: Computing P(E|H) when you need P(H|E), or vice versa.
Example: 'Most spam emails contain the word "free". Therefore, an email with "free" is probably spam.'

Reality: P("free" | spam) being high doesn't mean P(spam | "free") is high. Many legitimate emails also contain "free".
Error: Treating a positive test as proof of the hypothesis.
Bayesian Reality: Evidence shifts probabilities; it rarely proves or disproves. Even with strong evidence (high likelihood ratio), a very low prior keeps the posterior modest.
The fundamental lesson of Bayes' theorem is that these are different quantities:
- P(wet streets | rain) ≈ 1 (rain makes streets wet)
- P(rain | wet streets) << 1 (streets could be wet from many causes)
Bayes' theorem is the ONLY valid way to convert one into the other.
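The asymmetry can be checked directly from a joint table; the numbers below are invented for illustration:

```python
# Hypothetical joint probabilities for rain and street wetness
p = {
    ("rain", "wet"): 0.10,
    ("rain", "dry"): 0.00,     # rain essentially always wets the streets
    ("no_rain", "wet"): 0.15,  # sprinklers, street cleaning, ...
    ("no_rain", "dry"): 0.75,
}

p_wet_given_rain = p[("rain", "wet")] / (p[("rain", "wet")] + p[("rain", "dry")])
p_rain_given_wet = p[("rain", "wet")] / (p[("rain", "wet")] + p[("no_rain", "wet")])

print(p_wet_given_rain)  # 1.0
print(p_rain_given_wet)  # 0.4
```

The same joint distribution yields P(wet | rain) = 1 but P(rain | wet) = 0.4: conditioning in the two directions divides the same joint probability by different marginals.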
Bayes' theorem is mathematical fact, but its interpretation divides statisticians into 'Bayesian' and 'frequentist' camps.
| Aspect | Frequentist | Bayesian |
|---|---|---|
| Point estimation | Maximum Likelihood Estimate (MLE) | Maximum A Posteriori (MAP) or Posterior Mean |
| Interval estimation | Confidence interval | Credible interval |
| Model comparison | Likelihood ratio tests, AIC | Bayes factors, marginal likelihood |
| Regularization view | Penalty term | Prior distribution |
| Handling of parameters | Fixed unknown constants | Random variables with distributions |
| Sequential updates | Requires full re-analysis | Natural: posterior becomes new prior |
Most practitioners today use both Bayesian and frequentist tools depending on the problem. Deep learning is mostly frequentist (MLE via SGD), but Bayesian neural networks, Gaussian processes, and probabilistic programming bring Bayesian methods to modern ML. Understanding both perspectives enriches your toolkit.
Bayes' theorem is the mathematical formalization of learning from evidence—the essence of what machine learning systems do.
Module Complete!
You have now mastered the foundations of probability theory essential for machine learning.
These concepts form the bedrock upon which all probabilistic machine learning is built—from simple Naive Bayes classifiers to sophisticated Bayesian neural networks and generative models.
Congratulations! You now possess a rigorous understanding of probability theory's foundations. You can reason precisely about uncertainty, update beliefs correctly with Bayes' theorem, recognize independence structures, and understand why probabilistic thinking is at the heart of machine learning. You're prepared for the next module: Random Variables and Distributions.