At the heart of machine learning lies a deceptively simple question: Given observed data, how do we determine the parameters of the underlying model that generated it? This is the estimation problem, and its solution forms the theoretical backbone of nearly every learning algorithm you'll ever encounter.
Imagine you flip a coin 100 times and observe 67 heads. What's the probability of heads for this coin? Intuitively, you might guess 0.67—but can you justify this mathematically? Why not 0.5 (the fair coin assumption)? Why not 0.65 or 0.70? Maximum Likelihood Estimation (MLE) provides the rigorous framework to answer this question and, more importantly, generalizes to arbitrarily complex models.
By the end of this page, you will understand the philosophical foundations of MLE, derive the likelihood function from first principles, master the log-likelihood transformation technique, apply MLE to classic distributions (Bernoulli, Gaussian, Exponential), and understand why MLE is the workhorse of modern machine learning—from logistic regression to deep learning.
Before diving into mathematics, we must understand what we're trying to achieve with parameter estimation. This philosophical grounding will guide our technical development and help us recognize when different methods are appropriate.
The fundamental assumption:
We assume our data comes from some underlying probability distribution characterized by unknown parameters. Our goal is to infer these parameters from observations. This is an inverse problem—we observe outputs (data) and want to determine inputs (parameters).
Consider these concrete scenarios:
| Scenario | Data Observed | Unknown Parameter(s) | True Distribution |
|---|---|---|---|
| Biased coin | Sequence of H/T | Probability p of heads | Bernoulli(p) |
| Sensor readings | Noisy measurements | True value μ, noise σ² | Gaussian(μ, σ²) |
| Customer arrivals | Time between customers | Rate parameter λ | Exponential(λ) |
| Spam classification | Email features x, labels y | Weights w | Logistic model |
| Image generation | Pixel values | Neural network weights θ | Generative model |
Two competing philosophies:
There are two fundamental approaches to parameter estimation:
Frequentist approach: Parameters are fixed (but unknown) constants. We seek a point estimate—a single "best" value chosen by optimizing a criterion computed from the data. MLE falls in this camp.
Bayesian approach: Parameters are random variables with their own probability distributions. We maintain full distributions over parameters, updating our beliefs as data arrives. Maximum A Posteriori (MAP) estimation bridges these views.
This page focuses on the frequentist MLE approach. We'll explore MAP in the next page, where the complementary perspectives become clear.
MLE is simpler conceptually and computationally. It requires no prior assumptions about parameters. Most classical ML algorithms (linear regression, logistic regression, neural networks with cross-entropy loss) are fundamentally MLE in disguise. Understanding MLE deeply prepares you for the Bayesian extensions.
The likelihood function is the mathematical object at the heart of MLE. Its definition is deceptively simple but has profound implications.
Definition:
Given observed data D = {x₁, x₂, ..., xₙ} and a parametric model with parameters θ, the likelihood function is:
$$\mathcal{L}(\theta | \mathbf{D}) = P(\mathbf{D} | \theta)$$
This reads as: "the likelihood of parameters θ given data D equals the probability of observing data D given parameters θ."
Critical distinction: Probability vs. Likelihood
This is where many learners stumble. The mathematical formula is the same, but the interpretation differs: probability fixes the parameters θ and asks how likely different datasets are—P(D | θ) as a function of D—while likelihood fixes the observed data D and asks how well different parameter values explain it—L(θ | D) as a function of θ. In probability, the data varies; in likelihood, the parameters vary.
The i.i.d. assumption:
When data points are independent and identically distributed (i.i.d.), the joint probability factorizes into a product:
$$\mathcal{L}(\theta | \mathbf{D}) = P(x_1, x_2, ..., x_n | \theta) = \prod_{i=1}^{n} P(x_i | \theta)$$
This is the key computational enabler. Without independence, we'd need to model complex dependencies between all data points—often intractable. The i.i.d. assumption turns an n-dimensional problem into n one-dimensional problems multiplied together.
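As a tiny illustration (a minimal sketch; the data values and the N(0, 1) model are made up for demonstration), the joint likelihood of i.i.d. points is just the product of per-point densities—equivalently, the exponential of the summed log-densities:

```python
import numpy as np
from scipy import stats

# Hypothetical example: three observations assumed i.i.d. N(0, 1)
x = np.array([0.5, -1.2, 0.3])
mu, sigma = 0.0, 1.0

# Joint likelihood under independence: product of per-point densities
per_point = stats.norm.pdf(x, loc=mu, scale=sigma)
joint_likelihood = np.prod(per_point)

# Equivalent computation in log-space (foreshadowing the next section)
log_likelihood = np.sum(stats.norm.logpdf(x, loc=mu, scale=sigma))

print(joint_likelihood)        # product of individual densities
print(np.exp(log_likelihood))  # matches the product above
```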
Example: Bernoulli likelihood
Consider n coin flips with unknown probability p of heads. If we observe k heads and (n-k) tails:
$$\mathcal{L}(p | k, n) = \binom{n}{k} p^k (1-p)^{n-k}$$
Compare candidate values p = 0.5, 0.6, and 0.7 with n = 100 and k = 67 (a quick computation appears below). The likelihood at p = 0.7 is vastly higher than at p = 0.5—quantifying our intuition that p ≈ 0.67 is more plausible than p = 0.5.
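A quick check of these likelihood values (a minimal sketch using scipy's binomial PMF):

```python
import numpy as np
from scipy import stats

n, k = 100, 67
for p in (0.5, 0.6, 0.7):
    likelihood = stats.binom.pmf(k, n, p)  # C(n, k) p^k (1-p)^(n-k)
    print(f"p = {p:.1f}: L = {likelihood:.3e}")
# The likelihood peaks near p = k/n = 0.67; p = 0.5 is orders of magnitude less likely.
```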
A common error is treating L(θ|D) as a probability distribution over θ. It's not! The likelihood doesn't integrate to 1 over θ, and we cannot make statements like 'the probability that θ = 0.7 is 50%' using MLE alone. This is a key limitation that MAP addresses.
With the likelihood function defined, the MLE principle is elegantly simple:
MLE Principle:
Choose the parameter value θ that maximizes the likelihood of the observed data:
$$\hat{\theta}_{MLE} = \arg\max_{\theta} \mathcal{L}(\theta | \mathbf{D}) = \arg\max_{\theta} \prod_{i=1}^{n} P(x_i | \theta)$$
The hat notation (θ̂) denotes an estimate—our best guess based on data, as opposed to the true (unknowable) parameter θ*.
Intuition:
If we must pick a single value for θ, we should pick the one that makes the observed data most probable. Any other choice would imply we think the data we actually saw was less likely than data we didn't see—a philosophically awkward position.
The optimization challenge:
Finding the maximum of L(θ) requires calculus. We seek:
$$\frac{\partial \mathcal{L}(\theta)}{\partial \theta} = 0$$
However, differentiating products is messy. This motivates the log-likelihood transformation.
The Log-Likelihood Transformation:
The logarithm is a monotonically increasing function: if a > b, then log(a) > log(b). This means maximizing L(θ) is equivalent to maximizing log L(θ):
$$\hat{\theta}_{MLE} = \arg\max_{\theta} \mathcal{L}(\theta) = \arg\max_{\theta} \ell(\theta)$$
where ℓ(θ) = log L(θ) is the log-likelihood.
Why log-likelihood is computationally superior:
```python
import numpy as np

# Numerical instability with raw likelihood
def compute_likelihood_naive(data, theta):
    """Compute likelihood directly - NUMERICALLY UNSTABLE"""
    likelihood = 1.0
    for x in data:
        likelihood *= bernoulli_pmf(x, theta)  # Products of small numbers
    return likelihood  # Will underflow to 0 for large n!

# Numerical stability with log-likelihood
def compute_log_likelihood(data, theta):
    """Compute log-likelihood - NUMERICALLY STABLE"""
    log_likelihood = 0.0
    for x in data:
        log_likelihood += np.log(bernoulli_pmf(x, theta))
    return log_likelihood  # Never underflows!

def bernoulli_pmf(x, p):
    """Bernoulli probability mass function"""
    return p if x == 1 else (1 - p)

# Demonstration
np.random.seed(42)
n = 1000  # 1000 coin flips
true_p = 0.7
data = np.random.binomial(1, true_p, size=n)

# Naive likelihood computation - FAILS
naive_likelihood = compute_likelihood_naive(data, 0.7)
print(f"Naive likelihood: {naive_likelihood}")  # Will print 0.0 (underflow!)

# Log-likelihood computation - WORKS
log_likelihood = compute_log_likelihood(data, 0.7)
print(f"Log-likelihood: {log_likelihood:.4f}")  # Stable result

# We can recover likelihood via exp if needed (for small n)
# But typically we just work in log-space
```

Let's derive the MLE for the simplest non-trivial case: estimating the bias of a coin. This example illustrates the complete MLE workflow.
Problem setup: we observe n independent coin flips x₁, x₂, ..., xₙ, each xᵢ ∈ {0, 1} (1 = heads), modeled as i.i.d. Bernoulli(p) with unknown probability of heads p.
Step 1: Write the likelihood function
$$\mathcal{L}(p | \mathbf{D}) = \prod_{i=1}^{n} P(x_i | p) = \prod_{i=1}^{n} p^{x_i}(1-p)^{1-x_i}$$
Simplifying (let k = Σxᵢ = number of heads):
$$\mathcal{L}(p) = p^k (1-p)^{n-k}$$
Step 2: Transform to log-likelihood
$$\ell(p) = \log \mathcal{L}(p) = k \log p + (n-k) \log(1-p)$$
Step 3: Differentiate and set to zero
$$\frac{d\ell}{dp} = \frac{k}{p} - \frac{n-k}{1-p} = 0$$
Step 4: Solve for p
Multiplying through: $$k(1-p) = (n-k)p$$ $$k - kp = np - kp$$ $$k = np$$ $$\boxed{\hat{p}_{MLE} = \frac{k}{n}}$$
The MLE for a Bernoulli parameter is simply the sample proportion!
This matches intuition perfectly: if you flip a coin 100 times and see 67 heads, your best estimate for P(heads) is 67/100 = 0.67.
Step 5: Verify it's a maximum (not minimum)
We should check the second derivative:
$$\frac{d^2\ell}{dp^2} = -\frac{k}{p^2} - \frac{n-k}{(1-p)^2}$$
For p ∈ (0, 1) with k > 0 and k < n, this is always negative, confirming we have a maximum.
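If you prefer to double-check the algebra symbolically, here is a minimal sketch using sympy (a verification aid, not part of the derivation above):

```python
import sympy as sp

p, k, n = sp.symbols('p k n', positive=True)
ell = k * sp.log(p) + (n - k) * sp.log(1 - p)   # Bernoulli log-likelihood

# Stationary point of dl/dp
critical = sp.solve(sp.diff(ell, p), p)
print(critical)  # [k/n]

# Second derivative: negative on (0, 1) whenever 0 < k < n
second = sp.simplify(sp.diff(ell, p, 2))
print(second)    # e.g. -k/p**2 - (n - k)/(1 - p)**2 (possibly rearranged by sympy)
```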
Edge cases: if k = 0 (no heads observed), the MLE is p̂ = 0; if k = n (all heads), the MLE is p̂ = 1. In either case the model declares the unobserved outcome impossible.
These extreme estimates illustrate a weakness of MLE with limited data—there's no regularization pulling the estimate toward reasonable values. MAP estimation addresses this.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize_scalar

def bernoulli_log_likelihood(p, data):
    """
    Compute log-likelihood for Bernoulli distribution.

    Parameters:
        p: probability parameter (0 < p < 1)
        data: array of 0s and 1s

    Returns:
        log-likelihood value
    """
    if p <= 0 or p >= 1:
        return -np.inf  # Log of 0 is -infinity
    k = np.sum(data)
    n = len(data)
    return k * np.log(p) + (n - k) * np.log(1 - p)

def bernoulli_mle_closed_form(data):
    """Closed-form MLE for Bernoulli parameter."""
    return np.mean(data)

def bernoulli_mle_numerical(data):
    """Numerical optimization for MLE (for verification)."""
    result = minimize_scalar(
        lambda p: -bernoulli_log_likelihood(p, data),
        bounds=(1e-10, 1 - 1e-10),
        method='bounded'
    )
    return result.x

# Generate synthetic data
np.random.seed(42)
true_p = 0.7
n = 100
data = np.random.binomial(1, true_p, size=n)

# Compute MLE both ways
mle_closed = bernoulli_mle_closed_form(data)
mle_numerical = bernoulli_mle_numerical(data)

print(f"True parameter: {true_p}")
print(f"Observed proportion: {np.mean(data):.4f}")
print(f"MLE (closed-form): {mle_closed:.4f}")
print(f"MLE (numerical): {mle_numerical:.4f}")

# Visualize the log-likelihood function
p_range = np.linspace(0.01, 0.99, 1000)
log_likelihoods = [bernoulli_log_likelihood(p, data) for p in p_range]

plt.figure(figsize=(10, 6))
plt.plot(p_range, log_likelihoods, 'b-', linewidth=2)
plt.axvline(x=mle_closed, color='r', linestyle='--', label=f'MLE = {mle_closed:.3f}')
plt.axvline(x=true_p, color='g', linestyle=':', label=f'True p = {true_p}')
plt.xlabel('Parameter p', fontsize=12)
plt.ylabel('Log-Likelihood ℓ(p)', fontsize=12)
plt.title('Log-Likelihood Function for Bernoulli MLE', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
```

The Gaussian (Normal) distribution is perhaps the most important in all of statistics and machine learning. Let's derive the MLE for both parameters: mean μ and variance σ².
Problem setup: we observe n i.i.d. samples x₁, x₂, ..., xₙ drawn from N(μ, σ²), with both the mean μ and the variance σ² unknown.
Step 1: Write the likelihood function
The Gaussian PDF is:
$$P(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
For n i.i.d. observations:
$$\mathcal{L}(\mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)$$
$$= \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2\right)$$
Step 2: Transform to log-likelihood
$$\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2$$
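As a quick sanity check on this expression (a minimal sketch; the data values and candidate parameters are arbitrary), it should agree with the sum of per-point log-densities from scipy:

```python
import numpy as np
from scipy import stats

x = np.array([1.2, 0.7, 2.1, 1.5])  # arbitrary example data
mu, sigma2 = 1.0, 0.5               # arbitrary candidate parameters
n = len(x)

# Closed-form log-likelihood from the derivation above
ell = (-n / 2 * np.log(2 * np.pi)
       - n / 2 * np.log(sigma2)
       - np.sum((x - mu) ** 2) / (2 * sigma2))

# Reference: sum of per-point Gaussian log-densities
ell_ref = np.sum(stats.norm.logpdf(x, loc=mu, scale=np.sqrt(sigma2)))

print(np.isclose(ell, ell_ref))  # True
```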
Step 3: Differentiate with respect to μ
$$\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^n(x_i - \mu) = 0$$
$$\sum_{i=1}^n x_i - n\mu = 0$$
$$\boxed{\hat{\mu}_{MLE} = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x}}$$
The MLE for the Gaussian mean is the sample mean!
Step 4: Differentiate with respect to σ²
Let τ = σ² for clarity:
$$\frac{\partial \ell}{\partial \tau} = -\frac{n}{2\tau} + \frac{1}{2\tau^2}\sum_{i=1}^n(x_i-\mu)^2 = 0$$
Substituting μ̂ = x̄:
$$\frac{n}{2\tau} = \frac{1}{2\tau^2}\sum_{i=1}^n(x_i-\bar{x})^2$$
$$\boxed{\hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2}$$
The MLE for variance is the sample variance with n in the denominator.
Important note on bias:
The MLE estimator σ̂²ₘₗₑ is biased! Its expected value is:
$$E[\hat{\sigma}^2_{MLE}] = \frac{n-1}{n}\sigma^2 \neq \sigma^2$$
This is why the unbiased sample variance uses (n-1) in the denominator (Bessel's correction). We'll explore bias in depth on Page 3.
```python
import numpy as np
from scipy import stats

def gaussian_mle(data):
    """
    Compute MLE for Gaussian distribution parameters.

    Parameters:
        data: array of observations

    Returns:
        mu_mle: MLE for mean
        sigma2_mle: MLE for variance (biased)
        sigma2_unbiased: unbiased variance estimator
    """
    n = len(data)
    mu_mle = np.mean(data)
    sigma2_mle = np.mean((data - mu_mle) ** 2)  # Biased (divides by n)
    sigma2_unbiased = np.var(data, ddof=1)      # Unbiased (divides by n-1)
    return mu_mle, sigma2_mle, sigma2_unbiased

# Verify MLE bias through simulation
np.random.seed(42)
true_mu = 5.0
true_sigma2 = 4.0  # σ = 2
n_samples = 20
n_simulations = 10000

mle_variances = []
unbiased_variances = []

for _ in range(n_simulations):
    data = np.random.normal(true_mu, np.sqrt(true_sigma2), size=n_samples)
    _, sigma2_mle, sigma2_unbiased = gaussian_mle(data)
    mle_variances.append(sigma2_mle)
    unbiased_variances.append(sigma2_unbiased)

print(f"True variance: {true_sigma2}")
print(f"Expected MLE variance (biased): {(n_samples-1)/n_samples * true_sigma2:.4f}")
print(f"Mean of MLE estimates: {np.mean(mle_variances):.4f}")
print(f"Mean of unbiased estimates: {np.mean(unbiased_variances):.4f}")

# The MLE systematically underestimates variance
# But bias decreases with sample size: (n-1)/n → 1 as n → ∞

# Practical example: Fitting a Gaussian to real data
sample_data = np.array([2.3, 3.1, 2.8, 4.2, 3.5, 3.0, 2.9, 3.8, 3.2, 3.4])
mu_hat, sigma2_hat_mle, sigma2_hat_unbiased = gaussian_mle(sample_data)

print(f"\nFitted Gaussian parameters:")
print(f"  μ̂ = {mu_hat:.4f}")
print(f"  σ̂² (MLE, biased) = {sigma2_hat_mle:.4f}")
print(f"  σ̂² (unbiased) = {sigma2_hat_unbiased:.4f}")

# Compare with scipy's implementation
mu_scipy, sigma_scipy = stats.norm.fit(sample_data)
print(f"  scipy's fit: μ = {mu_scipy:.4f}, σ = {sigma_scipy:.4f}")
print(f"  scipy's σ²: {sigma_scipy**2:.4f}")  # scipy uses MLE (biased)
```

When you minimize squared error in linear regression, you're performing MLE under the assumption of Gaussian noise! The OLS solution is exactly the MLE solution for y = Xβ + ε where ε ~ N(0, σ²I). This connection explains why squared loss is so fundamental—it's not arbitrary, but emerges from a probabilistic model.
The Exponential distribution models waiting times and is fundamental in reliability analysis, queueing theory, and survival analysis.
Problem setup: we observe n i.i.d. waiting times x₁, x₂, ..., xₙ > 0, modeled as Exponential(λ) with unknown rate parameter λ.
The Exponential PDF:
$$P(x | \lambda) = \lambda e^{-\lambda x}, \quad x > 0$$
The mean of this distribution is 1/λ.
Step 1: Write the log-likelihood
$$\ell(\lambda) = \sum_{i=1}^n \log(\lambda e^{-\lambda x_i}) = n\log\lambda - \lambda\sum_{i=1}^n x_i$$
Step 2: Differentiate and solve
$$\frac{d\ell}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^n x_i = 0$$
$$\boxed{\hat{\lambda}_{MLE} = \frac{n}{\sum_{i=1}^n x_i} = \frac{1}{\bar{x}}}$$
The MLE for the rate is the reciprocal of the sample mean!
Intuition: If you observe average waiting times of 5 minutes, your best estimate for the rate is 1/5 = 0.2 events per minute.
Verify it's a maximum:
$$\frac{d^2\ell}{d\lambda^2} = -\frac{n}{\lambda^2} < 0 \text{ for all } \lambda > 0 \checkmark$$
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def exponential_mle(data):
    """
    Compute MLE for Exponential distribution rate parameter.

    Parameters:
        data: array of positive observations (waiting times)

    Returns:
        lambda_mle: MLE for rate parameter
    """
    return 1.0 / np.mean(data)

# Real-world example: Customer service waiting times
np.random.seed(42)
true_lambda = 0.5  # True rate: 0.5 customers per minute (mean wait = 2 min)
n = 50

# Simulate waiting times
waiting_times = np.random.exponential(scale=1/true_lambda, size=n)

# Compute MLE
lambda_hat = exponential_mle(waiting_times)
mean_wait = np.mean(waiting_times)

print(f"True rate parameter: λ = {true_lambda}")
print(f"True mean waiting time: {1/true_lambda} minutes")
print(f"Observed mean waiting time: {mean_wait:.3f} minutes")
print(f"MLE rate estimate: λ̂ = {lambda_hat:.4f}")

# Visualize the fit
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Histogram with fitted distribution
x = np.linspace(0, max(waiting_times) * 1.1, 200)
axes[0].hist(waiting_times, bins=15, density=True, alpha=0.7,
             label='Observed data', color='steelblue')
axes[0].plot(x, stats.expon.pdf(x, scale=1/lambda_hat), 'r-', lw=2,
             label=f'MLE fit (λ̂={lambda_hat:.3f})')
axes[0].plot(x, stats.expon.pdf(x, scale=1/true_lambda), 'g--', lw=2,
             label=f'True distribution (λ={true_lambda})')
axes[0].set_xlabel('Waiting Time (minutes)')
axes[0].set_ylabel('Density')
axes[0].set_title('Exponential Distribution: Data vs MLE Fit')
axes[0].legend()

# Right: Log-likelihood function
lambda_range = np.linspace(0.1, 1.5, 200)
log_likelihoods = [len(waiting_times) * np.log(lam) - lam * np.sum(waiting_times)
                   for lam in lambda_range]

axes[1].plot(lambda_range, log_likelihoods, 'b-', lw=2)
axes[1].axvline(x=lambda_hat, color='r', linestyle='--', label=f'MLE = {lambda_hat:.3f}')
axes[1].axvline(x=true_lambda, color='g', linestyle=':', label=f'True λ = {true_lambda}')
axes[1].set_xlabel('Rate Parameter λ')
axes[1].set_ylabel('Log-Likelihood ℓ(λ)')
axes[1].set_title('Log-Likelihood Function')
axes[1].legend()

plt.tight_layout()
plt.show()
```

The examples above—Bernoulli, Gaussian, Exponential—are relatively simple. But MLE's true power emerges when applied to complex models. Nearly every supervised learning algorithm can be understood as MLE under an appropriate probabilistic interpretation.
Linear Regression as MLE:
Consider the model y = w^T x + ε where ε ~ N(0, σ²). The likelihood of observing output y given input x and weights w is:
$$P(y | \mathbf{x}, \mathbf{w}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - \mathbf{w}^T\mathbf{x})^2}{2\sigma^2}\right)$$
The negative log-likelihood (NLL) for n observations:
$$-\ell(\mathbf{w}) = \frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^n(y_i - \mathbf{w}^T\mathbf{x}_i)^2$$
Minimizing NLL is equivalent to minimizing Mean Squared Error! The OLS solution IS the MLE.
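To make this concrete, here is a minimal sketch with synthetic data (the data-generating weights and noise level are made up): minimizing the Gaussian negative log-likelihood numerically recovers, up to optimizer tolerance, the same weights as the closed-form least-squares solution.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic regression data (made up for illustration)
rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.3, size=n)

# Gaussian negative log-likelihood in w (sigma^2 fixed; it only rescales the objective)
def nll(w, sigma2=1.0):
    resid = y - X @ w
    return 0.5 * n * np.log(2 * np.pi * sigma2) + resid @ resid / (2 * sigma2)

w_mle = minimize(nll, x0=np.zeros(d)).x          # numerical MLE
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)    # closed-form OLS

print(np.allclose(w_mle, w_ols, atol=1e-4))      # True: the two solutions coincide
```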
Logistic Regression as MLE:
For binary classification, we model P(y=1|x,w) = σ(w^T x) where σ is the sigmoid function.
The likelihood for one observation: $$P(y | \mathbf{x}, \mathbf{w}) = \sigma(\mathbf{w}^T\mathbf{x})^y (1 - \sigma(\mathbf{w}^T\mathbf{x}))^{1-y}$$
The negative log-likelihood is the Binary Cross-Entropy Loss: $$-\ell(\mathbf{w}) = -\sum_{i=1}^n [y_i \log \sigma(\mathbf{w}^T\mathbf{x}_i) + (1-y_i) \log(1 - \sigma(\mathbf{w}^T\mathbf{x}_i))]$$
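A quick numerical check of this identity—a minimal sketch with made-up features, weights, and labels; it assumes scikit-learn is available for the reference log_loss computation:

```python
import numpy as np
from sklearn.metrics import log_loss

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Made-up features, weights, and labels, purely for illustration
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
w = np.array([0.8, -1.2])
p = sigmoid(X @ w)                       # model's predicted P(y=1 | x, w)
y = (rng.random(100) < p).astype(float)  # labels sampled from that model

# Negative log-likelihood of the Bernoulli model, written out directly
nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Binary cross-entropy as reported by scikit-learn (summed over samples)
bce = log_loss(y, p, normalize=False)

print(np.isclose(nll, bce))  # True: the two quantities coincide
```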
Neural Networks as MLE:
Deep learning with cross-entropy loss is MLE under a multinomial (softmax) output distribution. The network learns to maximize the likelihood of correct class probabilities.
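For instance, here is a minimal sketch (toy logits and labels, made up for illustration) showing that categorical cross-entropy is just the average negative log-probability a softmax model assigns to the true classes:

```python
import numpy as np

# Toy "network output": logits for 4 classes on 3 examples (values made up)
logits = np.array([[ 2.0, 0.5, -1.0,  0.1],
                   [ 0.2, 1.3,  0.4, -0.7],
                   [-0.5, 0.0,  2.2,  1.1]])
y = np.array([0, 1, 2])  # true class indices

# Softmax probabilities (stable form: subtract the row-wise max)
z = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)

# Cross-entropy loss = average negative log-likelihood of the true classes;
# minimizing this over network weights is MLE under a categorical output model
nll = -np.mean(np.log(probs[np.arange(len(y)), y]))
print(nll)
```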
| Algorithm | Probabilistic Model | MLE Objective | Standard Loss Name |
|---|---|---|---|
| Linear Regression | y = w^Tx + ε, ε~N(0,σ²) | Minimize NLL | Mean Squared Error |
| Logistic Regression | P(y\|x) = σ(w^Tx) | Minimize NLL | Binary Cross-Entropy |
| Softmax Classifier | P(y\|x) = softmax(Wx) | Minimize NLL | Categorical Cross-Entropy |
| Poisson Regression | P(y\|x) = Poisson(exp(w^Tx)) | Minimize NLL | Poisson Deviance |
| Gaussian Mixture Models | P(x) = Σₖ πₖ N(μₖ,Σₖ) | Maximize Likelihood (EM) | Log-Likelihood |
| Neural Language Models | P(wₜ\|w₁...wₜ₋₁) | Minimize NLL | Cross-Entropy / Perplexity |
Understanding MLE provides a unified lens through which to view machine learning. Loss functions aren't arbitrary choices—they emerge from probabilistic assumptions. When you understand the underlying distributions, you can derive appropriate losses for any problem, customize them for domain-specific needs, and understand when standard losses might fail.
MLE isn't just convenient—it has remarkable theoretical properties that make it the method of choice for many estimation problems. These properties explain why MLE dominates practical machine learning.
Key Asymptotic Properties (as n → ∞):
- Consistency: θ̂_MLE converges in probability to the true parameter θ*.
- Asymptotic normality: √n(θ̂_MLE − θ*) converges in distribution to N(0, I(θ*)⁻¹), which enables approximate confidence intervals.
- Asymptotic efficiency: the MLE attains the lowest variance achievable by any consistent estimator (the Cramér-Rao bound below).
- Invariance: if θ̂ is the MLE of θ, then g(θ̂) is the MLE of g(θ) for any function g.
The Fisher Information:
The Fisher Information measures how much the likelihood function curves around the true parameter—essentially, how much information the data provides about θ:
$$I(\theta) = -E\left[\frac{\partial^2 \log P(X|\theta)}{\partial \theta^2}\right] = E\left[\left(\frac{\partial \log P(X|\theta)}{\partial \theta}\right)^2\right]$$
High Fisher Information → sharply peaked likelihood → precise estimates are possible.
Low Fisher Information → flat likelihood → estimation is difficult.
The Cramér-Rao Lower Bound:
For any unbiased estimator θ̂:
$$\text{Var}(\hat{\theta}) \geq \frac{1}{nI(\theta)}$$
MLE asymptotically achieves this bound—it extracts the maximum possible information from data.
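A quick simulation makes this concrete—a minimal sketch for the Bernoulli case, where I(p) = 1/(p(1−p)) and the MLE p̂ = k/n in fact attains the bound even at finite n:

```python
import numpy as np

# Empirical check for the Bernoulli model, where I(p) = 1 / (p(1 - p))
rng = np.random.default_rng(0)
p_true, n, n_sims = 0.7, 200, 20000

# Sampling distribution of the MLE p_hat = k / n
p_hats = rng.binomial(n, p_true, size=n_sims) / n

crlb = p_true * (1 - p_true) / n  # 1 / (n * I(p))
print(f"Empirical Var(p_hat): {p_hats.var():.6f}")
print(f"Cramér-Rao bound:     {crlb:.6f}")
# The two agree closely: for this model the MLE achieves the bound.
```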
These properties are asymptotic—they hold as n → ∞. In finite samples, MLE can be biased (as we saw with Gaussian variance), sensitive to outliers, prone to overfitting without regularization, and may converge to local optima in non-convex problems. Understanding these limitations motivates the study of MAP estimation and regularization techniques.
Moving from theory to practice requires attention to several computational and numerical details:
1. Numerical Optimization:
For complex models without closed-form solutions, we use iterative optimization: gradient descent and stochastic gradient descent (the workhorse of deep learning), Newton-Raphson and quasi-Newton methods such as L-BFGS when second-order information is affordable, and the Expectation-Maximization (EM) algorithm for latent-variable models such as Gaussian mixtures. A generic workflow is sketched below.
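The sketch below illustrates the workflow with scipy.optimize (a minimal sketch: the Gaussian model is used only so the result can be checked against its closed form, and the log-σ reparameterization is one common way to keep σ positive during unconstrained optimization):

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.5, size=500)  # synthetic data

# Negative log-likelihood; parameterize sigma through log_sigma so it stays positive
def neg_log_likelihood(params):
    mu, log_sigma = params
    return -np.sum(stats.norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), method='L-BFGS-B')
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])

print(f"Numerical MLE:  mu = {mu_hat:.4f}, sigma = {sigma_hat:.4f}")
print(f"Closed-form:    mu = {data.mean():.4f}, sigma = {data.std():.4f}")  # std() with ddof=0 is the MLE
```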
2. Numerical Stability:
```python
import numpy as np

# ========================================
# Log-Sum-Exp Trick for Numerical Stability
# ========================================

def log_sum_exp_naive(x):
    """UNSTABLE: Will overflow for large values"""
    return np.log(np.sum(np.exp(x)))

def log_sum_exp_stable(x):
    """STABLE: Subtracts maximum before exp"""
    max_x = np.max(x)
    return max_x + np.log(np.sum(np.exp(x - max_x)))

# Demonstration
large_values = np.array([1000, 1001, 1002])
print(f"Naive log-sum-exp: {log_sum_exp_naive(large_values)}")    # inf!
print(f"Stable log-sum-exp: {log_sum_exp_stable(large_values)}")  # Correct

# ========================================
# Safe Log Probability
# ========================================

def safe_log(x, eps=1e-10):
    """Prevent log(0) by clipping to minimum value"""
    return np.log(np.clip(x, eps, 1.0))

def binary_cross_entropy_naive(y_true, y_pred):
    """UNSTABLE: Can produce -inf when y_pred = 0 or 1"""
    return -np.mean(y_true * np.log(y_pred) +
                    (1 - y_true) * np.log(1 - y_pred))

def binary_cross_entropy_stable(y_true, y_pred, eps=1e-15):
    """STABLE: Clips predictions before log"""
    y_pred_clipped = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(
        y_true * np.log(y_pred_clipped) +
        (1 - y_true) * np.log(1 - y_pred_clipped)
    )

# Demonstration with edge case
y_true = np.array([1, 0, 1])
y_pred = np.array([1.0, 0.0, 0.99])  # Edge cases: exact 0 and 1

print(f"\nNaive BCE: {binary_cross_entropy_naive(y_true, y_pred)}")      # -inf or nan
print(f"Stable BCE: {binary_cross_entropy_stable(y_true, y_pred):.6f}")  # Finite

# ========================================
# Softmax with Temperature
# ========================================

def softmax_naive(logits):
    """UNSTABLE: Overflows for large logits"""
    exp_logits = np.exp(logits)
    return exp_logits / np.sum(exp_logits)

def softmax_stable(logits, temperature=1.0):
    """STABLE: Subtracts max and supports temperature scaling"""
    scaled_logits = logits / temperature
    shifted = scaled_logits - np.max(scaled_logits)
    exp_shifted = np.exp(shifted)
    return exp_shifted / np.sum(exp_shifted)

logits = np.array([1000, 1001, 1002])
print(f"\nNaive softmax: {softmax_naive(logits)}")   # [nan, nan, nan]
print(f"Stable softmax: {softmax_stable(logits)}")   # Correct probabilities
```

We've covered the foundational theory and practice of Maximum Likelihood Estimation. Let's consolidate the key insights:
- The likelihood L(θ|D) = P(D|θ) treats the data as fixed and the parameters as variable; MLE picks the θ that makes the observed data most probable.
- Working in log-space turns products into sums, simplifies the calculus, and avoids numerical underflow.
- For classic distributions, MLE yields intuitive closed forms: the sample proportion (Bernoulli), the sample mean and biased sample variance (Gaussian), and the reciprocal of the sample mean (Exponential).
- Standard loss functions—squared error, binary and categorical cross-entropy—are negative log-likelihoods under specific probabilistic models.
- MLE is consistent, asymptotically normal, and asymptotically efficient, but in finite samples it can be biased and prone to overfitting without regularization.
Looking ahead:
MLE treats parameters as fixed unknowns and seeks point estimates. But what if we want to express uncertainty about parameters? What if we have prior beliefs we want to incorporate? The next page introduces Maximum A Posteriori (MAP) estimation, which bridges MLE with Bayesian thinking and provides the foundation for regularization in machine learning.
You've mastered the foundations of Maximum Likelihood Estimation—the most important parameter estimation method in machine learning. You can now derive MLEs for common distributions, understand why standard loss functions take their forms, implement numerically stable MLE computations, and recognize the theoretical guarantees and limitations of this approach.