At the heart of machine learning lies a deceptively simple question: Given observed data, how do we determine the parameters of the underlying model that generated it? This is the estimation problem, and its solution forms the theoretical backbone of nearly every learning algorithm you'll ever encounter.
Imagine you flip a coin 100 times and observe 67 heads. What's the probability of heads for this coin? Intuitively, you might guess 0.67—but can you justify this mathematically? Why not 0.5 (the fair coin assumption)? Why not 0.65 or 0.70? Maximum Likelihood Estimation (MLE) provides the rigorous framework to answer this question and, more importantly, generalizes to arbitrarily complex models.
By the end of this page, you will understand the philosophical foundations of MLE, derive the likelihood function from first principles, master the log-likelihood transformation technique, apply MLE to classic distributions (Bernoulli, Gaussian, Exponential), and understand why MLE is the workhorse of modern machine learning—from logistic regression to deep learning.
Before diving into mathematics, we must understand what we're trying to achieve with parameter estimation. This philosophical grounding will guide our technical development and help us recognize when different methods are appropriate.
The fundamental assumption:
We assume our data comes from some underlying probability distribution characterized by unknown parameters. Our goal is to infer these parameters from observations. This is an inverse problem—we observe outputs (data) and want to determine inputs (parameters).
Consider these concrete scenarios:
| Scenario | Data Observed | Unknown Parameter(s) | True Distribution |
|---|---|---|---|
| Biased coin | Sequence of H/T | Probability p of heads | Bernoulli(p) |
| Sensor readings | Noisy measurements | True value μ, noise σ² | Gaussian(μ, σ²) |
| Customer arrivals | Time between customers | Rate parameter λ | Exponential(λ) |
| Spam classification | Email features x, labels y | Weights w | Logistic model |
| Image generation | Pixel values | Neural network weights θ | Generative model |
Two competing philosophies:
There are two fundamental approaches to parameter estimation:
Frequentist approach: Parameters are fixed (but unknown) constants. We seek a point estimate—a single "best" value chosen by optimizing a criterion computed from the data. MLE falls in this camp.
Bayesian approach: Parameters are random variables with their own probability distributions. We maintain full distributions over parameters, updating our beliefs as data arrives. Maximum A Posteriori (MAP) estimation bridges these views.
This page focuses on the frequentist MLE approach. We'll explore MAP in the next page, where the complementary perspectives become clear.
MLE is simpler conceptually and computationally. It requires no prior assumptions about parameters. Most classical ML algorithms (linear regression, logistic regression, neural networks with cross-entropy loss) are fundamentally MLE in disguise. Understanding MLE deeply prepares you for the Bayesian extensions.
The likelihood function is the mathematical object at the heart of MLE. Its definition is deceptively simple but has profound implications.
Definition:
Given observed data D = {x₁, x₂, ..., xₙ} and a parametric model with parameters θ, the likelihood function is:
$$\mathcal{L}(\theta | \mathbf{D}) = P(\mathbf{D} | \theta)$$
This reads as: "the likelihood of parameters θ given data D equals the probability of observing data D given parameters θ."
Critical distinction: Probability vs. Likelihood
This is where many learners stumble. The mathematical formula is the same, but the interpretation differs: probability fixes the parameters θ and asks how likely different datasets are—P(D | θ) as a function of D—while likelihood fixes the observed data D and asks how well different parameter values explain it—L(θ | D) as a function of θ. In probability, the data varies; in likelihood, the parameters vary.
The i.i.d. assumption:
When data points are independent and identically distributed (i.i.d.), the joint probability factorizes into a product:
$$\mathcal{L}(\theta | \mathbf{D}) = P(x_1, x_2, ..., x_n | \theta) = \prod_{i=1}^{n} P(x_i | \theta)$$
This is the key computational enabler. Without independence, we'd need to model complex dependencies between all data points—often intractable. The i.i.d. assumption turns an n-dimensional problem into n one-dimensional problems multiplied together.
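As a tiny illustration (a minimal sketch; the data values and the N(0, 1) model are made up for demonstration), the joint likelihood of i.i.d. points is just the product of per-point densities—equivalently, the exponential of the summed log-densities:

```python
import numpy as np
from scipy import stats

# Hypothetical example: three observations assumed i.i.d. N(0, 1)
x = np.array([0.5, -1.2, 0.3])
mu, sigma = 0.0, 1.0

# Joint likelihood under independence: product of per-point densities
per_point = stats.norm.pdf(x, loc=mu, scale=sigma)
joint_likelihood = np.prod(per_point)

# Equivalent computation in log-space (foreshadowing the next section)
log_likelihood = np.sum(stats.norm.logpdf(x, loc=mu, scale=sigma))

print(joint_likelihood)        # product of individual densities
print(np.exp(log_likelihood))  # matches the product above
```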
Example: Bernoulli likelihood
Consider n coin flips with unknown probability p of heads. If we observe k heads and (n-k) tails:
$$\mathcal{L}(p | k, n) = \binom{n}{k} p^k (1-p)^{n-k}$$
Compare candidate values p = 0.5, 0.6, and 0.7 with n = 100 and k = 67 (a quick computation appears below). The likelihood at p = 0.7 is vastly higher than at p = 0.5—quantifying our intuition that p ≈ 0.67 is more plausible than p = 0.5.
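A quick check of these likelihood values (a minimal sketch using scipy's binomial PMF):

```python
import numpy as np
from scipy import stats

n, k = 100, 67
for p in (0.5, 0.6, 0.7):
    likelihood = stats.binom.pmf(k, n, p)  # C(n, k) p^k (1-p)^(n-k)
    print(f"p = {p:.1f}: L = {likelihood:.3e}")
# The likelihood peaks near p = k/n = 0.67; p = 0.5 is orders of magnitude less likely.
```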
A common error is treating L(θ|D) as a probability distribution over θ. It's not! The likelihood doesn't integrate to 1 over θ, and we cannot make statements like 'the probability that θ = 0.7 is 50%' using MLE alone. This is a key limitation that MAP addresses.
With the likelihood function defined, the MLE principle is elegantly simple:
MLE Principle:
Choose the parameter value θ that maximizes the likelihood of the observed data:
$$\hat{\theta}_{MLE} = \arg\max_{\theta} \mathcal{L}(\theta | \mathbf{D}) = \arg\max_{\theta} \prod_{i=1}^{n} P(x_i | \theta)$$
The hat notation (θ̂) denotes an estimate—our best guess based on data, as opposed to the true (unknowable) parameter θ*.
Intuition:
If we must pick a single value for θ, we should pick the one that makes the observed data most probable. Any other choice would imply we think the data we actually saw was less likely than data we didn't see—a philosophically awkward position.
The optimization challenge:
Finding the maximum of L(θ) requires calculus. We seek:
$$\frac{\partial \mathcal{L}(\theta)}{\partial \theta} = 0$$
However, differentiating products is messy. This motivates the log-likelihood transformation.
The Log-Likelihood Transformation:
The logarithm is a monotonically increasing function: if a > b, then log(a) > log(b). This means maximizing L(θ) is equivalent to maximizing log L(θ):
$$\hat{\theta}_{MLE} = \arg\max_{\theta} \mathcal{L}(\theta) = \arg\max_{\theta} \ell(\theta)$$
where ℓ(θ) = log L(θ) is the log-likelihood.
Why log-likelihood is computationally superior:
```python
import numpy as np

# Numerical instability with raw likelihood
def compute_likelihood_naive(data, theta):
    """Compute likelihood directly - NUMERICALLY UNSTABLE"""
    likelihood = 1.0
    for x in data:
        likelihood *= bernoulli_pmf(x, theta)  # Products of small numbers
    return likelihood  # Will underflow to 0 for large n!

# Numerical stability with log-likelihood
def compute_log_likelihood(data, theta):
    """Compute log-likelihood - NUMERICALLY STABLE"""
    log_likelihood = 0.0
    for x in data:
        log_likelihood += np.log(bernoulli_pmf(x, theta))
    return log_likelihood  # Never underflows!

def bernoulli_pmf(x, p):
    """Bernoulli probability mass function"""
    return p if x == 1 else (1 - p)

# Demonstration
np.random.seed(42)
n = 1000  # 1000 coin flips
true_p = 0.7
data = np.random.binomial(1, true_p, size=n)

# Naive likelihood computation - FAILS
naive_likelihood = compute_likelihood_naive(data, 0.7)
print(f"Naive likelihood: {naive_likelihood}")  # Will print 0.0 (underflow!)

# Log-likelihood computation - WORKS
log_likelihood = compute_log_likelihood(data, 0.7)
print(f"Log-likelihood: {log_likelihood:.4f}")  # Stable result

# We can recover likelihood via exp if needed (for small n)
# But typically we just work in log-space
```

Let's derive the MLE for the simplest non-trivial case: estimating the bias of a coin. This example illustrates the complete MLE workflow.
Problem setup: we observe n independent coin flips x₁, x₂, ..., xₙ, each xᵢ ∈ {0, 1} (1 = heads), modeled as i.i.d. Bernoulli(p) with unknown probability of heads p.
Step 1: Write the likelihood function
$$\mathcal{L}(p | \mathbf{D}) = \prod_{i=1}^{n} P(x_i | p) = \prod_{i=1}^{n} p^{x_i}(1-p)^{1-x_i}$$
Simplifying (let k = Σxᵢ = number of heads):
$$\mathcal{L}(p) = p^k (1-p)^{n-k}$$
Step 2: Transform to log-likelihood
$$\ell(p) = \log \mathcal{L}(p) = k \log p + (n-k) \log(1-p)$$
Step 3: Differentiate and set to zero
$$\frac{d\ell}{dp} = \frac{k}{p} - \frac{n-k}{1-p} = 0$$
Step 4: Solve for p
Multiplying through: $$k(1-p) = (n-k)p$$ $$k - kp = np - kp$$ $$k = np$$ $$\boxed{\hat{p}_{MLE} = \frac{k}{n}}$$
The MLE for a Bernoulli parameter is simply the sample proportion!
This matches intuition perfectly: if you flip a coin 100 times and see 67 heads, your best estimate for P(heads) is 67/100 = 0.67.
Step 5: Verify it's a maximum (not minimum)
We should check the second derivative:
$$\frac{d^2\ell}{dp^2} = -\frac{k}{p^2} - \frac{n-k}{(1-p)^2}$$
For p ∈ (0, 1) with k > 0 and k < n, this is always negative, confirming we have a maximum.
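If you prefer to double-check the algebra symbolically, here is a minimal sketch using sympy (a verification aid, not part of the derivation above):

```python
import sympy as sp

p, k, n = sp.symbols('p k n', positive=True)
ell = k * sp.log(p) + (n - k) * sp.log(1 - p)   # Bernoulli log-likelihood

# Stationary point of dl/dp
critical = sp.solve(sp.diff(ell, p), p)
print(critical)  # [k/n]

# Second derivative: negative on (0, 1) whenever 0 < k < n
second = sp.simplify(sp.diff(ell, p, 2))
print(second)    # e.g. -k/p**2 - (n - k)/(1 - p)**2 (possibly rearranged by sympy)
```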
Edge cases: if k = 0 (no heads observed), the MLE is p̂ = 0; if k = n (all heads), the MLE is p̂ = 1. In either case the model declares the unobserved outcome impossible.
These extreme estimates illustrate a weakness of MLE with limited data—there's no regularization pulling the estimate toward reasonable values. MAP estimation addresses this.
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize_scalar

def bernoulli_log_likelihood(p, data):
    """
    Compute log-likelihood for Bernoulli distribution.

    Parameters:
        p: probability parameter (0 < p < 1)
        data: array of 0s and 1s

    Returns:
        log-likelihood value
    """
    if p <= 0 or p >= 1:
        return -np.inf  # Log of 0 is -infinity
    k = np.sum(data)
    n = len(data)
    return k * np.log(p) + (n - k) * np.log(1 - p)

def bernoulli_mle_closed_form(data):
    """Closed-form MLE for Bernoulli parameter."""
    return np.mean(data)

def bernoulli_mle_numerical(data):
    """Numerical optimization for MLE (for verification)."""
    result = minimize_scalar(
        lambda p: -bernoulli_log_likelihood(p, data),
        bounds=(1e-10, 1 - 1e-10),
        method='bounded'
    )
    return result.x

# Generate synthetic data
np.random.seed(42)
true_p = 0.7
n = 100
data = np.random.binomial(1, true_p, size=n)

# Compute MLE both ways
mle_closed = bernoulli_mle_closed_form(data)
mle_numerical = bernoulli_mle_numerical(data)

print(f"True parameter: {true_p}")
print(f"Observed proportion: {np.mean(data):.4f}")
print(f"MLE (closed-form): {mle_closed:.4f}")
print(f"MLE (numerical): {mle_numerical:.4f}")

# Visualize the log-likelihood function
p_range = np.linspace(0.01, 0.99, 1000)
log_likelihoods = [bernoulli_log_likelihood(p, data) for p in p_range]

plt.figure(figsize=(10, 6))
plt.plot(p_range, log_likelihoods, 'b-', linewidth=2)
plt.axvline(x=mle_closed, color='r', linestyle='--', label=f'MLE = {mle_closed:.3f}')
plt.axvline(x=true_p, color='g', linestyle=':', label=f'True p = {true_p}')
plt.xlabel('Parameter p', fontsize=12)
plt.ylabel('Log-Likelihood ℓ(p)', fontsize=12)
plt.title('Log-Likelihood Function for Bernoulli MLE', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
```

The Gaussian (Normal) distribution is perhaps the most important in all of statistics and machine learning. Let's derive the MLE for both parameters: mean μ and variance σ².
Problem setup: we observe n i.i.d. samples x₁, x₂, ..., xₙ drawn from N(μ, σ²), with both the mean μ and the variance σ² unknown.
Step 1: Write the likelihood function
The Gaussian PDF is:
$$P(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
For n i.i.d. observations:
$$\mathcal{L}(\mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)$$
$$= \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2\right)$$
Step 2: Transform to log-likelihood
$$\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2$$
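As a quick sanity check on this expression (a minimal sketch; the data values and candidate parameters are arbitrary), it should agree with the sum of per-point log-densities from scipy:

```python
import numpy as np
from scipy import stats

x = np.array([1.2, 0.7, 2.1, 1.5])  # arbitrary example data
mu, sigma2 = 1.0, 0.5               # arbitrary candidate parameters
n = len(x)

# Closed-form log-likelihood from the derivation above
ell = (-n / 2 * np.log(2 * np.pi)
       - n / 2 * np.log(sigma2)
       - np.sum((x - mu) ** 2) / (2 * sigma2))

# Reference: sum of per-point Gaussian log-densities
ell_ref = np.sum(stats.norm.logpdf(x, loc=mu, scale=np.sqrt(sigma2)))

print(np.isclose(ell, ell_ref))  # True
```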
Step 3: Differentiate with respect to μ
$$\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^n(x_i - \mu) = 0$$
$$\sum_{i=1}^n x_i - n\mu = 0$$
$$\boxed{\hat{\mu}_{MLE} = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x}}$$
The MLE for the Gaussian mean is the sample mean!
Step 4: Differentiate with respect to σ²
Let τ = σ² for clarity:
$$\frac{\partial \ell}{\partial \tau} = -\frac{n}{2\tau} + \frac{1}{2\tau^2}\sum_{i=1}^n(x_i-\mu)^2 = 0$$
Substituting μ̂ = x̄:
$$\frac{n}{2\tau} = \frac{1}{2\tau^2}\sum_{i=1}^n(x_i-\bar{x})^2$$
$$\boxed{\hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2}$$
The MLE for variance is the sample variance with n in the denominator.
Important note on bias:
The MLE estimator σ̂²ₘₗₑ is biased! Its expected value is:
$$E[\hat{\sigma}^2_{MLE}] = \frac{n-1}{n}\sigma^2 \neq \sigma^2$$
This is why the unbiased sample variance uses (n-1) in the denominator (Bessel's correction). We'll explore bias in depth on Page 3.
```python
import numpy as np
from scipy import stats

def gaussian_mle(data):
    """
    Compute MLE for Gaussian distribution parameters.

    Parameters:
        data: array of observations

    Returns:
        mu_mle: MLE for mean
        sigma2_mle: MLE for variance (biased)
        sigma2_unbiased: unbiased variance estimator
    """
    n = len(data)
    mu_mle = np.mean(data)
    sigma2_mle = np.mean((data - mu_mle) ** 2)  # Biased (divides by n)
    sigma2_unbiased = np.var(data, ddof=1)      # Unbiased (divides by n-1)
    return mu_mle, sigma2_mle, sigma2_unbiased

# Verify MLE bias through simulation
np.random.seed(42)
true_mu = 5.0
true_sigma2 = 4.0  # σ = 2
n_samples = 20
n_simulations = 10000

mle_variances = []
unbiased_variances = []

for _ in range(n_simulations):
    data = np.random.normal(true_mu, np.sqrt(true_sigma2), size=n_samples)
    _, sigma2_mle, sigma2_unbiased = gaussian_mle(data)
    mle_variances.append(sigma2_mle)
    unbiased_variances.append(sigma2_unbiased)

print(f"True variance: {true_sigma2}")
print(f"Expected MLE variance (biased): {(n_samples-1)/n_samples * true_sigma2:.4f}")
print(f"Mean of MLE estimates: {np.mean(mle_variances):.4f}")
print(f"Mean of unbiased estimates: {np.mean(unbiased_variances):.4f}")

# The MLE systematically underestimates variance
# But bias decreases with sample size: (n-1)/n → 1 as n → ∞

# Practical example: Fitting a Gaussian to real data
sample_data = np.array([2.3, 3.1, 2.8, 4.2, 3.5, 3.0, 2.9, 3.8, 3.2, 3.4])
mu_hat, sigma2_hat_mle, sigma2_hat_unbiased = gaussian_mle(sample_data)

print(f"\nFitted Gaussian parameters:")
print(f"  μ̂ = {mu_hat:.4f}")
print(f"  σ̂² (MLE, biased) = {sigma2_hat_mle:.4f}")
print(f"  σ̂² (unbiased) = {sigma2_hat_unbiased:.4f}")

# Compare with scipy's implementation
mu_scipy, sigma_scipy = stats.norm.fit(sample_data)
print(f"  scipy's fit: μ = {mu_scipy:.4f}, σ = {sigma_scipy:.4f}")
print(f"  scipy's σ²: {sigma_scipy**2:.4f}")  # scipy uses MLE (biased)
```

When you minimize squared error in linear regression, you're performing MLE under the assumption of Gaussian noise! The OLS solution is exactly the MLE solution for y = Xβ + ε where ε ~ N(0, σ²I). This connection explains why squared loss is so fundamental—it's not arbitrary, but emerges from a probabilistic model.
The Exponential distribution models waiting times and is fundamental in reliability analysis, queueing theory, and survival analysis.
Problem setup: we observe n i.i.d. waiting times x₁, x₂, ..., xₙ > 0, modeled as Exponential(λ) with unknown rate parameter λ.
The Exponential PDF:
$$P(x | \lambda) = \lambda e^{-\lambda x}, \quad x > 0$$
The mean of this distribution is 1/λ.
Step 1: Write the log-likelihood
$$\ell(\lambda) = \sum_{i=1}^n \log(\lambda e^{-\lambda x_i}) = n\log\lambda - \lambda\sum_{i=1}^n x_i$$
Step 2: Differentiate and solve
$$\frac{d\ell}{d\lambda} = \frac{n}{\lambda} - \sum_{i=1}^n x_i = 0$$
$$\boxed{\hat{\lambda}_{MLE} = \frac{n}{\sum_{i=1}^n x_i} = \frac{1}{\bar{x}}}$$
The MLE for the rate is the reciprocal of the sample mean!
Intuition: If you observe average waiting times of 5 minutes, your best estimate for the rate is 1/5 = 0.2 events per minute.
Verify it's a maximum:
$$\frac{d^2\ell}{d\lambda^2} = -\frac{n}{\lambda^2} < 0 \text{ for all } \lambda > 0 \checkmark$$
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def exponential_mle(data):
    """
    Compute MLE for Exponential distribution rate parameter.

    Parameters:
        data: array of positive observations (waiting times)

    Returns:
        lambda_mle: MLE for rate parameter
    """
    return 1.0 / np.mean(data)

# Real-world example: Customer service waiting times
np.random.seed(42)
true_lambda = 0.5  # True rate: 0.5 customers per minute (mean wait = 2 min)
n = 50

# Simulate waiting times
waiting_times = np.random.exponential(scale=1/true_lambda, size=n)

# Compute MLE
lambda_hat = exponential_mle(waiting_times)
mean_wait = np.mean(waiting_times)

print(f"True rate parameter: λ = {true_lambda}")
print(f"True mean waiting time: {1/true_lambda} minutes")
print(f"Observed mean waiting time: {mean_wait:.3f} minutes")
print(f"MLE rate estimate: λ̂ = {lambda_hat:.4f}")

# Visualize the fit
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Histogram with fitted distribution
x = np.linspace(0, max(waiting_times) * 1.1, 200)
axes[0].hist(waiting_times, bins=15, density=True, alpha=0.7,
             label='Observed data', color='steelblue')
axes[0].plot(x, stats.expon.pdf(x, scale=1/lambda_hat), 'r-', lw=2,
             label=f'MLE fit (λ̂={lambda_hat:.3f})')
axes[0].plot(x, stats.expon.pdf(x, scale=1/true_lambda), 'g--', lw=2,
             label=f'True distribution (λ={true_lambda})')
axes[0].set_xlabel('Waiting Time (minutes)')
axes[0].set_ylabel('Density')
axes[0].set_title('Exponential Distribution: Data vs MLE Fit')
axes[0].legend()

# Right: Log-likelihood function
lambda_range = np.linspace(0.1, 1.5, 200)
log_likelihoods = [len(waiting_times) * np.log(lam) - lam * np.sum(waiting_times)
                   for lam in lambda_range]

axes[1].plot(lambda_range, log_likelihoods, 'b-', lw=2)
axes[1].axvline(x=lambda_hat, color='r', linestyle='--', label=f'MLE = {lambda_hat:.3f}')
axes[1].axvline(x=true_lambda, color='g', linestyle=':', label=f'True λ = {true_lambda}')
axes[1].set_xlabel('Rate Parameter λ')
axes[1].set_ylabel('Log-Likelihood ℓ(λ)')
axes[1].set_title('Log-Likelihood Function')
axes[1].legend()

plt.tight_layout()
plt.show()
```

The examples above—Bernoulli, Gaussian, Exponential—are relatively simple. But MLE's true power emerges when applied to complex models. Nearly every supervised learning algorithm can be understood as MLE under an appropriate probabilistic interpretation.
Linear Regression as MLE:
Consider the model y = w^T x + ε where ε ~ N(0, σ²). The likelihood of observing output y given input x and weights w is:
$$P(y | \mathbf{x}, \mathbf{w}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - \mathbf{w}^T\mathbf{x})^2}{2\sigma^2}\right)$$
The negative log-likelihood (NLL) for n observations:
$$-\ell(\mathbf{w}) = \frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^n(y_i - \mathbf{w}^T\mathbf{x}_i)^2$$
Minimizing NLL is equivalent to minimizing Mean Squared Error! The OLS solution IS the MLE.
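To make this concrete, here is a minimal sketch with synthetic data (the data-generating weights and noise level are made up): minimizing the Gaussian negative log-likelihood numerically recovers, up to optimizer tolerance, the same weights as the closed-form least-squares solution.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic regression data (made up for illustration)
rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.3, size=n)

# Gaussian negative log-likelihood in w (sigma^2 fixed; it only rescales the objective)
def nll(w, sigma2=1.0):
    resid = y - X @ w
    return 0.5 * n * np.log(2 * np.pi * sigma2) + resid @ resid / (2 * sigma2)

w_mle = minimize(nll, x0=np.zeros(d)).x          # numerical MLE
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)    # closed-form OLS

print(np.allclose(w_mle, w_ols, atol=1e-4))      # True: the two solutions coincide
```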
Logistic Regression as MLE:
For binary classification, we model P(y=1|x,w) = σ(w^T x) where σ is the sigmoid function.
The likelihood for one observation: $$P(y | \mathbf{x}, \mathbf{w}) = \sigma(\mathbf{w}^T\mathbf{x})^y (1 - \sigma(\mathbf{w}^T\mathbf{x}))^{1-y}$$
The negative log-likelihood is the Binary Cross-Entropy Loss: $$-\ell(\mathbf{w}) = -\sum_{i=1}^n [y_i \log \sigma(\mathbf{w}^T\mathbf{x}_i) + (1-y_i) \log(1 - \sigma(\mathbf{w}^T\mathbf{x}_i))]$$
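A quick numerical check of this identity—a minimal sketch with made-up features, weights, and labels; it assumes scikit-learn is available for the reference log_loss computation:

```python
import numpy as np
from sklearn.metrics import log_loss

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Made-up features, weights, and labels, purely for illustration
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
w = np.array([0.8, -1.2])
p = sigmoid(X @ w)                       # model's predicted P(y=1 | x, w)
y = (rng.random(100) < p).astype(float)  # labels sampled from that model

# Negative log-likelihood of the Bernoulli model, written out directly
nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Binary cross-entropy as reported by scikit-learn (summed over samples)
bce = log_loss(y, p, normalize=False)

print(np.isclose(nll, bce))  # True: the two quantities coincide
```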
Neural Networks as MLE:
Deep learning with cross-entropy loss is MLE under a multinomial (softmax) output distribution. The network learns to maximize the likelihood of correct class probabilities.
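For instance, here is a minimal sketch (toy logits and labels, made up for illustration) showing that categorical cross-entropy is just the average negative log-probability a softmax model assigns to the true classes:

```python
import numpy as np

# Toy "network output": logits for 4 classes on 3 examples (values made up)
logits = np.array([[ 2.0, 0.5, -1.0,  0.1],
                   [ 0.2, 1.3,  0.4, -0.7],
                   [-0.5, 0.0,  2.2,  1.1]])
y = np.array([0, 1, 2])  # true class indices

# Softmax probabilities (stable form: subtract the row-wise max)
z = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)

# Cross-entropy loss = average negative log-likelihood of the true classes;
# minimizing this over network weights is MLE under a categorical output model
nll = -np.mean(np.log(probs[np.arange(len(y)), y]))
print(nll)
```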
| Algorithm | Probabilistic Model | MLE Objective | Standard Loss Name |
|---|---|---|---|
| Linear Regression | y = w^Tx + ε, ε~N(0,σ²) | Minimize NLL | Mean Squared Error |
| Logistic Regression | P(y\|x) = σ(w^Tx) | Minimize NLL | Binary Cross-Entropy |
| Softmax Classifier | P(y\|x) = softmax(Wx) | Minimize NLL | Categorical Cross-Entropy |
| Poisson Regression | P(y\|x) = Poisson(exp(w^Tx)) | Minimize NLL | Poisson Deviance |
| Gaussian Mixture Models | P(x) = Σₖ πₖ N(μₖ,Σₖ) | Maximize Likelihood (EM) | Log-Likelihood |
| Neural Language Models | P(wₜ\|w₁...wₜ₋₁) | Minimize NLL | Cross-Entropy / Perplexity |
Understanding MLE provides a unified lens through which to view machine learning. Loss functions aren't arbitrary choices—they emerge from probabilistic assumptions. When you understand the underlying distributions, you can derive appropriate losses for any problem, customize them for domain-specific needs, and understand when standard losses might fail.
MLE isn't just convenient—it has remarkable theoretical properties that make it the method of choice for many estimation problems. These properties explain why MLE dominates practical machine learning.
Key Asymptotic Properties (as n → ∞):
- Consistency: θ̂_MLE converges in probability to the true parameter θ*.
- Asymptotic normality: √n(θ̂_MLE − θ*) converges in distribution to N(0, I(θ*)⁻¹), which enables approximate confidence intervals.
- Asymptotic efficiency: the MLE attains the lowest variance achievable by any consistent estimator (the Cramér-Rao bound below).
- Invariance: if θ̂ is the MLE of θ, then g(θ̂) is the MLE of g(θ) for any function g.
The Fisher Information:
The Fisher Information measures how much the likelihood function curves around the true parameter—essentially, how much information the data provides about θ:
$$I(\theta) = -E\left[\frac{\partial^2 \log P(X|\theta)}{\partial \theta^2}\right] = E\left[\left(\frac{\partial \log P(X|\theta)}{\partial \theta}\right)^2\right]$$
High Fisher Information → sharply peaked likelihood → precise estimates are possible.
Low Fisher Information → flat likelihood → estimation is difficult.
The Cramér-Rao Lower Bound:
For any unbiased estimator θ̂:
$$\text{Var}(\hat{\theta}) \geq \frac{1}{nI(\theta)}$$
MLE asymptotically achieves this bound—it extracts the maximum possible information from data.
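A quick simulation makes this concrete—a minimal sketch for the Bernoulli case, where I(p) = 1/(p(1−p)) and the MLE p̂ = k/n in fact attains the bound even at finite n:

```python
import numpy as np

# Empirical check for the Bernoulli model, where I(p) = 1 / (p(1 - p))
rng = np.random.default_rng(0)
p_true, n, n_sims = 0.7, 200, 20000

# Sampling distribution of the MLE p_hat = k / n
p_hats = rng.binomial(n, p_true, size=n_sims) / n

crlb = p_true * (1 - p_true) / n  # 1 / (n * I(p))
print(f"Empirical Var(p_hat): {p_hats.var():.6f}")
print(f"Cramér-Rao bound:     {crlb:.6f}")
# The two agree closely: for this model the MLE achieves the bound.
```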
These properties are asymptotic—they hold as n → ∞. In finite samples, MLE can be biased (as we saw with Gaussian variance), sensitive to outliers, prone to overfitting without regularization, and may converge to local optima in non-convex problems. Understanding these limitations motivates the study of MAP estimation and regularization techniques.
Moving from theory to practice requires attention to several computational and numerical details:
1. Numerical Optimization:
For complex models without closed-form solutions, we use iterative optimization: gradient descent and stochastic gradient descent (the workhorse of deep learning), Newton-Raphson and quasi-Newton methods such as L-BFGS when second-order information is affordable, and the Expectation-Maximization (EM) algorithm for latent-variable models such as Gaussian mixtures. A generic workflow is sketched below.
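The sketch below illustrates the workflow with scipy.optimize (a minimal sketch: the Gaussian model is used only so the result can be checked against its closed form, and the log-σ reparameterization is one common way to keep σ positive during unconstrained optimization):

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.5, size=500)  # synthetic data

# Negative log-likelihood; parameterize sigma through log_sigma so it stays positive
def neg_log_likelihood(params):
    mu, log_sigma = params
    return -np.sum(stats.norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), method='L-BFGS-B')
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])

print(f"Numerical MLE:  mu = {mu_hat:.4f}, sigma = {sigma_hat:.4f}")
print(f"Closed-form:    mu = {data.mean():.4f}, sigma = {data.std():.4f}")  # std() with ddof=0 is the MLE
```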
2. Numerical Stability:
```python
import numpy as np

# ========================================
# Log-Sum-Exp Trick for Numerical Stability
# ========================================

def log_sum_exp_naive(x):
    """UNSTABLE: Will overflow for large values"""
    return np.log(np.sum(np.exp(x)))

def log_sum_exp_stable(x):
    """STABLE: Subtracts maximum before exp"""
    max_x = np.max(x)
    return max_x + np.log(np.sum(np.exp(x - max_x)))

# Demonstration
large_values = np.array([1000, 1001, 1002])
print(f"Naive log-sum-exp: {log_sum_exp_naive(large_values)}")    # inf!
print(f"Stable log-sum-exp: {log_sum_exp_stable(large_values)}")  # Correct

# ========================================
# Safe Log Probability
# ========================================

def safe_log(x, eps=1e-10):
    """Prevent log(0) by clipping to minimum value"""
    return np.log(np.clip(x, eps, 1.0))

def binary_cross_entropy_naive(y_true, y_pred):
    """UNSTABLE: Can produce -inf when y_pred = 0 or 1"""
    return -np.mean(y_true * np.log(y_pred) +
                    (1 - y_true) * np.log(1 - y_pred))

def binary_cross_entropy_stable(y_true, y_pred, eps=1e-15):
    """STABLE: Clips predictions before log"""
    y_pred_clipped = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(
        y_true * np.log(y_pred_clipped) +
        (1 - y_true) * np.log(1 - y_pred_clipped)
    )

# Demonstration with edge case
y_true = np.array([1, 0, 1])
y_pred = np.array([1.0, 0.0, 0.99])  # Edge cases: exact 0 and 1

print(f"\nNaive BCE: {binary_cross_entropy_naive(y_true, y_pred)}")      # -inf or nan
print(f"Stable BCE: {binary_cross_entropy_stable(y_true, y_pred):.6f}")  # Finite

# ========================================
# Softmax with Temperature
# ========================================

def softmax_naive(logits):
    """UNSTABLE: Overflows for large logits"""
    exp_logits = np.exp(logits)
    return exp_logits / np.sum(exp_logits)

def softmax_stable(logits, temperature=1.0):
    """STABLE: Subtracts max and supports temperature scaling"""
    scaled_logits = logits / temperature
    shifted = scaled_logits - np.max(scaled_logits)
    exp_shifted = np.exp(shifted)
    return exp_shifted / np.sum(exp_shifted)

logits = np.array([1000, 1001, 1002])
print(f"\nNaive softmax: {softmax_naive(logits)}")   # [nan, nan, nan]
print(f"Stable softmax: {softmax_stable(logits)}")   # Correct probabilities
```

We've covered the foundational theory and practice of Maximum Likelihood Estimation. Let's consolidate the key insights:
- The likelihood L(θ|D) = P(D|θ) treats the data as fixed and the parameters as variable; MLE picks the θ that makes the observed data most probable.
- Working in log-space turns products into sums, simplifies the calculus, and avoids numerical underflow.
- For classic distributions, MLE yields intuitive closed forms: the sample proportion (Bernoulli), the sample mean and biased sample variance (Gaussian), and the reciprocal of the sample mean (Exponential).
- Standard loss functions—squared error, binary and categorical cross-entropy—are negative log-likelihoods under specific probabilistic models.
- MLE is consistent, asymptotically normal, and asymptotically efficient, but in finite samples it can be biased and prone to overfitting without regularization.
Looking ahead:
MLE treats parameters as fixed unknowns and seeks point estimates. But what if we want to express uncertainty about parameters? What if we have prior beliefs we want to incorporate? The next page introduces Maximum A Posteriori (MAP) estimation, which bridges MLE with Bayesian thinking and provides the foundation for regularization in machine learning.
You've mastered the foundations of Maximum Likelihood Estimation—the most important parameter estimation method in machine learning. You can now derive MLEs for common distributions, understand why standard loss functions take their forms, implement numerically stable MLE computations, and recognize the theoretical guarantees and limitations of this approach.