In the previous module, we explored the logistic regression model—how the sigmoid function transforms linear combinations of features into probabilities, how the log-odds provide a natural interpretation, and how decision boundaries emerge from the model structure.
But we left a crucial question unanswered: How do we find the optimal model parameters?
In linear regression, we derived elegant closed-form solutions: the ordinary least squares estimator $\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ minimizes squared error and equals the maximum likelihood estimator under Gaussian noise. The mathematics aligned beautifully.
Logistic regression presents a fundamentally different challenge. The response is binary—not continuous. The errors follow a Bernoulli distribution—not Gaussian. The optimization landscape is curved in ways that preclude simple matrix algebra solutions. Yet the principle remains the same: we seek parameters that make the observed data most probable.
This principle—Maximum Likelihood Estimation—forms the theoretical backbone of modern statistical learning. Understanding it deeply is essential not just for logistic regression, but for neural networks, probabilistic graphical models, and virtually every parametric model in machine learning.
By the end of this page, you will deeply understand the likelihood function for logistic regression—how it arises from the Bernoulli probability model, what it means geometrically and statistically, why maximizing likelihood produces good classifiers, and how the likelihood connects to concepts you already know from linear regression. This foundation is essential for understanding modern machine learning.
Let us formally establish the probabilistic foundations of logistic regression. Recall that we model the conditional probability of the positive class:
$$P(Y = 1 | \mathbf{x}; \boldsymbol{\theta}) = \sigma(\boldsymbol{\theta}^T \mathbf{x}) = \frac{1}{1 + e^{-\boldsymbol{\theta}^T \mathbf{x}}}$$
where:
- $\mathbf{x} \in \mathbb{R}^{p+1}$ is the feature vector (with a leading 1 for the intercept),
- $\boldsymbol{\theta} \in \mathbb{R}^{p+1}$ is the vector of unknown parameters,
- $\sigma(\cdot)$ is the sigmoid (logistic) function.
The Bernoulli Distribution Connection:
Given the features $\mathbf{x}$, the response $Y$ follows a Bernoulli distribution with success probability $\pi(\mathbf{x}) = \sigma(\boldsymbol{\theta}^T \mathbf{x})$:
$$Y | \mathbf{x} \sim \text{Bernoulli}(\pi(\mathbf{x}))$$
The probability mass function is:
$$P(Y = y | \mathbf{x}; \boldsymbol{\theta}) = \pi(\mathbf{x})^y (1 - \pi(\mathbf{x}))^{1-y}$$
This compact expression unifies both cases: when $y = 1$ it reduces to $\pi(\mathbf{x})$, and when $y = 0$ it reduces to $1 - \pi(\mathbf{x})$.
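A two-line check (a sketch; the helper name is just for illustration) confirms that the single formula covers both cases:

```python
def bernoulli_pmf(y, pi):
    """P(Y = y) = pi^y * (1 - pi)^(1 - y) for y in {0, 1}."""
    return pi ** y * (1 - pi) ** (1 - y)

pi = 0.8                      # example success probability
print(bernoulli_pmf(1, pi))   # 0.8   -> reduces to pi when y = 1
print(bernoulli_pmf(0, pi))   # ~0.2  -> reduces to 1 - pi when y = 0
```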
Different sources use different parameter notations: $\boldsymbol{\theta}$, $\boldsymbol{\beta}$, or $\mathbf{w}$. We use $\boldsymbol{\theta}$ to emphasize that these are unknown parameters to be estimated from data. The notation $P(Y = y | \mathbf{x}; \boldsymbol{\theta})$ indicates $\boldsymbol{\theta}$ is a fixed (unknown) parameter, not a random variable being conditioned on.
Key Properties of the Bernoulli Model:
- The conditional mean is $E[Y \mid \mathbf{x}] = \pi(\mathbf{x})$, so the model directly predicts the expected label.
- The conditional variance is $\text{Var}(Y \mid \mathbf{x}) = \pi(\mathbf{x})(1 - \pi(\mathbf{x}))$, which depends on the mean, unlike the constant-variance Gaussian model of linear regression.
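As a quick sanity check (a small simulation sketch, not required for the derivation), we can verify these two moments empirically:

```python
import numpy as np

rng = np.random.default_rng(0)
pi = 0.3                            # success probability
y = rng.random(100_000) < pi        # 100,000 Bernoulli(pi) draws

print(f"Empirical mean:     {y.mean():.3f}  (theory: {pi})")
print(f"Empirical variance: {y.var():.3f}  (theory: {pi * (1 - pi):.3f})")
```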
Before constructing the likelihood function for logistic regression, let us deeply understand what "likelihood" means in statistics.
Likelihood vs. Probability — The Critical Distinction:
Both likelihood and probability involve the same mathematical formula, but they answer fundamentally different questions:
$$\mathcal{L}(\boldsymbol{\theta}; \text{data}) = P(\text{data}; \boldsymbol{\theta})$$
| Concept | Fixed | Variable | Question Answered |
|---|---|---|---|
| Probability | Parameters $\boldsymbol{\theta}$ | Data | "Given these parameters, how probable is this data?" |
| Likelihood | Data | Parameters $\boldsymbol{\theta}$ | "Given this data, how plausible are these parameters?" |
Intuitive Example:
Suppose you flip a coin 10 times and observe 7 heads.
The probability calculation is: $$P(7 \text{ heads} | \theta) = \binom{10}{7} \theta^7 (1-\theta)^3$$
The same formula, viewed as a function of $\theta$ with data fixed, gives the likelihood: $$\mathcal{L}(\theta; 7 \text{ heads}) = \binom{10}{7} \theta^7 (1-\theta)^3$$
```python
import numpy as np
from scipy.special import comb


def likelihood_coin(theta, n_heads, n_total):
    """
    Likelihood function for a sequence of coin flips.

    L(θ; data) = C(n, k) * θ^k * (1-θ)^(n-k),  where k = n_heads, n = n_total
    """
    n_tails = n_total - n_heads
    # Binomial coefficient can be ignored for MLE (doesn't depend on θ)
    return comb(n_total, n_heads) * (theta ** n_heads) * ((1 - theta) ** n_tails)


# Observed data: 7 heads out of 10 flips
n_heads = 7
n_total = 10

# Compute likelihood for various θ values
theta_values = np.linspace(0.01, 0.99, 1000)
likelihoods = [likelihood_coin(theta, n_heads, n_total) for theta in theta_values]

# Find maximum likelihood estimate
mle_theta = theta_values[np.argmax(likelihoods)]
print(f"Maximum Likelihood Estimate: θ̂ = {mle_theta:.3f}")
print(f"Observed proportion: {n_heads/n_total:.3f}")

# Note: MLE = sample proportion (for Bernoulli/Binomial)
# This is not a coincidence—it's a fundamental result!
```

The Maximum Likelihood Principle:
Given observed data, we estimate parameters by choosing values that maximize the likelihood—the values under which the observed data would be most probable:
$$\hat{\boldsymbol{\theta}}_{\text{MLE}} = \underset{\boldsymbol{\theta}}{\arg\max} \; \mathcal{L}(\boldsymbol{\theta}; \text{data})$$
This is intuitive: we prefer parameter values that render our observations unsurprising. If parameters $\boldsymbol{\theta}_A$ would make the data extremely unlikely, while $\boldsymbol{\theta}_B$ would make it probable, we favor $\boldsymbol{\theta}_B$.
Why Maximum Likelihood?
Beyond its intuitive appeal, MLE has strong statistical guarantees: under mild regularity conditions the estimator is consistent (it converges to the true parameters as $n \to \infty$), asymptotically efficient (no unbiased estimator achieves lower asymptotic variance), asymptotically normal (enabling confidence intervals and hypothesis tests), and invariant to reparameterization.
MLE properties are asymptotic—they hold as sample size grows. For small samples, MLE can be biased and may overfit. This is why regularization (which modifies the likelihood) becomes important in practice. We'll explore regularized logistic regression in a later module.
Now we construct the likelihood function for logistic regression. Suppose we have $n$ independent observations:
$$\{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}$$
where each $\mathbf{x}_i \in \mathbb{R}^{p+1}$ (including intercept) and each $y_i \in \{0, 1\}$.
Independence Assumption:
We assume observations are conditionally independent given the features. This means the probability of observing the entire dataset is the product of individual observation probabilities:
$$P(y_1, y_2, \ldots, y_n | \mathbf{x}_1, \ldots, \mathbf{x}_n; \boldsymbol{\theta}) = \prod_{i=1}^{n} P(y_i | \mathbf{x}_i; \boldsymbol{\theta})$$
The Likelihood Function:
Substituting the Bernoulli probability mass function:
$$\mathcal{L}(\boldsymbol{\theta}) = \prod_{i=1}^{n} \pi_i^{y_i} (1 - \pi_i)^{1-y_i}$$
where we use shorthand $\pi_i = \pi(\mathbf{x}_i) = \sigma(\boldsymbol{\theta}^T \mathbf{x}_i)$.
Expanding this expression, the product splits into one factor per positive example and one per negative example:
$$\mathcal{L}(\boldsymbol{\theta}) = \prod_{i:\, y_i = 1} \pi_i \prod_{i:\, y_i = 0} (1 - \pi_i)$$
For a concrete dataset, consider what this product looks like:
| Observation | $y_i$ | Contribution to Likelihood |
|---|---|---|
| $i=1$ | 1 | $\pi_1$ |
| $i=2$ | 0 | $1 - \pi_2$ |
| $i=3$ | 1 | $\pi_3$ |
| $i=4$ | 1 | $\pi_4$ |
| $i=5$ | 0 | $1 - \pi_5$ |
| ... | ... | ... |
The total likelihood is: $$\mathcal{L}(\boldsymbol{\theta}) = \pi_1 \cdot (1-\pi_2) \cdot \pi_3 \cdot \pi_4 \cdot (1-\pi_5) \cdot \ldots$$
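To make the product concrete, here is a small sketch with made-up probabilities, following the same pattern of labels as the table above:

```python
import numpy as np

# Hypothetical predicted probabilities pi_i and observed labels y_i
pi = np.array([0.90, 0.20, 0.75, 0.60, 0.15])   # model's P(Y = 1 | x_i)
y  = np.array([1,    0,    1,    1,    0])      # observed labels

# Per-observation contribution: pi_i if y_i = 1, else 1 - pi_i
contributions = np.where(y == 1, pi, 1 - pi)
likelihood = np.prod(contributions)

print("Contributions:", contributions)     # [0.9  0.8  0.75 0.6  0.85]
print(f"Likelihood:    {likelihood:.4f}")  # 0.9 * 0.8 * 0.75 * 0.6 * 0.85 ≈ 0.2754
```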
Interpretation:
The likelihood measures how well our model "explains" the observed labels:
- For observations with $y_i = 1$, the factor is $\pi_i$: the model is rewarded for assigning high probability to the positive class.
- For observations with $y_i = 0$, the factor is $1 - \pi_i$: the model is rewarded for assigning low probability to the positive class.
Good parameters $\boldsymbol{\theta}$ assign high probability to positive examples and low probability to negative examples—exactly what a good classifier should do.
The likelihood is a product of probabilities, each between 0 and 1. For large datasets, this product becomes astronomically small—so small that it underflows floating-point representation. A dataset with 1000 observations might have likelihood $\approx 10^{-400}$, far below machine precision. This is why we work with the log-likelihood instead, which we'll develop in the next page.
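A tiny demonstration of this underflow (a sketch using a constant per-observation contribution of 0.4, which roughly matches the $10^{-400}$ example above):

```python
import numpy as np

# 1000 observations, each contributing a factor of 0.4 to the likelihood
contributions = np.full(1000, 0.4)

print(np.prod(contributions))         # 0.0      -- the true value (~1e-398) underflows
print(np.sum(np.log(contributions)))  # ≈ -916.3 -- the log-likelihood is perfectly representable
```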
The Likelihood as a Function of Parameters:
It's crucial to view the likelihood as a function of $\boldsymbol{\theta}$. Substituting $\pi_i = \sigma(\boldsymbol{\theta}^T \mathbf{x}_i)$:
$$\mathcal{L}(\boldsymbol{\theta}) = \prod_{i=1}^{n} \left[\sigma(\boldsymbol{\theta}^T \mathbf{x}_i)\right]^{y_i} \left[1 - \sigma(\boldsymbol{\theta}^T \mathbf{x}_i)\right]^{1-y_i}$$
This is a nonlinear function of $\boldsymbol{\theta}$ due to the sigmoid. The product structure and the nonlinearity are what make logistic regression fundamentally different from linear regression.
Using the Sigmoid Property:
Recall that the sigmoid has a beautiful symmetry: $1 - \sigma(z) = \sigma(-z)$. This allows us to write:
$$\mathcal{L}(\boldsymbol{\theta}) = \prod_{i=1}^{n} \sigma(\boldsymbol{\theta}^T \mathbf{x}_i)^{y_i} \cdot \sigma(-\boldsymbol{\theta}^T \mathbf{x}_i)^{1-y_i}$$
This symmetric form hints at deeper structure we'll exploit in optimization.
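A one-line numerical check of this symmetry (sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)
print(np.allclose(1 - sigmoid(z), sigmoid(-z)))  # True
```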
To build intuition, let's visualize the likelihood surface for a simple one-dimensional example (one feature plus intercept, so $\boldsymbol{\theta} = (\theta_0, \theta_1)^T$).
Visualization Setup:
Consider a small dataset where we vary the parameters and compute the likelihood at each point. The likelihood surface reveals the optimization landscape.
```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (enables the 3D projection)


def sigmoid(z):
    """Numerically stable sigmoid function."""
    return np.where(z >= 0,
                    1 / (1 + np.exp(-z)),
                    np.exp(z) / (1 + np.exp(z)))


def compute_likelihood(theta0, theta1, X, y):
    """
    Compute the logistic regression likelihood
        L(θ) = ∏ π_i^y_i (1-π_i)^(1-y_i)
    on the log scale to avoid underflow.
    """
    z = theta0 + theta1 * X
    pi = sigmoid(z)
    # Clip to avoid log(0)
    pi = np.clip(pi, 1e-15, 1 - 1e-15)
    # Log-likelihood (sum of logs instead of product)
    log_lik = np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))
    return log_lik


# Generate a simple dataset
np.random.seed(42)
n = 50

# True parameters
theta0_true = -1.0
theta1_true = 2.0

# Generate features and labels
X = np.random.randn(n)
z = theta0_true + theta1_true * X
prob = sigmoid(z)
y = (np.random.rand(n) < prob).astype(float)

# Create a grid of parameter values
theta0_range = np.linspace(-4, 2, 100)
theta1_range = np.linspace(-1, 5, 100)
T0, T1 = np.meshgrid(theta0_range, theta1_range)

# Compute the log-likelihood at each grid point
log_lik_surface = np.zeros_like(T0)
for i in range(len(theta0_range)):
    for j in range(len(theta1_range)):
        log_lik_surface[j, i] = compute_likelihood(
            theta0_range[i], theta1_range[j], X, y
        )

# Find the MLE (maximum of the surface)
max_idx = np.unravel_index(np.argmax(log_lik_surface), log_lik_surface.shape)
mle_theta0 = theta0_range[max_idx[1]]
mle_theta1 = theta1_range[max_idx[0]]

print(f"True parameters: θ₀ = {theta0_true}, θ₁ = {theta1_true}")
print(f"MLE estimates: θ̂₀ = {mle_theta0:.2f}, θ̂₁ = {mle_theta1:.2f}")
print(f"Maximum log-likelihood: {log_lik_surface[max_idx]:.2f}")

# Visualize the log-likelihood surface (3D view of the grid T0, T1)
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection="3d")
ax.plot_surface(T0, T1, log_lik_surface, cmap="viridis", alpha=0.9)
ax.set_xlabel("θ₀")
ax.set_ylabel("θ₁")
ax.set_zlabel("log-likelihood")
ax.set_title("Log-likelihood surface for logistic regression")
plt.show()
```

What the Likelihood Surface Reveals:
The surface has a single peak: the log-likelihood is concave, so there are no local maxima to trap an optimizer, and the grid-search maximum lands close to the true parameters that generated the data (up to sampling noise from the finite dataset).
The concavity of the log-likelihood is perhaps the most important property for optimization. It guarantees that gradient descent will find the global maximum, that Newton's method converges, and that the optimization landscape has no pathological features. This concavity emerges from the properties of the sigmoid and Bernoulli distribution—it's not an accident but a deep structural property.
Understanding how individual observations contribute to the likelihood provides deep insight into the learning process.
Decomposing the Likelihood:
Each observation $(\mathbf{x}_i, y_i)$ contributes a factor to the likelihood:
$$\mathcal{L}_i(\boldsymbol{\theta}) = \pi_i^{y_i} (1-\pi_i)^{1-y_i}$$
The total likelihood is the product: $\mathcal{L}(\boldsymbol{\theta}) = \prod_{i=1}^n \mathcal{L}_i(\boldsymbol{\theta})$
How Different Observations Affect the Likelihood:
| Observation Type | Model Prediction $\pi_i$ | Likelihood Contribution | Effect |
|---|---|---|---|
| $y_i = 1$, correctly classified | High (e.g., 0.95) | $0.95$ | Factor near 1; barely reduces the likelihood |
| $y_i = 1$, misclassified | Low (e.g., 0.05) | $0.05$ | Factor near 0; severely shrinks the likelihood |
| $y_i = 0$, correctly classified | Low (e.g., 0.05) | $0.95$ | Factor near 1; barely reduces the likelihood |
| $y_i = 0$, misclassified | High (e.g., 0.95) | $0.05$ | Factor near 0; severely shrinks the likelihood |
| Near decision boundary | ~0.5 | ~0.5 | Factor around 0.5 regardless of label |
The Multiplicative Effect:
Because likelihood uses multiplication, a single misclassified example can dramatically reduce the total likelihood. If one observation has $\mathcal{L}_i = 0.001$, the entire dataset's likelihood is reduced by a factor of 1000.
Consider two parameter settings:
Setting A: 100 observations, 99 fit well with contribution $0.9$ each and one fit terribly with contribution $0.01$:
$$\mathcal{L}_A = 0.9^{99} \times 0.01 \approx 3.0 \times 10^{-7}$$
Setting B: 100 observations, all fit only moderately with contribution $0.7$ each:
$$\mathcal{L}_B = 0.7^{100} \approx 3.2 \times 10^{-16}$$
Wait—Setting B is worse! But numbers this small are hard to compare directly, which is why we use the log-likelihood:
$$\log \mathcal{L}_A = 99\log(0.9) + \log(0.01) \approx -15.0, \qquad \log \mathcal{L}_B = 100\log(0.7) \approx -35.7$$
Setting A is actually better despite the poor fit on one observation.
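A quick numerical check of this comparison (a sketch reproducing the numbers above):

```python
import numpy as np

# Setting A: 99 observations fit well (0.9 each), one fits terribly (0.01)
log_lik_A = 99 * np.log(0.9) + np.log(0.01)
# Setting B: all 100 observations fit moderately (0.7 each)
log_lik_B = 100 * np.log(0.7)

print(f"Setting A: likelihood ≈ {np.exp(log_lik_A):.1e}, log-likelihood ≈ {log_lik_A:.1f}")
print(f"Setting B: likelihood ≈ {np.exp(log_lik_B):.1e}, log-likelihood ≈ {log_lik_B:.1f}")
# Setting A wins: -15.0 > -35.7
```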
The multiplicative nature of likelihood makes it sensitive to extreme observations. A single mislabeled example or outlier can have outsized influence. This sensitivity motivates robust estimation techniques and careful data preprocessing. In later modules, we'll see how regularization helps address this.
Gradient Intuition:
When optimizing, the gradient of the likelihood (or log-likelihood) tells us how to adjust parameters. Observations that are confidently correct provide less gradient signal—they're already well-handled. Observations that are misclassified or uncertain provide strong gradient signal—they're where the model needs to improve.
This is fundamentally different from squared error, where all observations contribute proportionally to their error magnitude. The logistic likelihood naturally focuses optimization effort on the observations that matter most for classification.
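To make this concrete, here is a small sketch; it uses the per-observation gradient factor $(y_i - \pi_i)$ (the gradient of the log-likelihood for observation $i$ is $(y_i - \pi_i)\mathbf{x}_i$, a result we will derive formally when we differentiate the log-likelihood):

```python
# Compare the scalar gradient factor (y_i - pi_i) across observation types
cases = [
    ("y=1, confident correct", 1, 0.95),
    ("y=1, misclassified",     1, 0.05),
    ("y=0, confident correct", 0, 0.05),
    ("y=0, misclassified",     0, 0.95),
    ("near decision boundary", 1, 0.50),
]
for name, y, pi in cases:
    # Small factor -> little pressure to change parameters; large factor -> strong pressure
    print(f"{name:25s} gradient factor y - pi = {y - pi:+.2f}")
```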
The likelihood function has deep connections to information theory, revealing why Maximum Likelihood Estimation produces good classifiers.
Likelihood and Coding Theory:
From an information-theoretic perspective, the likelihood measures how "surprised" we would be to see the data under a given model. High likelihood means low surprise—the model "expected" this data.
The negative log-likelihood is directly related to the codelength required to transmit the data using a code based on the model's probability distribution:
$$\text{Codelength} = -\log_2 \mathcal{L}(\boldsymbol{\theta}) = -\sum_{i=1}^n \log_2 P(y_i | \mathbf{x}_i; \boldsymbol{\theta})$$
Minimizing negative log-likelihood (equivalently, maximizing likelihood) corresponds to finding the model that most efficiently "compresses" the observed labels.
Cross-Entropy Connection:
The cross-entropy between the true label distribution and model predictions is:
$$H(p, q) = -\sum_y p(y) \log q(y)$$
For our binary case with empirical distribution:
$$H(y, \pi) = -\frac{1}{n}\sum_{i=1}^n \left[ y_i \log \pi_i + (1-y_i) \log(1-\pi_i) \right]$$
This is exactly the negative log-likelihood divided by $n$! The cross-entropy loss used in neural network training is directly derived from the likelihood principle.
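A minimal numerical check of this identity (a sketch with hypothetical labels and predictions):

```python
import numpy as np

y  = np.array([1, 0, 1, 1, 0], dtype=float)
pi = np.array([0.9, 0.2, 0.7, 0.6, 0.1])   # hypothetical model predictions

neg_log_lik   = -np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))
cross_entropy = -np.mean(y * np.log(pi) + (1 - y) * np.log(1 - pi))

print(np.isclose(cross_entropy, neg_log_lik / len(y)))  # True
```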
When training neural networks for classification, we use cross-entropy loss. Now you understand why: cross-entropy loss is the negative log-likelihood of the predictions under a Bernoulli (or categorical) model. It's not an arbitrary choice—it's the principled choice arising from maximum likelihood estimation.
Logistic regression fits within the broader framework of Generalized Linear Models (GLMs), which are based on the exponential family of distributions. This perspective illuminates deep structural properties.
Exponential Family Form:
A distribution belongs to the exponential family if its probability density/mass function can be written as:
$$p(y; \eta) = h(y) \exp\left( \eta \cdot T(y) - A(\eta) \right)$$
where:
- $\eta$ is the natural (canonical) parameter,
- $T(y)$ is the sufficient statistic,
- $A(\eta)$ is the log-partition function that normalizes the distribution,
- $h(y)$ is the base measure.
Bernoulli as Exponential Family:
For Bernoulli$(\pi)$:
$$P(Y = y) = \pi^y (1-\pi)^{1-y}$$
Taking the logarithm and rearranging:
$$\log P(Y = y) = y \log\pi + (1-y)\log(1-\pi) = y\log\frac{\pi}{1-\pi} + \log(1-\pi)$$
Therefore:
- The natural parameter is $\eta = \log\frac{\pi}{1-\pi}$, the log-odds.
- The sufficient statistic is $T(y) = y$.
- The log-partition function is $A(\eta) = -\log(1-\pi) = \log(1 + e^{\eta})$.
- The base measure is $h(y) = 1$.
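A quick numerical sanity check (sketch) that these identifications reproduce the Bernoulli pmf:

```python
import numpy as np

pi = 0.7
eta = np.log(pi / (1 - pi))    # natural parameter: the log-odds
A = np.log(1 + np.exp(eta))    # log-partition function, equal to -log(1 - pi)

for y in (0, 1):
    exp_family = np.exp(eta * y - A)              # h(y) = 1, T(y) = y
    bernoulli  = pi ** y * (1 - pi) ** (1 - y)
    print(y, np.isclose(exp_family, bernoulli))   # True for both values of y
```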
Why the Exponential Family Matters:
The exponential family structure has profound implications:
- The log-likelihood is concave in the natural parameter, which is exactly what makes logistic regression's optimization well behaved.
- The sufficient statistic $T(y) = y$ means the data enter the likelihood only through simple sums such as $\sum_i y_i \mathbf{x}_i$.
- All GLMs built on exponential-family distributions (linear, logistic, Poisson regression, and others) share a common fitting framework, iteratively reweighted least squares.
The exponential family perspective reveals why log-odds are natural: they are the canonical parameter $\eta$ of the Bernoulli distribution. Logistic regression models $\eta = \boldsymbol{\theta}^T\mathbf{x}$ directly, which is why it's called the "canonical link" in GLM terminology. This choice ensures desirable properties like concave log-likelihood.
We have established the theoretical foundation for Maximum Likelihood Estimation in logistic regression. Let's consolidate the key concepts before moving to the log-likelihood derivation:
- The response follows a Bernoulli distribution whose success probability is the sigmoid of a linear predictor.
- The likelihood is the probability of the observed labels, viewed as a function of the parameters $\boldsymbol{\theta}$.
- Under conditional independence, it is a product of per-observation factors $\pi_i^{y_i}(1-\pi_i)^{1-y_i}$.
- Maximizing the likelihood rewards confident, correct predictions and heavily penalizes confident mistakes.
- The same quantity reappears as cross-entropy in information theory and as the canonical-link GLM built on the Bernoulli exponential family.
What's Next:
The likelihood as written involves a product of many terms—computationally problematic and difficult to differentiate. In the next page, we'll transform to log-likelihood, which converts products to sums, eliminates numerical underflow, and reveals the beautiful concave structure that makes optimization tractable. We'll derive the log-likelihood explicitly and prove its concavity—the property that guarantees global optimality of gradient-based solutions.
You now understand the likelihood function for logistic regression: how it arises from the Bernoulli probability model, what it means to maximize likelihood, how individual observations contribute, and how likelihood connects to information theory and the exponential family. Next, we develop the log-likelihood formulation essential for practical optimization.