In the previous module, we explored the logistic regression model—how the sigmoid function transforms linear combinations of features into probabilities, how the log-odds provide a natural interpretation, and how decision boundaries emerge from the model structure.
But we left a crucial question unanswered: How do we find the optimal model parameters?
In linear regression, we derived elegant closed-form solutions: the ordinary least squares estimator $\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ minimizes squared error and equals the maximum likelihood estimator under Gaussian noise. The mathematics aligned beautifully.
Logistic regression presents a fundamentally different challenge. The response is binary—not continuous. The errors follow a Bernoulli distribution—not Gaussian. The optimization landscape is curved in ways that preclude simple matrix algebra solutions. Yet the principle remains the same: we seek parameters that make the observed data most probable.
This principle—Maximum Likelihood Estimation—forms the theoretical backbone of modern statistical learning. Understanding it deeply is essential not just for logistic regression, but for neural networks, probabilistic graphical models, and virtually every parametric model in machine learning.
By the end of this page, you will deeply understand the likelihood function for logistic regression—how it arises from the Bernoulli probability model, what it means geometrically and statistically, why maximizing likelihood produces good classifiers, and how the likelihood connects to concepts you already know from linear regression. This foundation is essential for understanding modern machine learning.
Let us formally establish the probabilistic foundations of logistic regression. Recall that we model the conditional probability of the positive class:
$$P(Y = 1 | \mathbf{x}; \boldsymbol{\theta}) = \sigma(\boldsymbol{\theta}^T \mathbf{x}) = \frac{1}{1 + e^{-\boldsymbol{\theta}^T \mathbf{x}}}$$
where:
- $\mathbf{x} \in \mathbb{R}^{p+1}$ is the feature vector (with a leading 1 for the intercept),
- $\boldsymbol{\theta} \in \mathbb{R}^{p+1}$ is the vector of unknown parameters,
- $\sigma(\cdot)$ is the sigmoid (logistic) function.
The Bernoulli Distribution Connection:
Given the features $\mathbf{x}$, the response $Y$ follows a Bernoulli distribution with success probability $\pi(\mathbf{x}) = \sigma(\boldsymbol{\theta}^T \mathbf{x})$:
$$Y | \mathbf{x} \sim \text{Bernoulli}(\pi(\mathbf{x}))$$
The probability mass function is:
$$P(Y = y | \mathbf{x}; \boldsymbol{\theta}) = \pi(\mathbf{x})^y (1 - \pi(\mathbf{x}))^{1-y}$$
This compact expression unifies both cases: when $y = 1$ it reduces to $\pi(\mathbf{x})$, and when $y = 0$ it reduces to $1 - \pi(\mathbf{x})$.
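A two-line check (a sketch; the helper name is just for illustration) confirms that the single formula covers both cases:

```python
def bernoulli_pmf(y, pi):
    """P(Y = y) = pi^y * (1 - pi)^(1 - y) for y in {0, 1}."""
    return pi ** y * (1 - pi) ** (1 - y)

pi = 0.8                      # example success probability
print(bernoulli_pmf(1, pi))   # 0.8   -> reduces to pi when y = 1
print(bernoulli_pmf(0, pi))   # ~0.2  -> reduces to 1 - pi when y = 0
```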
Different sources use different parameter notations: $\boldsymbol{\theta}$, $\boldsymbol{\beta}$, or $\mathbf{w}$. We use $\boldsymbol{\theta}$ to emphasize that these are unknown parameters to be estimated from data. The notation $P(Y = y | \mathbf{x}; \boldsymbol{\theta})$ indicates $\boldsymbol{\theta}$ is a fixed (unknown) parameter, not a random variable being conditioned on.
Key Properties of the Bernoulli Model:
- The conditional mean is $E[Y \mid \mathbf{x}] = \pi(\mathbf{x})$, so the model directly predicts the expected label.
- The conditional variance is $\text{Var}(Y \mid \mathbf{x}) = \pi(\mathbf{x})(1 - \pi(\mathbf{x}))$, which depends on the mean, unlike the constant-variance Gaussian model of linear regression.
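As a quick sanity check (a small simulation sketch, not required for the derivation), we can verify these two moments empirically:

```python
import numpy as np

rng = np.random.default_rng(0)
pi = 0.3                            # success probability
y = rng.random(100_000) < pi        # 100,000 Bernoulli(pi) draws

print(f"Empirical mean:     {y.mean():.3f}  (theory: {pi})")
print(f"Empirical variance: {y.var():.3f}  (theory: {pi * (1 - pi):.3f})")
```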
Before constructing the likelihood function for logistic regression, let us deeply understand what "likelihood" means in statistics.
Likelihood vs. Probability — The Critical Distinction:
Both likelihood and probability involve the same mathematical formula, but they answer fundamentally different questions:
$$\mathcal{L}(\boldsymbol{\theta}; \text{data}) = P(\text{data}; \boldsymbol{\theta})$$
| Concept | Fixed | Variable | Question Answered |
|---|---|---|---|
| Probability | Parameters $\boldsymbol{\theta}$ | Data | "Given these parameters, how probable is this data?" |
| Likelihood | Data | Parameters $\boldsymbol{\theta}$ | "Given this data, how plausible are these parameters?" |
Intuitive Example:
Suppose you flip a coin 10 times and observe 7 heads.
The probability calculation is: $$P(7 \text{ heads} | \theta) = \binom{10}{7} \theta^7 (1-\theta)^3$$
The same formula, viewed as a function of $\theta$ with data fixed, gives the likelihood: $$\mathcal{L}(\theta; 7 \text{ heads}) = \binom{10}{7} \theta^7 (1-\theta)^3$$
```python
import numpy as np
from scipy.special import comb


def likelihood_coin(theta, n_heads, n_total):
    """
    Likelihood function for a sequence of coin flips.

    L(θ; data) = C(n, k) * θ^k * (1-θ)^(n-k),  where k = n_heads, n = n_total
    """
    n_tails = n_total - n_heads
    # Binomial coefficient can be ignored for MLE (doesn't depend on θ)
    return comb(n_total, n_heads) * (theta ** n_heads) * ((1 - theta) ** n_tails)


# Observed data: 7 heads out of 10 flips
n_heads = 7
n_total = 10

# Compute likelihood for various θ values
theta_values = np.linspace(0.01, 0.99, 1000)
likelihoods = [likelihood_coin(theta, n_heads, n_total) for theta in theta_values]

# Find maximum likelihood estimate
mle_theta = theta_values[np.argmax(likelihoods)]
print(f"Maximum Likelihood Estimate: θ̂ = {mle_theta:.3f}")
print(f"Observed proportion: {n_heads/n_total:.3f}")

# Note: MLE = sample proportion (for Bernoulli/Binomial)
# This is not a coincidence—it's a fundamental result!
```

The Maximum Likelihood Principle:
Given observed data, we estimate parameters by choosing values that maximize the likelihood—the values under which the observed data would be most probable:
$$\hat{\boldsymbol{\theta}}_{\text{MLE}} = \underset{\boldsymbol{\theta}}{\arg\max} \; \mathcal{L}(\boldsymbol{\theta}; \text{data})$$
This is intuitive: we prefer parameter values that render our observations unsurprising. If parameters $\boldsymbol{\theta}_A$ would make the data extremely unlikely, while $\boldsymbol{\theta}_B$ would make it probable, we favor $\boldsymbol{\theta}_B$.
Why Maximum Likelihood?
Beyond its intuitive appeal, MLE has strong statistical guarantees: under mild regularity conditions the estimator is consistent (it converges to the true parameters as $n \to \infty$), asymptotically efficient (no unbiased estimator achieves lower asymptotic variance), asymptotically normal (enabling confidence intervals and hypothesis tests), and invariant to reparameterization.
MLE properties are asymptotic—they hold as sample size grows. For small samples, MLE can be biased and may overfit. This is why regularization (which modifies the likelihood) becomes important in practice. We'll explore regularized logistic regression in a later module.
Now we construct the likelihood function for logistic regression. Suppose we have $n$ independent observations:
$$\{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}$$
where each $\mathbf{x}_i \in \mathbb{R}^{p+1}$ (including intercept) and each $y_i \in \{0, 1\}$.
Independence Assumption:
We assume observations are conditionally independent given the features. This means the probability of observing the entire dataset is the product of individual observation probabilities:
$$P(y_1, y_2, \ldots, y_n | \mathbf{x}_1, \ldots, \mathbf{x}_n; \boldsymbol{\theta}) = \prod_{i=1}^{n} P(y_i | \mathbf{x}_i; \boldsymbol{\theta})$$
The Likelihood Function:
Substituting the Bernoulli probability mass function:
$$\mathcal{L}(\boldsymbol{\theta}) = \prod_{i=1}^{n} \pi_i^{y_i} (1 - \pi_i)^{1-y_i}$$
where we use shorthand $\pi_i = \pi(\mathbf{x}_i) = \sigma(\boldsymbol{\theta}^T \mathbf{x}_i)$.
Expanding this expression, the product splits into one factor per positive example and one per negative example:
$$\mathcal{L}(\boldsymbol{\theta}) = \prod_{i:\, y_i = 1} \pi_i \prod_{i:\, y_i = 0} (1 - \pi_i)$$
For a concrete dataset, consider what this product looks like:
| Observation | $y_i$ | Contribution to Likelihood |
|---|---|---|
| $i=1$ | 1 | $\pi_1$ |
| $i=2$ | 0 | $1 - \pi_2$ |
| $i=3$ | 1 | $\pi_3$ |
| $i=4$ | 1 | $\pi_4$ |
| $i=5$ | 0 | $1 - \pi_5$ |
| ... | ... | ... |
The total likelihood is: $$\mathcal{L}(\boldsymbol{\theta}) = \pi_1 \cdot (1-\pi_2) \cdot \pi_3 \cdot \pi_4 \cdot (1-\pi_5) \cdot \ldots$$
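To make the product concrete, here is a small sketch with made-up probabilities, following the same pattern of labels as the table above:

```python
import numpy as np

# Hypothetical predicted probabilities pi_i and observed labels y_i
pi = np.array([0.90, 0.20, 0.75, 0.60, 0.15])   # model's P(Y = 1 | x_i)
y  = np.array([1,    0,    1,    1,    0])      # observed labels

# Per-observation contribution: pi_i if y_i = 1, else 1 - pi_i
contributions = np.where(y == 1, pi, 1 - pi)
likelihood = np.prod(contributions)

print("Contributions:", contributions)     # [0.9  0.8  0.75 0.6  0.85]
print(f"Likelihood:    {likelihood:.4f}")  # 0.9 * 0.8 * 0.75 * 0.6 * 0.85 ≈ 0.2754
```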
Interpretation:
The likelihood measures how well our model "explains" the observed labels:
- For observations with $y_i = 1$, the factor is $\pi_i$: the model is rewarded for assigning high probability to the positive class.
- For observations with $y_i = 0$, the factor is $1 - \pi_i$: the model is rewarded for assigning low probability to the positive class.
Good parameters $\boldsymbol{\theta}$ assign high probability to positive examples and low probability to negative examples—exactly what a good classifier should do.
The likelihood is a product of probabilities, each between 0 and 1. For large datasets, this product becomes astronomically small—so small that it underflows floating-point representation. A dataset with 1000 observations might have likelihood $\approx 10^{-400}$, far below machine precision. This is why we work with the log-likelihood instead, which we'll develop in the next page.
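A tiny demonstration of this underflow (a sketch using a constant per-observation contribution of 0.4, which roughly matches the $10^{-400}$ example above):

```python
import numpy as np

# 1000 observations, each contributing a factor of 0.4 to the likelihood
contributions = np.full(1000, 0.4)

print(np.prod(contributions))         # 0.0      -- the true value (~1e-398) underflows
print(np.sum(np.log(contributions)))  # ≈ -916.3 -- the log-likelihood is perfectly representable
```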
The Likelihood as a Function of Parameters:
It's crucial to view the likelihood as a function of $\boldsymbol{\theta}$. Substituting $\pi_i = \sigma(\boldsymbol{\theta}^T \mathbf{x}_i)$:
$$\mathcal{L}(\boldsymbol{\theta}) = \prod_{i=1}^{n} \left[\sigma(\boldsymbol{\theta}^T \mathbf{x}_i)\right]^{y_i} \left[1 - \sigma(\boldsymbol{\theta}^T \mathbf{x}_i)\right]^{1-y_i}$$
This is a nonlinear function of $\boldsymbol{\theta}$ due to the sigmoid. The product structure and the nonlinearity are what make logistic regression fundamentally different from linear regression.
Using the Sigmoid Property:
Recall that the sigmoid has a beautiful symmetry: $1 - \sigma(z) = \sigma(-z)$. This allows us to write:
$$\mathcal{L}(\boldsymbol{\theta}) = \prod_{i=1}^{n} \sigma(\boldsymbol{\theta}^T \mathbf{x}_i)^{y_i} \cdot \sigma(-\boldsymbol{\theta}^T \mathbf{x}_i)^{1-y_i}$$
This symmetric form hints at deeper structure we'll exploit in optimization.
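A one-line numerical check of this symmetry (sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)
print(np.allclose(1 - sigmoid(z), sigmoid(-z)))  # True
```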
To build intuition, let's visualize the likelihood surface for a simple one-dimensional example (one feature plus intercept, so $\boldsymbol{\theta} = (\theta_0, \theta_1)^T$).
Visualization Setup:
Consider a small dataset where we vary the parameters and compute the likelihood at each point. The likelihood surface reveals the optimization landscape.
```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (enables the 3D projection)


def sigmoid(z):
    """Numerically stable sigmoid function."""
    return np.where(z >= 0,
                    1 / (1 + np.exp(-z)),
                    np.exp(z) / (1 + np.exp(z)))


def compute_likelihood(theta0, theta1, X, y):
    """
    Compute the logistic regression likelihood
        L(θ) = ∏ π_i^y_i (1-π_i)^(1-y_i)
    on the log scale to avoid underflow.
    """
    z = theta0 + theta1 * X
    pi = sigmoid(z)
    # Clip to avoid log(0)
    pi = np.clip(pi, 1e-15, 1 - 1e-15)
    # Log-likelihood (sum of logs instead of product)
    log_lik = np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))
    return log_lik


# Generate a simple dataset
np.random.seed(42)
n = 50

# True parameters
theta0_true = -1.0
theta1_true = 2.0

# Generate features and labels
X = np.random.randn(n)
z = theta0_true + theta1_true * X
prob = sigmoid(z)
y = (np.random.rand(n) < prob).astype(float)

# Create a grid of parameter values
theta0_range = np.linspace(-4, 2, 100)
theta1_range = np.linspace(-1, 5, 100)
T0, T1 = np.meshgrid(theta0_range, theta1_range)

# Compute the log-likelihood at each grid point
log_lik_surface = np.zeros_like(T0)
for i in range(len(theta0_range)):
    for j in range(len(theta1_range)):
        log_lik_surface[j, i] = compute_likelihood(
            theta0_range[i], theta1_range[j], X, y
        )

# Find the MLE (maximum of the surface)
max_idx = np.unravel_index(np.argmax(log_lik_surface), log_lik_surface.shape)
mle_theta0 = theta0_range[max_idx[1]]
mle_theta1 = theta1_range[max_idx[0]]

print(f"True parameters: θ₀ = {theta0_true}, θ₁ = {theta1_true}")
print(f"MLE estimates: θ̂₀ = {mle_theta0:.2f}, θ̂₁ = {mle_theta1:.2f}")
print(f"Maximum log-likelihood: {log_lik_surface[max_idx]:.2f}")

# Visualize the log-likelihood surface (3D view of the grid T0, T1)
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection="3d")
ax.plot_surface(T0, T1, log_lik_surface, cmap="viridis", alpha=0.9)
ax.set_xlabel("θ₀")
ax.set_ylabel("θ₁")
ax.set_zlabel("log-likelihood")
ax.set_title("Log-likelihood surface for logistic regression")
plt.show()
```

What the Likelihood Surface Reveals:
The surface has a single peak: the log-likelihood is concave, so there are no local maxima to trap an optimizer, and the grid-search maximum lands close to the true parameters that generated the data (up to sampling noise from the finite dataset).
The concavity of the log-likelihood is perhaps the most important property for optimization. It guarantees that gradient descent will find the global maximum, that Newton's method converges, and that the optimization landscape has no pathological features. This concavity emerges from the properties of the sigmoid and Bernoulli distribution—it's not an accident but a deep structural property.
Understanding how individual observations contribute to the likelihood provides deep insight into the learning process.
Decomposing the Likelihood:
Each observation $(\mathbf{x}_i, y_i)$ contributes a factor to the likelihood:
$$\mathcal{L}_i(\boldsymbol{\theta}) = \pi_i^{y_i} (1-\pi_i)^{1-y_i}$$
The total likelihood is the product: $\mathcal{L}(\boldsymbol{\theta}) = \prod_{i=1}^n \mathcal{L}_i(\boldsymbol{\theta})$
How Different Observations Affect the Likelihood:
| Observation Type | Model Prediction $\pi_i$ | Likelihood Contribution | Effect |
|---|---|---|---|
| $y_i = 1$, correctly classified | High (e.g., 0.95) | $0.95$ | Factor near 1; barely reduces the likelihood |
| $y_i = 1$, misclassified | Low (e.g., 0.05) | $0.05$ | Factor near 0; severely shrinks the likelihood |
| $y_i = 0$, correctly classified | Low (e.g., 0.05) | $0.95$ | Factor near 1; barely reduces the likelihood |
| $y_i = 0$, misclassified | High (e.g., 0.95) | $0.05$ | Factor near 0; severely shrinks the likelihood |
| Near decision boundary | ~0.5 | ~0.5 | Factor around 0.5 regardless of label |
The Multiplicative Effect:
Because likelihood uses multiplication, a single misclassified example can dramatically reduce the total likelihood. If one observation has $\mathcal{L}_i = 0.001$, the entire dataset's likelihood is reduced by a factor of 1000.
Consider two parameter settings:
Setting A: 100 observations, 99 fit well with contribution $0.9$ each and one fit terribly with contribution $0.01$:
$$\mathcal{L}_A = 0.9^{99} \times 0.01 \approx 3.0 \times 10^{-7}$$
Setting B: 100 observations, all fit only moderately with contribution $0.7$ each:
$$\mathcal{L}_B = 0.7^{100} \approx 3.2 \times 10^{-16}$$
Wait—Setting B is worse! But numbers this small are hard to compare directly, which is why we use the log-likelihood:
$$\log \mathcal{L}_A = 99\log(0.9) + \log(0.01) \approx -15.0, \qquad \log \mathcal{L}_B = 100\log(0.7) \approx -35.7$$
Setting A is actually better despite the poor fit on one observation.
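A quick numerical check of this comparison (a sketch reproducing the numbers above):

```python
import numpy as np

# Setting A: 99 observations fit well (0.9 each), one fits terribly (0.01)
log_lik_A = 99 * np.log(0.9) + np.log(0.01)
# Setting B: all 100 observations fit moderately (0.7 each)
log_lik_B = 100 * np.log(0.7)

print(f"Setting A: likelihood ≈ {np.exp(log_lik_A):.1e}, log-likelihood ≈ {log_lik_A:.1f}")
print(f"Setting B: likelihood ≈ {np.exp(log_lik_B):.1e}, log-likelihood ≈ {log_lik_B:.1f}")
# Setting A wins: -15.0 > -35.7
```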
The multiplicative nature of likelihood makes it sensitive to extreme observations. A single mislabeled example or outlier can have outsized influence. This sensitivity motivates robust estimation techniques and careful data preprocessing. In later modules, we'll see how regularization helps address this.
Gradient Intuition:
When optimizing, the gradient of the likelihood (or log-likelihood) tells us how to adjust parameters. Observations that are confidently correct provide less gradient signal—they're already well-handled. Observations that are misclassified or uncertain provide strong gradient signal—they're where the model needs to improve.
This is fundamentally different from squared error, where all observations contribute proportionally to their error magnitude. The logistic likelihood naturally focuses optimization effort on the observations that matter most for classification.
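To make this concrete, here is a small sketch; it uses the per-observation gradient factor $(y_i - \pi_i)$ (the gradient of the log-likelihood for observation $i$ is $(y_i - \pi_i)\mathbf{x}_i$, a result we will derive formally when we differentiate the log-likelihood):

```python
# Compare the scalar gradient factor (y_i - pi_i) across observation types
cases = [
    ("y=1, confident correct", 1, 0.95),
    ("y=1, misclassified",     1, 0.05),
    ("y=0, confident correct", 0, 0.05),
    ("y=0, misclassified",     0, 0.95),
    ("near decision boundary", 1, 0.50),
]
for name, y, pi in cases:
    # Small factor -> little pressure to change parameters; large factor -> strong pressure
    print(f"{name:25s} gradient factor y - pi = {y - pi:+.2f}")
```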
The likelihood function has deep connections to information theory, revealing why Maximum Likelihood Estimation produces good classifiers.
Likelihood and Coding Theory:
From an information-theoretic perspective, the likelihood measures how "surprised" we would be to see the data under a given model. High likelihood means low surprise—the model "expected" this data.
The negative log-likelihood is directly related to the codelength required to transmit the data using a code based on the model's probability distribution:
$$\text{Codelength} = -\log_2 \mathcal{L}(\boldsymbol{\theta}) = -\sum_{i=1}^n \log_2 P(y_i | \mathbf{x}_i; \boldsymbol{\theta})$$
Minimizing negative log-likelihood (equivalently, maximizing likelihood) corresponds to finding the model that most efficiently "compresses" the observed labels.
Cross-Entropy Connection:
The cross-entropy between the true label distribution and model predictions is:
$$H(p, q) = -\sum_y p(y) \log q(y)$$
For our binary case with empirical distribution:
$$H(y, \pi) = -\frac{1}{n}\sum_{i=1}^n \left[ y_i \log \pi_i + (1-y_i) \log(1-\pi_i) \right]$$
This is exactly the negative log-likelihood divided by $n$! The cross-entropy loss used in neural network training is directly derived from the likelihood principle.
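A minimal numerical check of this identity (a sketch with hypothetical labels and predictions):

```python
import numpy as np

y  = np.array([1, 0, 1, 1, 0], dtype=float)
pi = np.array([0.9, 0.2, 0.7, 0.6, 0.1])   # hypothetical model predictions

neg_log_lik   = -np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi))
cross_entropy = -np.mean(y * np.log(pi) + (1 - y) * np.log(1 - pi))

print(np.isclose(cross_entropy, neg_log_lik / len(y)))  # True
```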
When training neural networks for classification, we use cross-entropy loss. Now you understand why: cross-entropy loss is the negative log-likelihood of the predictions under a Bernoulli (or categorical) model. It's not an arbitrary choice—it's the principled choice arising from maximum likelihood estimation.
Logistic regression fits within the broader framework of Generalized Linear Models (GLMs), which are based on the exponential family of distributions. This perspective illuminates deep structural properties.
Exponential Family Form:
A distribution belongs to the exponential family if its probability density/mass function can be written as:
$$p(y; \eta) = h(y) \exp\left( \eta \cdot T(y) - A(\eta) \right)$$
where:
- $\eta$ is the natural (canonical) parameter,
- $T(y)$ is the sufficient statistic,
- $A(\eta)$ is the log-partition function that normalizes the distribution,
- $h(y)$ is the base measure.
Bernoulli as Exponential Family:
For Bernoulli$(\pi)$:
$$P(Y = y) = \pi^y (1-\pi)^{1-y}$$
Taking the logarithm and rearranging:
$$\log P(Y = y) = y \log\pi + (1-y)\log(1-\pi) = y\log\frac{\pi}{1-\pi} + \log(1-\pi)$$
Therefore:
- The natural parameter is $\eta = \log\frac{\pi}{1-\pi}$, the log-odds.
- The sufficient statistic is $T(y) = y$.
- The log-partition function is $A(\eta) = -\log(1-\pi) = \log(1 + e^{\eta})$.
- The base measure is $h(y) = 1$.
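A quick numerical sanity check (sketch) that these identifications reproduce the Bernoulli pmf:

```python
import numpy as np

pi = 0.7
eta = np.log(pi / (1 - pi))    # natural parameter: the log-odds
A = np.log(1 + np.exp(eta))    # log-partition function, equal to -log(1 - pi)

for y in (0, 1):
    exp_family = np.exp(eta * y - A)              # h(y) = 1, T(y) = y
    bernoulli  = pi ** y * (1 - pi) ** (1 - y)
    print(y, np.isclose(exp_family, bernoulli))   # True for both values of y
```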
Why the Exponential Family Matters:
The exponential family structure has profound implications:
- The log-likelihood is concave in the natural parameter, which is exactly what makes logistic regression's optimization well behaved.
- The sufficient statistic $T(y) = y$ means the data enter the likelihood only through simple sums such as $\sum_i y_i \mathbf{x}_i$.
- All GLMs built on exponential-family distributions (linear, logistic, Poisson regression, and others) share a common fitting framework, iteratively reweighted least squares.
The exponential family perspective reveals why log-odds are natural: they are the canonical parameter $\eta$ of the Bernoulli distribution. Logistic regression models $\eta = \boldsymbol{\theta}^T\mathbf{x}$ directly, which is why it's called the "canonical link" in GLM terminology. This choice ensures desirable properties like concave log-likelihood.
We have established the theoretical foundation for Maximum Likelihood Estimation in logistic regression. Let's consolidate the key concepts before moving to the log-likelihood derivation:
- The response follows a Bernoulli distribution whose success probability is the sigmoid of a linear predictor.
- The likelihood is the probability of the observed labels, viewed as a function of the parameters $\boldsymbol{\theta}$.
- Under conditional independence, it is a product of per-observation factors $\pi_i^{y_i}(1-\pi_i)^{1-y_i}$.
- Maximizing the likelihood rewards confident, correct predictions and heavily penalizes confident mistakes.
- The same quantity reappears as cross-entropy in information theory and as the canonical-link GLM built on the Bernoulli exponential family.
What's Next:
The likelihood as written involves a product of many terms—computationally problematic and difficult to differentiate. In the next page, we'll transform to log-likelihood, which converts products to sums, eliminates numerical underflow, and reveals the beautiful concave structure that makes optimization tractable. We'll derive the log-likelihood explicitly and prove its concavity—the property that guarantees global optimality of gradient-based solutions.
You now understand the likelihood function for logistic regression: how it arises from the Bernoulli probability model, what it means to maximize likelihood, how individual observations contribute, and how likelihood connects to information theory and the exponential family. Next, we develop the log-likelihood formulation essential for practical optimization.